Basic contents
Parsers
Basic elements
Downward traversal of tags
Upward traversal
Sibling traversal
01
The HTML content to be parsed
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
The information in the parsed soup
02
Parsing this document with BeautifulSoup yields a BeautifulSoup object, which can be printed with a standard indented structure:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
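A small aside, not in the course material: prettify() can also be called on a single tag, which helps when inspecting one element of a large page.

print(soup.p.prettify())
# <p class="title">
#  <b>
#   The Dormouse's story
#  </b>
# </p>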
Simple ways to navigate the structured data
03
soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
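Beyond plain tag names, find_all() also filters on attributes, and select() takes CSS selectors. A minimal sketch, not from the course, using the standard Beautiful Soup API:

soup.find_all('a', class_="sister")          # filter by CSS class
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p.story > a#link2")             # CSS selector syntax
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]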
Find the URLs of all the <a> tags in the document:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
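If the link text is needed alongside the URL, each tag also exposes its attributes like a dictionary. A minimal sketch on the same soup:

for link in soup.find_all('a'):
    print(link.string, link['href'])    # anchor text plus href attribute
# Elsie http://example.com/elsie
# Lacie http://example.com/lacie
# Tillie http://example.com/tillie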
Get all of the text content from the document:
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
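A hedged aside, not shown in the course: get_text() also accepts a separator string and a strip flag, which is often easier to post-process than the raw concatenation above.

print(soup.get_text(" ", strip=True))
# same text as above, but each fragment is stripped of surrounding whitespace
# and the pieces are joined with single spaces instead of concatenated as-is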
Passing in a string or a file handle
04
Passing a document to the BeautifulSoup constructor yields a document object; either a string or an open file handle can be passed in.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")
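A small aside, not from the course: when reading from a file it is more idiomatic to let a context manager close the handle (assuming an index.html exists locally):

from bs4 import BeautifulSoup

with open("index.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "html.parser")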
First, the document is converted to Unicode and HTML entities are converted to Unicode characters. Beautiful Soup then picks the most suitable available parser for the document; if a parser is specified manually, Beautiful Soup uses the specified parser instead.
BeautifulSoup("Sacré bleu!")
Sacré bleu!
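The parser is given as the second argument to the constructor. A minimal sketch of specifying one manually (standard Beautiful Soup usage; lxml and html5lib are optional third-party packages):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")   # Python's built-in parser
# soup = BeautifulSoup(html_doc, "lxml")        # faster; requires the lxml package
# soup = BeautifulSoup(html_doc, "html5lib")    # most lenient; requires html5lib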
There are many more methods; only the most basic ones are collected here as a quick reference. For the full set, see the Beautiful Soup manual. The code above comes from the course Python网络爬虫与信息提取 (Python Web Crawling and Information Extraction) on 中国大学MOOC.
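The outline at the top also lists downward, upward and sibling traversal, which are not demonstrated above. A minimal sketch of the standard traversal attributes, using the same html_doc:

# downward traversal: the children of a tag
soup.p.contents                          # list of direct children
[child.name for child in soup.p.children]
# ['b']

# upward traversal: the parents of a tag
soup.title.parent.name
# 'head'
[parent.name for parent in soup.a.parents]
# ['p', 'body', 'html', '[document]']

# sibling traversal: nodes at the same level
soup.a.next_sibling                      # the string ',\n' after the first <a>
soup.a.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>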