Basic contents
Parsers
Basic elements
Downward traversal of tags
Upward traversal
Sibling traversal
01
The HTML content to be parsed
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
The information in the parsed soup
02
Parsing this document with BeautifulSoup yields a BeautifulSoup object, which can be printed with a standard indented structure:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
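A small aside, not in the course material: prettify() can also be called on a single tag, which helps when inspecting one element of a large page.

print(soup.p.prettify())
# <p class="title">
#  <b>
#   The Dormouse's story
#  </b>
# </p>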
Simple ways to navigate the structured data
03
soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
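Beyond plain tag names, find_all() also filters on attributes, and select() takes CSS selectors. A minimal sketch, not from the course, using the standard Beautiful Soup API:

soup.find_all('a', class_="sister")          # filter by CSS class
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p.story > a#link2")             # CSS selector syntax
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]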
Find the URLs of all the <a> tags in the document:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
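If the link text is needed alongside the URL, each tag also exposes its attributes like a dictionary. A minimal sketch on the same soup:

for link in soup.find_all('a'):
    print(link.string, link['href'])    # anchor text plus href attribute
# Elsie http://example.com/elsie
# Lacie http://example.com/lacie
# Tillie http://example.com/tillie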
Get all of the text content from the document:
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
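A hedged aside, not shown in the course: get_text() also accepts a separator string and a strip flag, which is often easier to post-process than the raw concatenation above.

print(soup.get_text(" ", strip=True))
# same text as above, but each fragment is stripped of surrounding whitespace
# and the pieces are joined with single spaces instead of concatenated as-is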
Passing in a string or a file handle
04
Passing a document to the BeautifulSoup constructor yields a document object; either a string or an open file handle can be passed in.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")
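A small aside, not from the course: when reading from a file it is more idiomatic to let a context manager close the handle (assuming an index.html exists locally):

from bs4 import BeautifulSoup

with open("index.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "html.parser")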
First, the document is converted to Unicode and HTML entities are converted to Unicode characters. Beautiful Soup then picks the most suitable available parser for the document; if a parser is specified manually, Beautiful Soup uses the specified parser instead.
BeautifulSoup("Sacré bleu!")
Sacré bleu!
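The parser is given as the second argument to the constructor. A minimal sketch of specifying one manually (standard Beautiful Soup usage; lxml and html5lib are optional third-party packages):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")   # Python's built-in parser
# soup = BeautifulSoup(html_doc, "lxml")        # faster; requires the lxml package
# soup = BeautifulSoup(html_doc, "html5lib")    # most lenient; requires html5lib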
There are many more methods; only the most basic ones are collected here as a quick reference. For the full set, see the Beautiful Soup manual. The code above comes from the course Python网络爬虫与信息提取 (Python Web Crawling and Information Extraction) on 中国大学MOOC.
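The outline at the top also lists downward, upward and sibling traversal, which are not demonstrated above. A minimal sketch of the standard traversal attributes, using the same html_doc:

# downward traversal: the children of a tag
soup.p.contents                          # list of direct children
[child.name for child in soup.p.children]
# ['b']

# upward traversal: the parents of a tag
soup.title.parent.name
# 'head'
[parent.name for parent in soup.a.parents]
# ['p', 'body', 'html', '[document]']

# sibling traversal: nodes at the same level
soup.a.next_sibling                      # the string ',\n' after the first <a>
soup.a.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>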