Parsing HTML using Python

Question

256

I'm looking for an HTML parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects.

If I have a document of the form:

<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>

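then it should give me a way to access the nested tags via the name or id of the HTML tag, so that I can basically ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

If you've used Firefox's "Inspect element" feature (view HTML), you would know that it shows all the tags in a nicely nested manner, like a tree.

I'd prefer a built-in module, but that might be asking a little too much.

I went through a lot of questions on Stack Overflow and a few blogs, and most of them suggest BeautifulSoup, lxml, or HTMLParser, but few of them detail the functionality; they simply end as a debate over which one is faster or more efficient.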

- ffledgling

3

Like the other answerers, I would recommend BeautifulSoup because it is really good at handling broken HTML files. - Pascal Rosin

7 Answers

106

I guess what you're looking for is pyquery:

pyquery: a jQuery-like library for Python.

An example of what you want might look like this:

from pyquery import PyQuery

html = "<div id='id' class='class'>Your HTML code</div>"  # placeholder: put your HTML here
pq = PyQuery(html)
tag = pq('div#id')  # or: tag = pq('div.class')
print(tag.text())

It uses the same selectors as Firefox's or Chrome's inspect element. For example, if the inspected element's selector is 'div#mw-head.noprint', then in pyquery you just need to pass this selector:

pq('div#mw-head.noprint')

- YusuMishi

Very useful for people coming from a jQuery front-end background! - Jay Dadhania

3

Note: this library uses lxml under the hood. - user202729

50

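Here you can read more about the different HTML parsers in Python and their performance. Even though the article is a bit dated, it still gives you a good overview.

Python HTML parser performance

I'd recommend BeautifulSoup even though it isn't built in, simply because it is so easy to work with for these kinds of tasks. For example: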

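from urllib.request import urlopen  # was urllib2 in Python 2
# BeautifulSoup 4; the BeautifulSoup 3 import was `from BeautifulSoup import BeautifulSoup`
from bs4 import BeautifulSoup

page = urlopen('http://www.google.com/')
soup = BeautifulSoup(page, 'html.parser')

# find() returns None when no div matches, so check before reading .text
container = soup.body.find('div', attrs={'class': 'container'})
x = container.text if container is not None else None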

- Qiau

2

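I was looking for something that details features and functionality rather than performance and efficiency. EDIT: Sorry for the premature reply, that link is actually good. Thanks. - ffledgling

The first bullet points more or less summarize the features and functions :) - Qiau

9

If you use BeautifulSoup4 (the latest version): from bs4 import BeautifulSoup - Franck Dernoncourt

The parser performance article has moved (though it is from 2008, so it may be outdated) to: https://ianbicking.org/blog/2008/03/python-html-parser-performance.html - kristianp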

35

Compared to the other parser libraries, lxml is extremely fast, and with cssselect it is also quite easy to use for scraping HTML pages:

from lxml.html import parse

doc = parse('http://www.google.com').getroot()
for link in doc.cssselect('a'):  # requires the cssselect package
    print('%s: %s' % (link.text_content(), link.get('href')))

lxml.html documentation

- Lenar Hoyt

Doesn't support HTTPS. - Sergio

@Sergio Use import requests and save the buffer to a file: https://dev59.com/qWYq5IYBdhLWcg3w5Ui8#14114741 (or use urllib), then load the saved file with parse: doc = parse('localfile.html').getroot(). - Guilherme Nascimento
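
A variation on this workaround that skips the temporary file, as a rough sketch: download the page yourself and hand the bytes to lxml. The helper names extract_links and fetch_links are made up for illustration, and it uses urllib from the standard library in place of requests.

```python
from urllib.request import urlopen

import lxml.html


def extract_links(html_bytes):
    """Return (text, href) pairs for every <a> element in the markup."""
    doc = lxml.html.fromstring(html_bytes)
    # iter("a") walks the tree directly, so cssselect is not needed here.
    return [(a.text_content(), a.get("href")) for a in doc.iter("a")]


def fetch_links(url):
    """Download the page ourselves (HTTPS included), then parse the bytes with lxml."""
    with urlopen(url) as response:
        return extract_links(response.read())


# Offline demonstration on a literal snippet; fetch_links() takes a URL instead.
sample = b"<html><body><a href='/a'>first</a> <a href='/b'>second</a></body></html>"
for text, href in extract_links(sample):
    print("%s: %s" % (text, href))
```

Keeping the download separate from the parsing also makes the parsing step easy to try out on saved or literal HTML.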

3

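I parse huge HTML files for specific data. Doing it with BeautifulSoup took 1.7 seconds, but switching to lxml made it nearly 100 times faster! If you care about performance, lxml is the best option. - Alex-Bogdanov

On the other hand, lxml ships with a 12 MB C extension. Mostly insignificant, but it might matter depending on what you do (in rare cases). - user202729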
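
If the C extension is a problem, or a built-in module is preferred as in the question, the standard library's html.parser can handle the question's example. It only emits events and does not build a tree, so the bookkeeping is manual. A minimal sketch, with ClassTextCollector being a name I made up:

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text found inside elements of a given tag and class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0    # > 0 while we are inside a matching element
        self.chunks = []  # non-empty text pieces, in document order

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Count nested tags of the same name so we know when the match closes.
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


html = """<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>"""

collector = ClassTextCollector("div", "container")
collector.feed(html)
print(collector.chunks)
```

This works for well-nested markup like the example; for broken HTML the libraries suggested in the answers are a safer bet.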

11

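I recommend lxml for parsing HTML. See "Parsing HTML" (on the lxml site).

In my experience, Beautiful Soup messes up on some complex HTML. I believe that is because Beautiful Soup is not a parser, but rather a very good string analyzer.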

- Love and peace - Joe Codeswell

4

AFAIK Beautiful Soup can be made to work with most "backend" XML parsers, and lxml seems to be one of the supported ones. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser. - ffledgling
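
For what it's worth, in BeautifulSoup 4 the backend is chosen by the second argument to the constructor. A small sketch, assuming bs4 is installed; text_by_id is a hypothetical helper, and the lxml branch only runs if lxml is available:

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<div class='container'><div id='class'>Something here</div></div>"


def text_by_id(markup, element_id, parser="html.parser"):
    """Parse markup with the named backend and return the text of the element with that id."""
    soup = BeautifulSoup(markup, parser)
    node = soup.find(id=element_id)
    return node.get_text() if node is not None else None


# "html.parser" is the pure-Python backend from the standard library;
# "lxml" selects the lxml backend when it is installed.
for parser in ("html.parser", "lxml"):
    try:
        print(parser, "->", text_by_id(html, "class", parser))
    except FeatureNotFound:
        # Raised by bs4 when the requested backend is not installed.
        print(parser, "is not installed")
```

The calling code stays the same either way, which makes it easy to start with the built-in backend and switch later if speed matters.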

@ffledgling Some functions of BeautifulSoup are quite sluggish, however. - Lenar Hoyt

2

I would recommend the jusText library:

https://github.com/miso-belica/jusText

Usage (Python 2):

import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print paragraph.text

Python 3:

import requests
import justext

response = requests.get("http://bbc.com/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print(paragraph.text)

- Wesam Na

0

I would use EHP

https://github.com/iogf/ehp

Here it is:

from ehp import *

doc = '''<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>
'''

html = Html()
dom = html.feed(doc)
for ind in dom.find('div', ('class', 'container')):
    print(ind.text())

Output:

Something here
Something else

- Unknown Soldier

8

Please explain. What would you use EHP for over the more popular BeautifulSoup or lxml? - ChaimG

- Aadaam · Accepted Answer

I want it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

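try:
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3
except ImportError:
    from bs4 import BeautifulSoup  # BeautifulSoup 4
# the HTML code from the question, written on fewer lines
html = ("<html><head>Heading</head><body attr1='val1'>"
        "<div class='container'><div id='class'>Something here</div>"
        "<div>Something else</div></div></body></html>")
parsed_html = BeautifulSoup(html)
print(parsed_html.body.find('div', attrs={'class':'container'}).text)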

I guess you don't need performance descriptions; just read how BeautifulSoup works. Have a look at its official documentation.
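
Since the question asks about getting tags as Python lists, dictionaries, and objects, here is a short sketch of what that looks like with BeautifulSoup 4. It assumes bs4 is installed, and I pass "html.parser" explicitly so the tree is built as written; other backends may rearrange this slightly unusual document.

```python
from bs4 import BeautifulSoup

# The sample document from the question.
html = """<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Tags are objects reachable by name; their attributes behave like a dict.
body_attrs = soup.body.attrs
print(body_attrs)

# Look up a single element by id.
by_id = soup.find(id="class")
print(by_id.get_text())

# find_all() returns a list-like result of the nested tags.
container = soup.body.find("div", class_="container")
children = [div.get_text() for div in container.find_all("div")]
print(children)
```

Note that class_ has a trailing underscore because class is a reserved word in Python; the attrs={'class': ...} form used above is equivalent.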