Can we use XPath with BeautifulSoup?

Question

Can we use XPath with BeautifulSoup?

python web-scraping xpath beautifulsoup urllib

158

I am using BeautifulSoup to scrape a URL, and I have the following code to find the td tag whose class is 'empformbody':

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)

soup.findAll('td',attrs={'class':'empformbody'})

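In the code above we can use findAll to get tags and the information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If so, please provide example code.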

- Shiva Krishna Bavandla

10 Answers

198

I can confirm that there is no XPath support within Beautiful Soup.

- Leonard Richardson

131

Note: if you click through to his user profile, you will see that Leonard Richardson is the author of Beautiful Soup. - senshin

40

It would be very convenient to be able to use XPath within BeautifulSoup. - DarthOpto

7

So what is the alternative? - static_rtti

20

It is 2021. Are you still confirming that BeautifulSoup does not support XPath? - mshaffer

63

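As others have said, BeautifulSoup does not support XPath. There are probably a number of ways to get something from an XPath, including using Selenium. However, here is a solution that works in either Python 2 or 3: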

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)

I used this page as a reference.

- wordsforthewise

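One caveat: I have noticed that if there is something outside the root (such as a \n outside the outermost <html> tags), then referencing XPaths from the root will not work, and you have to use relative XPaths. https://lxml.de/xpathxslt.html - wordsforthewise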

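"Martijn's code no longer works properly (it is 4+ years old by now...), the etree.parse() line prints to the console and doesn't assign the value to the tree variable." That is quite a claim. I certainly cannot reproduce it, and it would not make any sense. Are you sure you are using Python 2 to test my code, or have translated the urllib2 library use to Python 3 urllib.request? - Martijn Pieters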

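Yes, it may be that I used Python 3 when writing that and it did not work as expected. I just tested it, and your code works with Python 2, but Python 3 is much preferred, since Python 2 will no longer be officially supported as of 2020. - wordsforthewise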

Absolutely agree, but the question here uses Python 2. - Martijn Pieters

24

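BeautifulSoup has a function named findNext that finds the next matching element after the current element in document order (not only among its children), so: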

father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')

The code above can roughly imitate the following XPath:

div[@class="class_value"]/div[@id="id_value"]//a
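
To make the difference between findNext chaining and a real XPath child step concrete, here is a minimal sketch. The HTML snippet, the `father` id, and the variable names are made up for illustration; `find_next` is the bs4 spelling of findNext. It shows that the chain can wander outside the first div's subtree, whereas a CSS child combinator (or the XPath path) stays inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the div with id="id_value" is NOT nested inside
# the div with class="class_value".
html = """
<div id="father">
  <div class="class_value">
    <span>no nested id here</span>
  </div>
</div>
<div id="id_value"><a href="/outside">outside</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
father = soup.find("div", id="father")

# find_next walks forward in document order, so the second call
# leaves the subtree of div.class_value and still finds a match.
chained = father.find_next("div", {"class": "class_value"}).find_next(
    "div", {"id": "id_value"}
)
chained_links = [a["href"] for a in chained.find_all("a")]

# A child combinator only matches a div#id_value that is a direct
# child of div.class_value, which is what the XPath path expresses.
strict_links = [
    a["href"] for a in father.select("div.class_value > div#id_value a")
]

print(chained_links)
print(strict_links)
```

So the chain is only a rough stand-in for the path: it matches when the elements really are nested, but it can also match when they are not.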

- user3820561

22

from lxml import etree
from bs4 import BeautifulSoup

# Parse the local file with BeautifulSoup, then hand the markup to lxml for XPath
with open('path of your localfile.html') as f:
    soup = BeautifulSoup(f, 'html.parser')
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()'))

The above combines the soup object with lxml, so that values can be extracted with XPath.

- Deepak rayathurai

4

When you use lxml, it is all simple:

import lxml.html

tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')

But when using BeautifulSoup (BS4), it is all simple too:

First, remove "//" and "@"
Second, add a star before "="

Try this magic:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')

As you can see, this does not support sub-tags, so I removed the "/@href" part.

- Oleksandr Panchenko

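select() is for use with CSS selectors; it is not XPath at all. "As you can see, this does not support sub-tags": while I am not sure whether that was true at the time, it certainly is not now. - AMC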
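
As a small illustration of that last point, the sketch below shows select() following child steps and then reading the href attribute from each match. The markup and the `ul#menu` selector are invented for the example. Note that a CSS class selector matches any element carrying that class token, while the XPath test @class="shared-components" requires the whole attribute to be equal, so the two are close but not identical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = """
<ul id="menu">
  <li><a class="shared-components" href="/a">A</a></li>
  <li><a class="other" href="/b">B</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Rough CSS counterpart of
#   //ul[@id="menu"]/li/a[@class="shared-components"]/@href
# CSS cannot select an attribute node, so the href is read from each tag.
hrefs = [a["href"] for a in soup.select("ul#menu > li > a.shared-components")]
print(hrefs)
```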

3

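I have searched through their docs, and it seems there is no XPath option.

Also, as you can see in a similar question here on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.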

- Nikola

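Yes, actually until now I have used Scrapy, which uses XPath to fetch the data inside tags. It is very handy and makes fetching data easy, but I now need to do the same with BeautifulSoup, so I am looking forward to trying it. - Shiva Krishna Bavandla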

1

Maybe you can try the following, without using XPath.

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = '''
<html>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
'''
# What XPath can do, so can it
doc = SimplifiedDoc(html)
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print (doc.body.div.h1.text)
print (doc.div.h1.text)
print (doc.h1.text) # Shorter paths will be faster
print (doc.div.getChildren())
print (doc.div.getChildren('p'))

- dabingsou

-1

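This is a pretty old thread, but there is a workaround now that may not have existed in BeautifulSoup at the time.

Here is an example of what I did. I use the "requests" module to read an RSS feed and store its text content in a variable called "rss_text". I then run it through BeautifulSoup, look up the XPath /rss/channel/title, and retrieve its contents. It is not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.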

from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()

- David A

1

I believe this only finds child elements. Isn't XPath a different thing? - robertspierre

This is just the regular navigation that BeautifulSoup provides. - sourcream
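
A short sketch of what that navigation does and does not cover, using an invented feed. It is parsed with "html.parser" here only so the example does not depend on lxml, which the 'xml' parser requires. Dotted access returns just the first matching descendant at each step, so collecting every item, which a path such as /rss/channel/item would return, needs find_all().

```python
from bs4 import BeautifulSoup

# Hypothetical feed content for illustration
rss_text = """
<rss><channel>
  <title>Feed title</title>
  <item><title>First post</title></item>
  <item><title>Second post</title></item>
</channel></rss>
"""
# html.parser is an assumption made to keep the sketch dependency-free
rss_obj = BeautifulSoup(rss_text, "html.parser")

# Dotted navigation: each step picks the first matching descendant only
feed_title = rss_obj.rss.channel.title.get_text()
first_item = rss_obj.rss.channel.item.title.get_text()

# All items have to be collected explicitly
all_items = [item.title.get_text() for item in rss_obj.find_all("item")]

print(feed_title, first_item, all_items)
```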

-5

Use soup.find(class_='myclass')

- Γιωργος Αλεξανδρου

- Martijn Pieters · Accepted Answer

No, BeautifulSoup by itself does not support XPath expressions.

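An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup-compatible mode in which it tries to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.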

Once you have parsed your document into an lxml tree, you can use the .xpath() method to search for elements.

try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen
from lxml import etree

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)

There is also a dedicated lxml.html module with additional functionality.
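
For a version that can be run without a network connection, here is a minimal sketch that parses an inline string with lxml.html and applies the XPath counterpart of the question's findAll call. The table markup and cell contents are made up for illustration.

```python
import lxml.html

# Hypothetical markup standing in for the downloaded page
html = """
<table id="foobar">
  <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
  <tr><td class="empformbody">Bob</td></tr>
</table>
"""
tree = lxml.html.fromstring(html)

# XPath counterpart of soup.findAll('td', attrs={'class': 'empformbody'})
cells = tree.xpath('//td[@class="empformbody"]')

# text_content() is one of the extras that lxml.html elements provide
texts = [td.text_content().strip() for td in cells]
print(texts)
```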

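Note that in the urlopen example above I passed the response object directly to lxml, because having the parser read directly from the stream is more efficient than first reading the response into a large string. To do the same with the requests library, you need to set stream=True and pass in the response.raw object after enabling transparent transport decompression: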

import lxml.html
import requests

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)

Of possible interest to you is the CSS selector support; the CSSSelector class translates CSS statements into XPath expressions, which makes your search for td.empformbody that much easier:

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    print(elem.text)

Coming full circle: BeautifulSoup itself does have very complete CSS selector support:

for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    print(cell.get_text())