Beautiful Soup nested tag search

Question

python html beautifulsoup

15

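I am trying to write a Python program that counts the number of words on a web page. I use Beautiful Soup 4 to scrape the page, but I am having trouble accessing nested HTML tags (for example, a <p class="hello"> inside a <div>).

Every time I try to find such a tag with page.findAll() (page is the Beautiful Soup object containing the whole page), it does not find any, even though the tags are there. Is there a simple method, or another way, to do this?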

- Asafwr

1

Please show some of the code you have tried, and the page you want to scrape. - Anonta

4 Answers

3

Try this:

data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')

Maybe you could turn this into a lambda expression to make it cooler, but this approach works. Thanks.
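The loop above can also be collapsed into a single comprehension or a CSS descendant selector. This is only a sketch: the sample markup is made up, and `div` and `p` stand in for the placeholder `xyz` and `abc` tags.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; 'div' and 'p' stand in for 'xyz' and 'abc'.
html = """
<div><p>one</p><span><p>two</p></span></div>
<div><p>three</p></div>
<p>outside</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Comprehension form of the nested loop: every <p> under every <div>.
nested = [p for div in soup.find_all("div") for p in div.find_all("p")]

# CSS descendant selector: each matching <p> is returned once.
selected = soup.select("div p")

print([p.get_text() for p in nested])
print([p.get_text() for p in selected])
```

If the outer tags can be nested inside each other, the loop form may return the same inner tag more than once, while `select()` returns each match once.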

- Maifee Ul Asad

0

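You can use a regular expression (the re module) to find all the <p> tags. Note that r.text is a string containing the HTML of the whole site (r.content holds the same data as raw bytes).

For example:

import re
import requests
r = requests.get(url, headers=headers)  # url and headers are assumed to be defined
p_tags = re.findall(r'<p>.*?</p>', r.text)

This gets all the <p> tags, whether or not they are nested. If you want the a tags inside a particular <p> tag, pass that whole tag as the string in the second argument instead of r.text.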
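One caveat is worth illustrating: the literal pattern `<p>` does not match a tag that carries attributes, such as the `<p class="hello">` from the question. The sketch below uses a made-up HTML string to compare the regex with Beautiful Soup's `find_all`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup: one <p> with a class attribute, one without.
html = '<div><p class="hello">Hi there</p><p>Plain</p></div>'

# The regex only sees the attribute-free <p>.
regex_hits = re.findall(r"<p>.*?</p>", html)

# The parser finds both <p> tags regardless of attributes.
soup = BeautifulSoup(html, "html.parser")
parser_hits = soup.find_all("p")

print(len(regex_hits), len(parser_hits))
```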

Alternatively, if you only want the text, you can try the following:

from readability import Document #pip install readability-lxml
import requests
r = requests.get(url,headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()

This gives you a more simplified version of the site's HTML, which you can then go on to parse.

- jayee

0

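UPDATE: I noticed that text does not always return the expected result. I then found in the documentation that there is a built-in method for getting the text, get_text(), which is used as follows: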

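from bs4 import BeautifulSoup

# Read the local HTML file; the context manager closes it automatically
with open('index.html', 'r') as fd:
    website = fd.read()
soup = BeautifulSoup(website, 'html.parser')
contents = soup.get_text(separator=" ")
# split() with no argument collapses runs of whitespace
print("number of words %d" % len(contents.split()))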

The approach below is incorrect; see the update above. Assuming your HTML file is stored locally as index.html, you can do the following:

from bs4 import BeautifulSoup
import re

BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored
with open('index.html', 'r') as fd:
    website = fd.read()
soup = BeautifulSoup(website, 'html.parser')
tags = soup.find_all(True)  # find every tag
print("there are %d tags" % len(tags))

count = 0
# Non-capturing group so split() does not return the separators themselves
matcher = re.compile(r"(?:\s|<br>)+")
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text)  # split on whitespace and <br>
    temp = [word for word in temp if word]  # remove empty elements
    count += len(temp)
print("number of words in the document %d" % count)

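Note that the count may be inaccurate because of malformed markup, false positives (it counts any word, even code), text displayed dynamically through JavaScript or CSS, or other reasons.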
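One possible way to reduce the false positives that come from embedded code is to drop `script` and `style` elements before calling `get_text()`. This is a minimal sketch on a made-up HTML string, not part of the original answer; the list of tags to remove is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page with code-bearing tags in the head.
html = """
<html><head><title>T</title><script>var x = 1;</script>
<style>p { color: red; }</style></head>
<body><div><p>Hello   big world</p></div></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Remove tags whose text should not count as page words.
for tag in soup.find_all(["script", "style", "title"]):
    tag.decompose()

# split() with no argument ignores repeated whitespace and newlines.
words = soup.get_text(separator=" ").split()
print(len(words))
```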

- Melardev

Thank you, but I only want to count the text inside <div> tags of a specific class, not all the text on the page. - Asafwr

- Mario Kirov · Accepted Answer

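My guess is that you want to first find a specific div tag, then search for all the p tags inside it and count them, or do whatever else you want. For example: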

import bs4

# content is assumed to hold the page's HTML
soup = bs4.BeautifulSoup(content, 'html.parser')

# This will get the div
div_container = soup.find('div', class_='some_class')

# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)

Hope this helps.
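To tie this back to the word-counting goal in the question, here is a minimal sketch that counts words only in the matching p tags inside the chosen div. The sample HTML is invented for illustration, and the `None` check is an added safeguard for pages where the div is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup reusing the class names from the answer.
html = """
<div class="some_class">
  <p class="hello">Hello nested world</p>
  <p class="other">ignored</p>
  <section><p class="hello">two more</p></section>
</div>
<p class="hello">outside the div</p>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns None when no matching div exists, so guard against that.
div_container = soup.find("div", class_="some_class")

word_count = 0
if div_container is not None:
    # find_all() searches all descendants, so deeper nesting is covered too.
    for ptag in div_container.find_all("p", class_="hello"):
        word_count += len(ptag.get_text().split())

print(word_count)
```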