Python: extracting text from a web page

Question

python html parsing web-scraping web-crawler

3

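I am working on a project that requires crawling thousands of websites to extract text data; the end use case is natural language processing.

Edit: *Because I am crawling hundreds of thousands of websites, I cannot write site-specific scraping code for each one, which means I cannot search for specific element IDs. The solution I am looking for has to be generic.*

I am aware of solutions such as the .get_text() function in Beautiful Soup. The problem with this approach is that it returns all the text on the site, much of which is irrelevant to the topic of that particular page. Most of the time a web page is dedicated to a single topic, but the sidebars, header and footer may contain links or text about other topics, promotions or other content.

Calling .get_text() returns all the text on the page in one go, so the relevant parts are mixed in with the irrelevant ones. Is there a function similar to .get_text() that returns all the text as a list, with each list item being a distinct section of text, so that it is possible to tell where a new topic starts and ends?

Also, is there a way to identify the main body text of a web page?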

- Mustard Tiger

1

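Maybe you could try using regular expressions to get the links you need. - anveshjhuboo

@MustardTiger, have you tried find_all? It lets you search for elements by tag and attribute, and then you can read .text on each result. - sushanth

2 Answers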

0

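The closest real-world example of what you are looking for is probably Reader Mode, also called Reader View, in Firefox, Safari and other browsers.

There is a StackOverflow question on this topic: How does Firefox reader view operate

Firefox reportedly relies on github.com/mozilla/readability, which Mozilla has generously open-sourced.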
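To give a feel for how a generic "main text" filter can work without site-specific selectors, here is a minimal sketch. It is not Readability's algorithm; it is a much simpler heuristic that keeps text blocks which are reasonably long and contain few link characters, on the assumption that navigation bars, sidebars and footers are short or link-heavy. The names `DensityParser` and `main_text`, the tag lists, and the thresholds `min_chars` and `max_link_density` are all illustrative assumptions that would need tuning on real pages. It uses only the standard library.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumed, non-exhaustive list).
BLOCK_TAGS = {"p", "div", "section", "article", "aside", "nav", "header",
              "footer", "li", "td", "br", "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose contents are never visible text.
SKIP_TAGS = {"script", "style", "noscript"}


class DensityParser(HTMLParser):
    """Records (text, link_chars) for every block of visible text."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks: (text, chars inside <a>)
        self._buffer = []      # text pieces of the block being built
        self._link_chars = 0   # characters of the current block inside links
        self._in_link = 0      # > 0 while inside an <a> element
        self._skip_depth = 0   # > 0 while inside script/style/noscript

    def _flush(self):
        # Join the pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._buffer, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._buffer.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def main_text(html, min_chars=40, max_link_density=0.3):
    """Return the text blocks that look like body content (heuristic)."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.blocks:
        density = link_chars / len(text)
        if len(text) >= min_chars and density <= max_link_density:
            kept.append(text)
    return kept
```

Calling `main_text(response.text)` on a downloaded page returns a list of the blocks that passed both checks. A dedicated extraction library will be far more robust; this only illustrates the length and link-density idea.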

- ᴍᴇʜᴏᴠ

- YATIN GUPTA · Accepted Answer

Here is sample code for querying data with BeautifulSoup4 and Python 3:

import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace it with the page you want to scrape
response = requests.get('https://yoursite/page')
response.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')
# soup.body.contents is the list of direct children of <body>;
# print the first child
print(soup.body.contents[0])
# Print the first div found on the page
print(soup.find('div'))
# Print all divs on the page as a list
print(soup.find_all('div'))
# Print the element whose id is 'required_element_id'
print(soup.find(id='required_element_id'))
# Print, as a list, all elements that match a CSS selector
required_css_selectors = 'div.content p'  # placeholder selector
print(soup.select(required_css_selectors))
# Print the value of one attribute of an element
# (find() returns None if no element has this id)
print(soup.find(id='someid').get("attribute-name"))
# You can also break one large query into several smaller ones
parent = soup.find(id='someid')
# get_text() returns the text between the opening and closing tags
print(parent.select(".some-class")[0].get_text())

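If you have more advanced requirements, you can also look at Scrapy. If you run into any challenges during implementation, or your requirements are different, let me know.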
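On the question's request for something like .get_text() that returns a list of separate text sections rather than one string, the following is a minimal sketch of one way to do it. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies. The names `BlockTextParser` and `text_blocks` and the choice of which tags count as block boundaries are assumptions made for illustration, not part of any library.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumed, non-exhaustive list).
BLOCK_TAGS = {"p", "div", "section", "article", "aside", "nav", "header",
              "footer", "li", "td", "br", "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose contents are never visible text.
SKIP_TAGS = {"script", "style", "noscript"}


class BlockTextParser(HTMLParser):
    """Collects visible text as a list of blocks instead of one string."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished text blocks, in document order
        self._buffer = []      # text pieces of the block being built
        self._skip_depth = 0   # > 0 while inside script/style/noscript

    def _flush(self):
        # Join inline pieces, collapse whitespace, and store non-empty text.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()  # a new block starts here

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()  # the current block ends here

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any trailing text


def text_blocks(html):
    """Return the page's visible text as a list, one item per block."""
    parser = BlockTextParser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

With the code above, `text_blocks(response.text)` would give one list item per paragraph, heading, list item and so on. With BeautifulSoup itself, iterating over `soup.stripped_strings` also yields the page's text piece by piece, although it splits at every tag, including inline ones, rather than at block boundaries.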