Extracting readable text from HTML using Python?

Question

4

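I know about tools like html2text and BeautifulSoup, but the problem is that they also extract the JavaScript and add it to the text, which makes it hard to separate the two.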

htmlDom = BeautifulSoup(webPage)

htmlDom.findAll(text=True)

Alternatively,

from stripogram import html2text
extract = html2text(webPage)

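Both of these approaches also extract all of the JavaScript code on the page, which is unwanted.

I only want to extract the readable text, that is, the text you could copy from a browser.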

- demos

4 Answers

0

With BeautifulSoup, something along these lines:

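def _extract_text(t):
    # Python 3 port; in bs4, text nodes are str subclasses.
    if not t:
        return ""
    if isinstance(t, str):
        # Text node: collapse newlines and runs of spaces into single spaces.
        return " ".join(filter(None, t.replace("\n", " ").split(" ")))
    if t.name.lower() == "br":
        return "\n"
    if t.name.lower() == "script":
        return "\n"  # drop the script body entirely
    return "".join(extract_text(c) for c in t)

def extract_text(t):
    # Trim leading and trailing whitespace on every output line.
    return "\n".join(x.strip() for x in _extract_text(t).split("\n"))

print(extract_text(htmlDom))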

- Forrest Voight

0

You can remove the script tags in Beautiful Soup, with something like this:

for script in soup("script"):
    script.extract()

See "Removing elements" in the Beautiful Soup documentation.
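A minimal, self-contained sketch of this approach, assuming the bs4 package (Beautiful Soup 4) and its built-in "html.parser" backend. The sample markup and the helper name text_without_scripts are made up for illustration, and removing style tags as well is an extra assumption beyond the snippet above:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo</title>
    <script>var tracker = 1;</script>
    <style>p { color: red; }</style>
  </head>
  <body>
    <p>Hello, <b>world</b>!</p>
    <script>alert("hidden");</script>
  </body>
</html>
"""

def text_without_scripts(html):
    soup = BeautifulSoup(html, "html.parser")
    # Detach every <script> and <style> element, contents included, from the tree.
    for tag in soup(["script", "style"]):
        tag.extract()
    # Join the remaining text nodes, trimming each one and dropping empty ones.
    return " ".join(soup.stripped_strings)

print(text_without_scripts(SAMPLE_HTML))
```

Because extract() modifies the parsed tree in place, parse the page again if the scripts are still needed elsewhere.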

- jkyle

Looks like a quick solution, but what is the performance penalty of extracting the tags? - demos

0

Try these:

http://code.google.com/p/boilerpipe/

http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/

- saravanan

- Alex Martelli · Accepted Answer

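If you want to avoid extracting the contents of any script tag with BeautifulSoup,

nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)

will do that for you, returning the root's immediate children that are non-script tags (and a separate htmlDom.findAll(recursive=False, text=True) will return the strings that are immediate children of the root). You need to do this recursively; for example, as a generator: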

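from bs4 import Tag  # assumes the bs4 package

def nonScript(tag):
    # Predicate usable with findAll: true for every tag except <script>.
    return tag.name != 'script'

def getStrings(root):
    # Yield every string under root in document order, skipping <script> subtrees.
    for s in root.childGenerator():
        if isinstance(s, Tag):          # then it's a tag
            if s.name == 'script':      # skip it!
                continue
            yield from getStrings(s)
        else:                           # it's a string!
            yield s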

I'm using childGenerator (instead of findAll) so that I can get all the children in order and do my own filtering.
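The same idea of skipping the whole script subtree can also be sketched without BeautifulSoup, using only html.parser from the Python 3 standard library. This is a swapped-in technique for comparison, not the approach described above. The names VisibleTextParser and visible_text are hypothetical, and skipping style elements as well is an added assumption:

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text only when we are outside every skipped element.
        if text and self._skip_depth == 0:
            self.chunks.append(text)

def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

print(visible_text('<p>Hello</p><script>alert("x");</script><p>world</p>'))
```

This sketch does no error recovery for badly broken markup, so a tolerant parser such as BeautifulSoup is likely the better choice for pages found in the wild.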