How can I get text from HTML with BeautifulSoup while ignoring formatting tags?

Question

I am using the following code to pull consecutive segments of text out of HTML.

    for text in soup.find_all_next(text=True):
        if isinstance(text, Comment):
            # We found a comment, ignore
            continue
        if not text.strip():
            # We found a blank text, ignore
            continue
        # Whatever is left must be good
        print(text)

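The text items are split at structural tags (such as <div> or <br>) and also at formatting tags (such as <em> and <strong>). That makes further parsing of the text inconvenient. I would like to grab contiguous text items while ignoring any formatting tags inside the text.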

For example, I would like soup.find_all_next(text=True) to take the HTML <div>This is <em>important</em> text</div> and return the single string "This is important text" instead of the three strings "This is", "important" and "text".
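To make the difference concrete, here is a minimal sketch of what BeautifulSoup does with that snippet. It uses `string=True`, the newer spelling of `text=True` (Beautiful Soup 4.4.0 and later), and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>This is <em>important</em> text</div>", "html.parser")

# Each text node is returned separately, so the <em> tag splits the sentence.
pieces = [str(s) for s in soup.find_all(string=True)]
print(pieces)

# Joining the pieces without a separator restores the contiguous sentence.
joined = "".join(pieces)
print(joined)
```

The whitespace that keeps the words apart lives inside the individual pieces, which is why a plain `"".join(...)` is enough here.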

I'm not sure whether that is clear... please let me know if it isn't.

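Edit: the reason I walk through the HTML text items one by one is that I only start walking after I see a specific "begin" comment marker, and I stop when I reach a specific "end" comment marker. Is there a solution that works in a context where item-by-item traversal is required? The full code I am using is below.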

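from bs4 import BeautifulSoup, Comment

# page holds the HTML source; isBeginText and isEndText are described below
soup = BeautifulSoup(page, 'html.parser')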
for instanceBegin in soup.find_all(text=isBeginText):
    # We found a start comment, look at all text and comments:
    for text in instanceBegin.find_all_next(text=True):
        # We found a text or comment, examine it closely
        if isEndText(text):
            # We found the end comment, everybody out of the pool
            break
        if isinstance(text, Comment):
            # We found a comment, ignore
            continue
        if not text.strip():
            # We found a blank text, ignore
            continue
        # Whatever is left must be good
        print(text)

The two functions isBeginText(text) and isEndText(text) return true when the string passed to them matches my begin or end comment marker, respectively.
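The question does not show those two predicates, so the following is only a sketch of what they might look like. The marker names `BEGIN` and `FINISH` and the sample `page` are assumptions made up for illustration, not the asker's real markers.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical marker names; the real page uses its own comment text.
BEGIN_MARKER = "BEGIN"
END_MARKER = "FINISH"

def isBeginText(text):
    """True when text is a comment whose content is the begin marker."""
    return isinstance(text, Comment) and text.strip() == BEGIN_MARKER

def isEndText(text):
    """True when text is a comment whose content is the end marker."""
    return isinstance(text, Comment) and text.strip() == END_MARKER

page = ("<p>skip</p><!-- BEGIN --><p>keep <em>this</em></p>"
        "<!-- FINISH --><p>skip too</p>")
soup = BeautifulSoup(page, "html.parser")

kept = []
for instanceBegin in soup.find_all(string=isBeginText):
    for text in instanceBegin.find_all_next(string=True):
        if isEndText(text):
            break  # end marker reached, stop collecting
        if isinstance(text, Comment) or not text.strip():
            continue  # ignore other comments and blank text
        kept.append(str(text))
print(kept)
```

Run on this sample, the loop still yields "keep " and "this" as two separate items, which is exactly the splitting the question complains about.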

- wrkyle

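How do you want to handle two nested block-level tags, for example <div>A<p>B</p>C</div>? What result do you expect? In any case, I think you should check whether the current tag has any children. If it does, recursively check whether those children are of the "formatting" kind (note that this is subjective: you count em as one of them, but not br) and, if so, remove the formatting tag while keeping its inner HTML. Maybe I have not fully understood your question, but wouldn't that solve your problem? - Oliver W.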
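One way to read that suggestion is the sketch below, which swaps the recursive check for BeautifulSoup's `unwrap()` (remove a tag but keep its contents) followed by `smooth()` (merge adjacent strings, available from Beautiful Soup 4.8.0). The list of formatting tags is an assumption, since the choice is subjective.

```python
from bs4 import BeautifulSoup

# Assumed set of "formatting" tags; extend or shrink it to taste.
FORMATTING_TAGS = ["em", "strong", "b", "i"]

html = "<div>A <em>very</em> <strong>short</strong> line<p>B</p>C</div>"
soup = BeautifulSoup(html, "html.parser")

# Drop each formatting tag but keep the text it wrapped.
for tag in soup.find_all(FORMATTING_TAGS):
    tag.unwrap()

# unwrap() leaves neighbouring strings as separate nodes; smooth() merges them.
soup.smooth()

strings = [str(s) for s in soup.find_all(string=True)]
print(strings)
```

Structural tags such as <p> are left alone, so the nested-block example from the comment still produces one item per block.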

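Yes, I see what you mean. Actually, I don't care about any formatting beyond preserving basic sentence structure. As long as sentences stay intact (that is, words don't get squeezed together), I can ignore <br>, <p> and similar tags. I know about the soup.get_text() method, but I'm not sure how to apply it under my begin and end marker constraint (see the edit to my original question). - wrkyle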

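@OliverW. Exactly: the start marker is a comment tag, and the end marker is also a comment tag. I want all the text between those two comment tags. If there are newlines or line breaks, I'm fine with replacing them with spaces, as long as sentences and words stay intact. - wrkyle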
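For readers who want that behaviour in one pass, here is a rough sketch that walks `next_elements` from the begin comment to the end comment and collapses everything into a single space-normalized string. The marker names, the `text_between` helper and the set of separator tags are all hypothetical choices, not something from the thread.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical marker names and separator tags; adjust to the real page.
BEGIN, FINISH = "BEGIN", "FINISH"
SEPARATOR_TAGS = {"br", "p", "div"}

def text_between(soup, begin, finish):
    """Return the text between two comment markers as one normalized string."""
    start = soup.find(
        string=lambda s: isinstance(s, Comment) and s.strip() == begin)
    if start is None:
        return ""
    pieces = []
    for element in start.next_elements:
        if isinstance(element, Comment):
            if element.strip() == finish:
                break       # reached the end marker
            continue        # skip any other comment
        if isinstance(element, Tag):
            if element.name in SEPARATOR_TAGS:
                pieces.append(" ")  # keep the words on either side apart
        elif isinstance(element, NavigableString):
            pieces.append(str(element))
    # Collapse newlines and repeated whitespace into single spaces.
    return " ".join("".join(pieces).split())

html = ("<div>intro <!-- BEGIN --><p>This is <em>important</em> text.</p>"
        "<p>Second<br>line.</p><!-- FINISH --> outro</div>")
soup = BeautifulSoup(html, "html.parser")
result = text_between(soup, BEGIN, FINISH)
print(result)
```

A space is only added when a separator tag opens, so text that directly follows a closing block tag with no whitespace in between would still run together. Treat this as a starting point rather than a complete solution.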

2 Answers

How about calling find_all_next twice, once from the begin marker and once from the end marker, and taking the difference of the two resulting lists?

As an example, I will use a modified version of html_doc from the BeautifulSoup documentation:

import bs4

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<!-- START--><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><!-- END -->

<p class="story">...</p>
"""

soup = bs4.BeautifulSoup(html_doc, 'html.parser')
comments = soup.findAll(text=lambda text:isinstance(text, bs4.Comment))

# Step 1: find the beginning and ending markers
node_start = [ cmt for cmt in comments if cmt.string == " START" ][0]
node_end = [ cmt for cmt in comments if cmt.string == " END " ][0]

# Step 2, subtract the 2nd list of strings from the first
all_text = node_start.find_all_next(text=True)
all_after_text = node_end.find_all_next(text=True)

subset = all_text[:-(len(all_after_text) + 1)]
print(subset)

# ['Lacie', ' and\n', 'Tillie', ';\nand they lived at the bottom of a well.']

- Oliver W.

Give me a few minutes to try this out. - wrkyle

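This works great! Wow. I had to modify it a bit to squeeze it into my code, but the output preserves not only the sentence structure but also the formatting. I could write it straight to a file if I wanted. Thanks, man! The only problem is that it picks up comments as well as text. Is it possible to exclude comments when searching with find_all_next? - wrkyle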

Maybe I can use the extract method to pull the comments (other than the two I need) out of the selected body of text? I'll give it a try. - wrkyle

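Got it. After we list the comments with soup.findAll, I added the line [cmt.extract() for cmt in comments if cmt.string != start and cmt.string != end], where start and end are the strings of the begin and end comments. I learned something new tonight. You're a wizard, thanks for taking the time! - wrkyle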
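Put together with the answer above, that extra step might look like the following sketch. The sample HTML and the marker strings are made up for illustration, and a plain loop stands in for the list comprehension because `extract()` is called only for its side effect.

```python
from bs4 import BeautifulSoup, Comment

# Marker strings as they appear inside the comments (assumed for this sample).
START, END = " START", " END "

html = ("<p>before<!-- START-->one <!-- note -->two <b>three</b></p>"
        "<!-- END --><p>after</p>")
soup = BeautifulSoup(html, "html.parser")
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Remove every comment from the tree except the two markers.
for cmt in comments:
    if str(cmt) not in (START, END):
        cmt.extract()

node_start = next(c for c in comments if str(c) == START)
node_end = next(c for c in comments if str(c) == END)

# Same subtraction as in the answer: drop the end marker and all that follows.
all_text = node_start.find_all_next(string=True)
all_after_text = node_end.find_all_next(string=True)
subset = [str(s) for s in all_text[:-(len(all_after_text) + 1)]]
print(subset)
```

If mutating the tree is undesirable, filtering the final list with `isinstance(s, Comment)` should give the same strings without calling `extract()`.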

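For future readers of this question, I would like to mention a possible alternative (the one I used before Oliver W. provided his excellent solution): Aaron Swartz's html2text. It converts HTML into Markdown text and does so very well, but it isn't quite right for what I'm doing here. Just for reference. - wrkyle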

- chafreaky · Accepted Answer

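If you grab the parent element that contains the child elements and call get_text() on it, BeautifulSoup strips out all the HTML tags for you and returns the text as one contiguous string.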

You can find an example here.

from bs4 import BeautifulSoup

# html_doc is assumed to hold the HTML source as a string
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())
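Since the question worries about words being squeezed together, it may help to see how the optional `separator` and `strip` arguments of `get_text()` behave. This is a small sketch on made-up HTML, not part of the original answer.

```python
from bs4 import BeautifulSoup

html = ("<div><p>First <strong>bold</strong> sentence.</p>"
        "<p>Second sentence.</p></div>")
soup = BeautifulSoup(html, "html.parser")

# Default call: strings are concatenated as-is, so the paragraphs run together.
plain = soup.get_text()
print(plain)

# With a separator and strip=True, each piece is stripped and joined by a space.
spaced = soup.get_text(" ", strip=True)
print(spaced)
```

The separator is inserted between every text node, including those split by inline tags, so a word that is only partly wrapped in a formatting tag would gain a space in the middle.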