Extract text from HTML tags and from plain text (not wrapped in tags)

Question

Extract text from HTML tags and from plain text (not wrapped in tags)

3

<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a> 
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a> 
from one's 
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>

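I am trying to reconstruct the sentence "to pay charges from one's bank account", which is split up in the HTML above. My problem is that part of the sentence is not wrapped in any HTML tag. When I try to use:

BeautifulSoup.find_all()

I only get the text inside the link tags, and when I try to use:

BeautifulSoup.contents

I only get "from one's", but not the text inside the link tags.

Is there a way to walk through this markup and reconstruct the sentence?

Edit: The code above is just an example. I am scraping a dictionary, so the order of the strings, and which parts end up inside or outside tags, is arbitrary.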

- BluNova897

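Try soup.text. - Alex Hall

You can try the get_text() function as described here: https://dev59.com/HmQo5IYBdhLWcg3wQNb4 - Luc

@Luc, that does not give me the result I want. When I use get_text() I do get all the text inside the tags, but I am still missing the part that is not inside an <a> tag. - BluNova897

You need to apply .text or get_text() to the <p> tag, not to the <a> tags. - Alex Hall

3 Answers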

1

Edit: After digging into the dictionary site, I came up with the following solution. For the <p> tag of each sentence, we can do the following:

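from bs4.element import NavigableString, Tag

# p is assumed to be the <p> Tag that holds one sentence
res = []

for segment in p.contents:
    if isinstance(segment, NavigableString):
        # bare text that sits directly inside <p>
        res.append(segment)
    elif isinstance(segment, Tag):
        # text wrapped in a child tag such as <a>
        res.append(segment.text)

# the slice drops the last two segments; adjust it to the page's trailing nodes
final_sentence = ''.join(res[:-2])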

Hope it helps.
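
For anyone who wants to try the loop on the sample markup from the question, here is a minimal self-contained sketch. It makes a few assumptions that are not in the answer: it uses the built-in "html.parser", a trimmed copy of the sample HTML (title attributes removed), no slice (the sample has no extra trailing nodes), and a whitespace cleanup step at the end.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Trimmed copy of the sample markup from the question (title attributes omitted).
html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" class="crosslink">bank account</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p", class_="qotCJE")

pieces = []
for segment in p.contents:
    if isinstance(segment, NavigableString):
        # bare text directly inside <p>, including the newlines between tags
        pieces.append(str(segment))
    elif isinstance(segment, Tag):
        # text wrapped in a child tag such as <a>
        pieces.append(segment.get_text())

# Collapse the newlines and repeated spaces left over from the markup.
sentence = " ".join("".join(pieces).split())
print(sentence)
```

Because the loop visits the children of <p> in document order, it does not matter which parts of the sentence are linked and which are bare text.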

If you only want to extract the text from the title attributes, you can do this:

from bs4 import BeautifulSoup

# assuming text is the html text given above
soup = BeautifulSoup(text, 'html5lib')
a_tags = soup.select('a')
a_strs = (a['title'] for a in a_tags)
# unpack the generator so that each title fills one placeholder
final_sentence = "{} {} from one's {}".format(*a_strs)

- ujhuyz0110

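Right, but I am trying to do this for generic responses (I am scraping an online dictionary), so inserting static strings does not work for me. Maybe I should clarify that in my post. - BluNova897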

0

Another way to achieve the same result:

from bs4 import BeautifulSoup

content = """
<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>
"""
soup = BeautifulSoup(content,"lxml")
print(soup.get_text(" ",strip=True))

Output:

to pay charges from one's bank account

- SIM

- Alex Hall · Accepted Answer

from bs4 import BeautifulSoup

html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>"""

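soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

print(soup.text)
# to pay
# charges
# from one's
# bank account

# strip() removes the spaces left by the leading and trailing newlines
print(soup.text.replace('\n', ' ').strip())
# to pay charges from one's bank account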
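
Since the question mentions scraping many dictionary entries where linked and unlinked text appear in arbitrary order, the same idea can be wrapped in a small helper that is applied to each <p> tag rather than to the whole document. This is only a sketch: the function name sentences_from_html, the default class "qotCJE" (taken from the sample), and the test markup are assumptions for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup


def sentences_from_html(html, css_class="qotCJE"):
    """Return one reconstructed sentence per matching <p> tag (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    sentences = []
    for p in soup.find_all("p", class_=css_class):
        # stripped_strings yields every text node under <p> in document order,
        # whether or not it is inside an <a>, with surrounding whitespace removed.
        sentences.append(" ".join(p.stripped_strings))
    return sentences


# Made-up markup with two entries, used only to exercise the helper.
sample = (
    '<div>'
    '<p class="qotCJE"><a href="#">to pay</a> <a href="#">charges</a> '
    "from one's <a href=\"#\">bank account</a></p>"
    '<p class="qotCJE">plain <a href="#">text</a> only</p>'
    '</div>'
)

result = sentences_from_html(sample)
print(result)
```

One caveat: joining with a space also puts a space between adjacent tags that split a single word, so markup that links only part of a word would need a different separator.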