How do I add a tag to each sentence and customize the mark color?

Question

3

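I am using Beautiful Soup and Requests to load the HTML of a website (for example https://en.wikipedia.org/wiki/Elephant). I want to mimic this page, but I want to color the sentences inside the "p" tags (paragraphs).

To do this, I use spaCy to split the text into sentences. I then pick a color for each sentence (for those interested, the color is based on the probability from a binary deep learning classifier).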

def get_colorized_p(p):
    
    doc = nlp(p.text) # p is the beautiful soup p tag
    string = '<p>'
    for sentence in doc.sents:
        # The prediction value is anything within 0 to 1.
        prediction = classify(sentence.text, model=model, pred_values=True)[1][1].numpy()
        # I am using a custom function to map the prediction to a hex colour.
        color = get_hexcolor(prediction)
        string += f'<mark style="background: {color};">{sentence.text} </mark> '
    string += '</p>'
    return string # I create a new long string with the markup
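The get_hexcolor helper is not shown in the question. As a rough idea of what such a mapping could look like, here is a hypothetical stand-in that linearly interpolates each RGB channel between two endpoint colors. The endpoint colors and the interpolation scheme are assumptions for illustration, not the asker's actual implementation.

```python
def get_hexcolor(prediction, low=(0xED, 0xF8, 0xFB), high=(0x00, 0x6D, 0x2C)):
    """Hypothetical stand-in: map a probability in [0, 1] to a hex color.

    Each RGB channel is interpolated linearly between `low` and `high`.
    Both endpoint colors are assumed values, not taken from the question.
    """
    # Clamp so that out-of-range inputs still produce a valid color.
    t = min(1.0, max(0.0, float(prediction)))
    channels = (round(lo + t * (hi - lo)) for lo, hi in zip(low, high))
    return '#' + ''.join(f'{c:02x}' for c in channels)
```

Calling float() on the input means a NumPy scalar such as the one returned by .numpy() above should be accepted as well.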

I build a new long string of HTML markup that includes the p tag. Now I want to replace the "old" element in the Beautiful Soup object. I do this with a simple loop:

for element in tqdm_notebook(soup.findAll()):
    if element.name == 'p':
        if len(element.text.split()) > 2: 
            element = get_colorized_p(element)

This does not raise any error, but when I render the HTML file, it is displayed without any marks.

I am using Jupyter to quickly display the HTML file:

from IPython.display import display, HTML

display(HTML(html_file))

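However, this does not work. I verified the string returned by get_colorized_p: when I use it on a single p element and render the result, it works fine. But I want to insert that string into the Beautiful Soup object.

I hope someone can help with this. The problem lies in replacing the element inside the loop, but I do not know how to fix it.

Here is an example of a rendered string: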

<p><mark style="background: #edf8fb;">Elephants are the largest existing land animals.</mark><mark style="background: #f1fafc;">Three living species are currently recognised: the African bush elephant, the African forest elephant, and the Asian elephant.</mark><mark style="background: #f3fafc;">They are an informal grouping within the proboscidean family Elephantidae.</mark><mark style="background: #f3fafc;">Elephantidae is the only surviving family of proboscideans; extinct members include the mastodons.</mark><mark style="background: #eff9fb;">Elephantidae also contains several extinct groups, including the mammoths and straight-tusked elephants.</mark><mark style="background: #68c3a6;">African elephants have larger ears and concave backs, whereas Asian elephants have smaller ears, and convex or level backs.</mark><mark style="background: #56ba91;">The distinctive features of all elephants include a long proboscis called a trunk, tusks, large ear flaps, massive legs, and tough but sensitive skin.</mark><mark style="background: #d4efec;">The trunk is used for breathing, bringing food and water to the mouth, and grasping objects.</mark><mark style="background: #e7f6f9;">Tusks, which are derived from the incisor teeth, serve both as weapons and as tools for moving objects and digging.</mark><mark style="background: #d9f1f0;">The large ear flaps assist in maintaining a constant body temperature as well as in communication.</mark><mark style="background: #e5f5f9;">The pillar-like legs carry their great weight.</mark><mark style="background: #72c7ad;"> </mark></p>

- Robert

2 Answers

1

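I like the idea and the color scheme. In my opinion, the main issue is that you are trying to replace the tag with a string, whereas you should use replace_with() to put a bs4 object into your soup and give it more flavor.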

for element in tqdm_notebook(soup.find_all()):
    if element.name == 'p':
        if len(element.text.split()) > 2: 
            element.replace_with(BeautifulSoup(get_colorized_p(element), 'html.parser'))

Convert your soup back to a string and try displaying it:

display(HTML(str(soup)))

In new code, avoid the old findAll() syntax and use find_all() instead. For more information, take a minute to check the docs.
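To illustrate why the markup has to be parsed before it goes into the soup, here is a minimal sketch comparing the two cases. The short HTML snippets and variable names are made up for the illustration. When replace_with() receives a plain string, Beautiful Soup inserts it as text, so the angle brackets are escaped on output and no mark tag exists in the tree.

```python
from bs4 import BeautifulSoup

# Illustrative markup, standing in for the output of get_colorized_p().
new_markup = '<p><mark style="background: #edf8fb;">Hi.</mark></p>'

# Case 1: pass the raw string. It becomes a text node, and the
# angle brackets are escaped when the soup is serialized.
soup_a = BeautifulSoup('<div><p>Hi.</p></div>', 'html.parser')
soup_a.p.replace_with(new_markup)
as_text = str(soup_a)

# Case 2: parse the string first, then insert the parsed result.
# The <mark> element is now a real tag in the tree.
soup_b = BeautifulSoup('<div><p>Hi.</p></div>', 'html.parser')
soup_b.p.replace_with(BeautifulSoup(new_markup, 'html.parser'))
as_tags = str(soup_b)

print(as_text)
print(as_tags)
```

In the first case the browser would show the literal markup as text; in the second it renders the highlighted sentence.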

Example

from bs4 import BeautifulSoup
from IPython.display import display, HTML

html = '''
    <p>Elephants are the largest ...</p>
'''
soup = BeautifulSoup(html, 'html.parser')

def get_colorized_p(element):
    ### processing and returning of result str
    return '<p><mark style="background: #edf8fb;">Elephants are the largest existing land animals.</mark><mark style="background: #f1fafc;">Three living species are currently recognised: the African bush elephant, the African forest elephant, and the Asian elephant.</mark><mark style="background: #f3fafc;">They are an informal grouping within the proboscidean family Elephantidae.</mark><mark style="background: #f3fafc;">Elephantidae is the only surviving family of proboscideans; extinct members include the mastodons.</mark><mark style="background: #eff9fb;">Elephantidae also contains several extinct groups, including the mammoths and straight-tusked elephants.</mark><mark style="background: #68c3a6;">African elephants have larger ears and concave backs, whereas Asian elephants have smaller ears, and convex or level backs.</mark><mark style="background: #56ba91;">The distinctive features of all elephants include a long proboscis called a trunk, tusks, large ear flaps, massive legs, and tough but sensitive skin.</mark><mark style="background: #d4efec;">The trunk is used for breathing, bringing food and water to the mouth, and grasping objects.</mark><mark style="background: #e7f6f9;">Tusks, which are derived from the incisor teeth, serve both as weapons and as tools for moving objects and digging.</mark><mark style="background: #d9f1f0;">The large ear flaps assist in maintaining a constant body temperature as well as in communication.</mark><mark style="background: #e5f5f9;">The pillar-like legs carry their great weight.</mark><mark style="background: #72c7ad;"> </mark></p>'

for element in soup.find_all():
    if element.name == 'p':
        if len(element.text.split()) > 2: 
            element.replace_with(BeautifulSoup(get_colorized_p(element), 'html.parser'))

display(HTML(str(soup)))

Not exactly the same, but very close to the behavior in your question: How to wrap the initial of every word with a <b> tag?

- HedgeHog

1

Sir, this is a great answer. Flawless. I did look at replace_with, but it never occurred to me to wrap the string as a BS4 object again! Thank you very much! - Robert

- Banana · Accepted Answer

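element = get_colorized_p(element) assigns to a local variable, but that variable is then either never used or overwritten by the for loop variable on the next iteration. You need to save the processed elements, for example by concatenating them into a string.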

html = ''
for element in tqdm_notebook(soup.findAll()):
    if element.name == 'p' and len(element.text.split()) > 2: 
        html += get_colorized_p(element)
    else:
        html += element.text

display(HTML(html))
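The rebinding behavior described in this answer is plain Python and not specific to Beautiful Soup. A minimal sketch with a list of strings (the names and values are illustrative only) shows that assigning to the loop variable leaves the underlying collection untouched, while collecting the results keeps them:

```python
paragraphs = ['first', 'second', 'third']

# Rebinding the loop variable only changes what the name `p` refers to;
# the list still holds the original strings afterwards.
for p in paragraphs:
    p = p.upper()
after_rebinding = list(paragraphs)

# Collecting the processed values into a new list keeps them.
collected = [p.upper() for p in paragraphs]

print(after_rebinding)
print(collected)
```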