How do I add a tag to each sentence and customize the mark color?

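I am using Beautiful Soup and Requests to load a website's HTML (for example https://en.wikipedia.org/wiki/Elephant). I want to reproduce the page, but with the sentences inside the "p" tags (paragraphs) colored.
To do this, I use spaCy to split the text into sentences. I pick a color for each sentence (for those interested, the color is based on the probability from a binary deep learning classifier).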
def get_colorized_p(p):
    
    doc = nlp(p.text) # p is the beautiful soup p tag
    string = '<p>'
    for sentence in doc.sents:
        # The prediction value is anywhere between 0 and 1.
        prediction = classify(sentence.text, model=model, pred_values=True)[1][1].numpy()
        # I am using a custom function to map the prediction to a hex colour.
        color = get_hexcolor(prediction)
        string += f'<mark style="background: {color};">{sentence.text} </mark> '
    string += '</p>'
    return string # I create a new long string with the markup
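
The get_hexcolor helper is not shown in the question. As a minimal sketch, assuming a plain linear blend between two RGB endpoints, a probability-to-color mapping could look something like the following. The name prob_to_hex and both endpoint colors are made up for illustration and are not the asker's actual mapping.

```python
def prob_to_hex(p, low=(237, 248, 251), high=(0, 109, 44)):
    """Map a probability in [0, 1] to a hex colour by linear interpolation.

    `low` and `high` are hypothetical RGB endpoints chosen for illustration.
    """
    # Clamp so that slightly out-of-range values still yield a valid colour.
    p = min(max(float(p), 0.0), 1.0)
    # Interpolate each RGB channel separately and round to an integer.
    rgb = [round(lo + (hi - lo) * p) for lo, hi in zip(low, high)]
    return '#{:02x}{:02x}{:02x}'.format(*rgb)
```

Because the input is passed through float(), a zero-dimensional NumPy value such as the one returned by .numpy() above should work as well.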

I build a new long string of HTML markup that includes the p tag. Now I want to replace the "old" element in the Beautiful Soup object. I do this with a simple loop:

for element in tqdm_notebook(soup.findAll()):
    if element.name == 'p':
        if len(element.text.split()) > 2: 
            element = get_colorized_p(element)

This does not raise any error, but when I render the HTML file, it is displayed without the marks.
I am using Jupyter to display the HTML file quickly:
from IPython.display import display, HTML

display(HTML(html_file))

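However, this does not work. I verified the string returned by get_colorized_p: when I use it on a single p element and render it, it works fine. But I want to insert that string into the Beautiful Soup object.

I hope someone can solve this. The problem occurs when replacing the element inside the loop, but I do not know how to fix it.

Here is an example of a rendered string: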

<p><mark style="background: #edf8fb;">Elephants are the largest existing land animals.</mark><mark style="background: #f1fafc;">Three living species are currently recognised: the African bush elephant, the African forest elephant, and the Asian elephant.</mark><mark style="background: #f3fafc;">They are an informal grouping within the proboscidean family Elephantidae.</mark><mark style="background: #f3fafc;">Elephantidae is the only surviving family of proboscideans; extinct members include the mastodons.</mark><mark style="background: #eff9fb;">Elephantidae also contains several extinct groups, including the mammoths and straight-tusked elephants.</mark><mark style="background: #68c3a6;">African elephants have larger ears and concave backs, whereas Asian elephants have smaller ears, and convex or level backs.</mark><mark style="background: #56ba91;">The distinctive features of all elephants include a long proboscis called a trunk, tusks, large ear flaps, massive legs, and tough but sensitive skin.</mark><mark style="background: #d4efec;">The trunk is used for breathing, bringing food and water to the mouth, and grasping objects.</mark><mark style="background: #e7f6f9;">Tusks, which are derived from the incisor teeth, serve both as weapons and as tools for moving objects and digging.</mark><mark style="background: #d9f1f0;">The large ear flaps assist in maintaining a constant body temperature as well as in communication.</mark><mark style="background: #e5f5f9;">The pillar-like legs carry their great weight.</mark><mark style="background: #72c7ad;"> </mark></p>
2 Answers

1

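element = get_colorized_p(element) only assigns to a local variable, which is then never used and is overwritten by the for loop variable on the next iteration. You need to save the processed elements, for example by concatenating them into a string.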

html = ''
for element in tqdm_notebook(soup.findAll()):
    if element.name == 'p' and len(element.text.split()) > 2: 
        html += get_colorized_p(element)
    else:
        html += element.text

display(HTML(html))
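
To see why the assignment in the question's loop has no visible effect, here is a small sketch (the HTML snippet and the "seen" class are made up for illustration). Rebinding the loop variable leaves the tree untouched, while changing something through the tag object does alter the soup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>one two three</p></div>', 'html.parser')
before = str(soup)

for element in soup.find_all('p'):
    # Rebinding: the name `element` now points at a str, the tree is untouched.
    element = '<p><mark>one two three</mark></p>'

after_rebinding = str(soup)  # identical to `before`

for element in soup.find_all('p'):
    # Mutation through the tag object: this does change the tree.
    element['class'] = 'seen'

after_mutation = str(soup)
```

So any fix has to either collect the new strings somewhere or modify the tree through the tag objects themselves.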

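I see your point. However, this discards half of the original file (which I do not want). I tried this approach and added the non-matching elements to the HTML string as str(element), but for some reason the <mark> tags were not rendered in that case; the original file was rendered. Maybe <mark> has to be declared at the top of the HTML file? I think I am not familiar enough with HTML. - Robert
I have now added handling of the non-matching strings to my code as well. However, I cannot test my code. If possible, could you provide a minimal reproducible example, your complete code (if it is short enough), or the output (print "html")? - Banana
I ran your code, but unfortunately it did not work: it did not show the <mark> tags. Still, thank you for your effort! @HedgeHog's answer solved the problem, and I have marked it as accepted. - Robert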

1

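I like the idea and the color scheme. In my opinion, the main problem is that you are trying to replace a tag with a string, whereas you should use replace_with() to put a bs4 object into your soup, which makes it a good deal tastier.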

for element in tqdm_notebook(soup.find_all()):
    if element.name == 'p':
        if len(element.text.split()) > 2: 
            element.replace_with(BeautifulSoup(get_colorized_p(element), 'html.parser'))

Convert your soup back to a string and try to display it:

display(HTML(str(soup)))

In new code, avoid the old syntax findAll() and use find_all() instead. For more information, take a minute to check the docs.

Example
from bs4 import BeautifulSoup
from IPython.display import display, HTML

html = '''
    <p>Elephants are the largest ...</p>
'''
soup = BeautifulSoup(html, 'html.parser')

def get_colorized_p(element):
    ### processing and returning of result str
    return '<p><mark style="background: #edf8fb;">Elephants are the largest existing land animals.</mark><mark style="background: #f1fafc;">Three living species are currently recognised: the African bush elephant, the African forest elephant, and the Asian elephant.</mark><mark style="background: #f3fafc;">They are an informal grouping within the proboscidean family Elephantidae.</mark><mark style="background: #f3fafc;">Elephantidae is the only surviving family of proboscideans; extinct members include the mastodons.</mark><mark style="background: #eff9fb;">Elephantidae also contains several extinct groups, including the mammoths and straight-tusked elephants.</mark><mark style="background: #68c3a6;">African elephants have larger ears and concave backs, whereas Asian elephants have smaller ears, and convex or level backs.</mark><mark style="background: #56ba91;">The distinctive features of all elephants include a long proboscis called a trunk, tusks, large ear flaps, massive legs, and tough but sensitive skin.</mark><mark style="background: #d4efec;">The trunk is used for breathing, bringing food and water to the mouth, and grasping objects.</mark><mark style="background: #e7f6f9;">Tusks, which are derived from the incisor teeth, serve both as weapons and as tools for moving objects and digging.</mark><mark style="background: #d9f1f0;">The large ear flaps assist in maintaining a constant body temperature as well as in communication.</mark><mark style="background: #e5f5f9;">The pillar-like legs carry their great weight.</mark><mark style="background: #72c7ad;"> </mark></p>'

for element in soup.find_all():
    if element.name == 'p':
        if len(element.text.split()) > 2: 
            element.replace_with(BeautifulSoup(get_colorized_p(element), 'html.parser'))

display(HTML(str(soup)))
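
As a possible variation, the replacement paragraph can also be assembled from tag objects with soup.new_tag() instead of parsing a markup string, which lets Beautiful Soup handle escaping of characters such as & or < in the sentence text. The sketch below is only an illustration: split_sentences and pick_color are hypothetical stand-ins for spaCy's doc.sents and for the classifier plus get_hexcolor, and, like the original approach, it drops inline markup (links, bold text) inside each paragraph.

```python
import re
from bs4 import BeautifulSoup

def split_sentences(text):
    # Hypothetical stand-in for spaCy: split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def pick_color(sentence):
    # Hypothetical stand-in for the classifier + get_hexcolor step.
    return '#edf8fb' if len(sentence) < 40 else '#68c3a6'

def colorize(soup, min_words=3):
    """Replace every sufficiently long <p> with a copy whose sentences are wrapped in <mark>."""
    for p in soup.find_all('p'):  # find_all returns a list, so replacing while looping is safe
        text = p.get_text()
        if len(text.split()) < min_words:
            continue
        new_p = soup.new_tag('p')
        for i, sentence in enumerate(split_sentences(text)):
            if i:
                new_p.append(' ')  # keep a space between consecutive sentences
            mark = soup.new_tag('mark', style=f'background: {pick_color(sentence)};')
            mark.string = sentence  # special characters are escaped on output
            new_p.append(mark)
        p.replace_with(new_p)
    return soup
```

Calling colorize(soup) and then display(HTML(str(soup))) would render the result in the same way as above.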

Not identical, but very close to the behavior in your question: How to wrap the first letter of each word with a <b> tag?


Sir, this is a great answer. Flawless. I did look at replace_with, but it never occurred to me to wrap the string as a BS4 object again! Thank you very much! - Robert
