Remove all inline styles using BeautifulSoup

Question

19

I'm doing some HTML cleaning with BeautifulSoup. I'm new to both Python and BeautifulSoup. Based on an answer I found on Stack Overflow, I have tags being removed correctly as follows:

[s.extract() for s in soup('script')]

But how do I remove inline styling? For instance, the following:

<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>
<img class="some_image" href="somewhere.com">

should become:

<p>Text</p>
<img href="somewhere.com">

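How do I delete the inline class, id, name and style attributes from all elements?

The answers I found to other, similar questions all mention using a CSS parser to handle this rather than BeautifulSoup. But since the task is simply to remove attributes rather than manipulate them, and it is a blanket rule for all tags, I was hoping to find a way to do it all within BeautifulSoup.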

- Ila

7 Answers

11

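I wouldn't do this in BeautifulSoup. You will spend a lot of time trying, testing, and working around edge cases.

Bleach was made for exactly this. http://pypi.python.org/pypi/bleach

If you do want to do this in BeautifulSoup, I suggest taking the "whitelist" approach that Bleach uses. Decide which tags may have which attributes, and strip every tag/attribute that doesn't match.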
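As an illustration of the whitelist idea (a minimal sketch, not part of the original answer and not a substitute for a real sanitizer such as Bleach), the following assumes BeautifulSoup 4 with the built-in "html.parser". The `ALLOWED` policy and the `sanitize` helper are hypothetical names chosen for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical policy: tag name -> set of attributes that tag may keep.
ALLOWED = {
    "p": set(),
    "a": {"href", "title"},
    "img": {"src", "alt"},
}

def sanitize(html):
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list of every tag, so the tree can be
    # modified safely while looping over it.
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED:
            tag.unwrap()  # remove the tag itself but keep its children
            continue
        # Copy the attribute names before deleting from the dict.
        for attr in list(tag.attrs):
            if attr not in ALLOWED[tag.name]:
                del tag.attrs[attr]
    return str(soup)

cleaned = sanitize(
    '<div><p class="x" style="color:red">Hi '
    '<a href="/a" onclick="evil()">link</a></p></div>'
)
print(cleaned)  # <p>Hi <a href="/a">link</a></p>
```

Note that unwrapping keeps the text inside a disallowed tag, so this sketch would leave the contents of a script element behind as plain text. That is one of the edge cases a dedicated library handles for you.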

- Jonathan Vanasco

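Nice, I didn't know about Bleach. I hadn't considered the use case, but if the goal is to sanitize untrusted HTML, this definitely seems like the better approach. Have my upvote! - jmk

Bleach is great, I really like it. - Jonathan Vanasco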

4

Here's my solution for Python3 and BeautifulSoup4:

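def remove_attrs(soup, whitelist=tuple()):
    # Visit every tag and drop each attribute not named in the whitelist
    for tag in soup.find_all(True):
        # Build the list first so tag.attrs is not modified while iterating
        for attr in [attr for attr in tag.attrs if attr not in whitelist]:
            del tag[attr]
    return soup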

It supports a whitelist of attributes that should be kept. If no whitelist is supplied, all attributes are removed.
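A usage sketch, run against the sample markup from the question. It repeats the function so the block is self-contained, and it assumes the built-in "html.parser" (the answer does not say which parser was used).

```python
from bs4 import BeautifulSoup

def remove_attrs(soup, whitelist=tuple()):
    for tag in soup.find_all(True):
        for attr in [attr for attr in tag.attrs if attr not in whitelist]:
            del tag[attr]
    return soup

html = (
    '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'
    '<img class="some_image" href="somewhere.com">'
)

# Keep only href, matching the output the question asks for.
kept_href = str(remove_attrs(BeautifulSoup(html, "html.parser"), whitelist=("href",)))

# No whitelist: every attribute on every tag is removed.
stripped = str(remove_attrs(BeautifulSoup(html, "html.parser")))

print(kept_href)
print(stripped)
```

The whitelist here applies to every tag alike. If different tags should keep different attributes, a per-tag mapping would be needed instead.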

- Klemen Tusar

2

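# Note: newer lxml releases ship this module separately as the
# lxml_html_clean package
from lxml.html.clean import Cleaner

# content is assumed to hold the HTML source as a string
content_without_styles = Cleaner(style=True).clean_html(content)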

- Mark Mishyn

1

Based on jmk's function, I use this function to remove attributes based on a whitelist:

Works with Python2 and BeautifulSoup3.

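def clean(tag, whitelist=[]):
    tag.attrs = None
    for e in tag.findAll(True):
        # BeautifulSoup3 stores attrs as (name, value) tuples; iterate over a
        # copy, because deleting while iterating would skip entries
        for attribute in list(e.attrs):
            if attribute[0] not in whitelist:
                del e[attribute[0]]
        #e.attrs = None     # delete all attributes
    return tag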

#example to keep only title and href
clean(soup,["title","href"])

- Z.J.

2

You shouldn't use a mutable structure as the default value of a function argument. - Can Bascil
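To make the comment's point concrete, here is a small standalone sketch of the pitfall. The function names `append_bad` and `append_good` are invented for this illustration and do not come from the thread.

```python
def append_bad(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # and is then shared by every call that omits `bucket`.
    bucket.append(item)
    return bucket

def append_good(item, bucket=None):
    # Use None as a sentinel and build a fresh list on each call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

first = append_bad("a")
second = append_bad("b")  # the same list object as `first`
print(second)             # ['a', 'b']

print(append_good("a"))   # ['a']
print(append_good("b"))   # ['b']
```

The clean() function above only reads its whitelist, so the shared default does no harm there, but an immutable default such as an empty tuple avoids the risk entirely.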

0

I did this with re and a regex.

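import re

def removeStyle(html):
    # Raw string avoids invalid escape sequences; matches a leading space,
    # "style=", and the double-quoted value that follows
    style = re.compile(r' style=.*?".*?"')
    return style.sub('', html)

html = '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'

print(removeStyle(html))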

Output: <p class="author" id="author_id" name="author_name">Text</p>

You can use this to remove any inline attribute by replacing "style" in the regex with the name of the attribute.
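A sketch of that generalization, using only the standard library. The helper name `remove_attribute` is hypothetical, and the pattern assumes attribute values are wrapped in single or double quotes.

```python
import re

def remove_attribute(html, name):
    # Match whitespace, the attribute name, "=", and a quoted value.
    # re.escape guards against regex metacharacters in the name.
    pattern = re.compile(
        r'\s+' + re.escape(name) + r'\s*=\s*("[^"]*"|\'[^\']*\')',
        re.IGNORECASE,
    )
    return pattern.sub('', html)

html = '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'
for name in ("class", "id", "name", "style"):
    html = remove_attribute(html, name)

print(html)  # <p>Text</p>
```

A regex cannot tell markup from text, so the same pattern would also delete a matching string that appears in the page's visible content. For anything beyond simple, trusted input, a parser-based approach is more robust.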

- Tony Bryant

0

Not perfect, but short:

' '.join([el.text for tag in soup for el in tag.findAllNext(whitelist)]);

- Radio Controlled

- jmk · Accepted Answer

You don't need to parse any CSS if you just want to remove it all. BeautifulSoup provides a way to remove entire attributes, like so:

for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]

Also, if you just want to delete entire tags (and their contents), you don't need extract(), which returns the tag. You can just use decompose():

[tag.decompose() for tag in soup("script")]

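It's not a big difference, just something else I found while looking through the docs. You can find more details about the API, along with many examples, in the BeautifulSoup documentation.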
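To show the extract() versus decompose() difference described above, here is a minimal sketch that assumes BeautifulSoup 4 with the built-in "html.parser". The sample markup is invented for this illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><script>var a = 1;</script><style>p {}</style><p>Text</p></div>',
    "html.parser",
)

# extract() detaches the tag from the tree and returns it, so the
# removed element can still be inspected or reinserted elsewhere.
removed = soup.script.extract()
print(removed.string)  # var a = 1;

# decompose() removes the tag and destroys it; nothing useful is returned.
soup.style.decompose()

print(soup)  # <div><p>Text</p></div>
```

When the removed elements are not needed afterwards, decompose() expresses that intent more directly than discarding the return value of extract().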