Python中如何从字符串中删除HTML标签

Question

Python中如何从字符串中删除HTML标签

363

from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
  print line

当我在 HTML 文件中打印一行时，我想找到一种只显示每个 HTML 元素的内容而不是格式本身的方法。如果它发现 '<a href="whatever.example">some text</a>'，它只会打印 'some text'，'<b>hello</b>' 打印 'hello' 等等。如何实现这样的功能？

- directedition

18

处理HTML实体（例如&）是一个重要的考虑因素。你可以选择：1）删除它们和标记（通常不可取，因为它们等同于纯文本），2）保持它们不变（如果被剥离的文本将要回到HTML环境中，则是一种合适的解决方案），或 3）将它们解码为纯文本（如果被剥离的文本将要进入数据库或其他非HTML环境中，或者如果你的网页框架自动对文本进行HTML转义）。 - Søren Løvborg

2

针对@SørenLøvborg的第二点建议，请参考以下链接：https://dev59.com/onRA5IYBdhLWcg3w_DLF - Robert

5

这里的最佳答案曾被 Django 项目使用直至 2014 年 3 月，但已经发现存在跨站脚本攻击漏洞。点击链接可查看一个能够成功攻击的例子。我建议使用 Bleach.clean()、Markupsafe 的 striptags 或 RECENT Django 的 strip_tags。 - rescdsk

28个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- runawaykid · Answer 1

美丽汤包能够立即为您完成这项操作。

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
text = soup.get_text()
print(text)

- John Loutzenhiser · Answer 2

这是我为Python 3提供的解决方案。

import html
import re

def html_to_txt(html_text):
    ## unescape html
    txt = html.unescape(html_text)
    tags = re.findall("<[^>]+>",txt)
    print("found tags: ")
    print(tags)
    for tag in tags:
        txt=txt.replace(tag,'')
    return txt

不确定是否完美，但解决了我的使用情况并且看起来很简单。

- Richard · Answer 3

以下是翻译：

这里提供了一种类似于目前被接受的答案（https://dev59.com/onRA5IYBdhLWcg3w_DLF#925630）的解决方案，但它直接使用内部的 HTMLParser 类（即不需要子类化），因此更加简洁：

def strip_html(text):
    parts = []                                                                      
    parser = HTMLParser()                                                           
    parser.handle_data = parts.append                                               
    parser.feed(text)                                                               
    return ''.join(parts)

- mousetail · Answer 4

有一个项目，我需要去除HTML，还要去除CSS和JS。因此，我修改了Eloff的答案：

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.fed = []
        self.css = False
    def handle_starttag(self, tag, attrs):
        if tag == "style" or tag=="script":
            self.css = True
    def handle_endtag(self, tag):
        if tag=="style" or tag=="script":
            self.css=False
    def handle_data(self, d):
        if not self.css:
            self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

- Jason Coon · Answer 5

您可以使用不同的HTML解析器（例如lxml或Beautiful Soup），其中提供了提取纯文本的功能。或者，您可以在行字符串上运行正则表达式以剥离标签。请参阅Python文档获取更多信息。

- MilesNielsen · Answer 6

我已经成功地使用了Eloff的答案来处理Python 3.1 [非常感谢！]。

我升级到了Python 3.2.3，但遇到了错误。

提供解决方案的回答者Thomas K在这里提供了帮助，需要在以下代码中插入super().__init__()：

def __init__(self):
    self.reset()
    self.fed = []

... 为了使其看起来像这样：

def __init__(self):
    super().__init__()
    self.reset()
    self.fed = []

...并且它将适用于Python 3.2.3。

再次感谢Thomas K提供的修复和Eloff上面提供的原始代码！

- Falk Nisius · Answer 7

使用HTML-Parser的解决方案只能运行一次，且存在漏洞。

html_to_text('<<b>script>alert("hacked")<</b>/script>

结果为：

<script>alert("hacked")</script>

你想要防止什么。如果你使用HTML解析器，请计算标签数量，直到替换为零：

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
        self.containstags = False

    def handle_starttag(self, tag, attrs):
       self.containstags = True

    def handle_data(self, d):
        self.fed.append(d)

    def has_tags(self):
        return self.containstags

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    must_filtered = True
    while ( must_filtered ):
        s = MLStripper()
        s.feed(html)
        html = s.get_data()
        must_filtered = s.has_tags()
    return html

- kiran Mohan · Answer 8

这是一个快速的修复方法，可以进行更多的优化，但它可以正常工作。此代码将使用 "" 替换所有非空标签，并从给定的输入文本中剥离所有 HTML 标签。您可以使用 ./file.py input output 来运行它。

    #!/usr/bin/python
import sys

def replace(strng,replaceText):
    rpl = 0
    while rpl > -1:
        rpl = strng.find(replaceText)
        if rpl != -1:
            strng = strng[0:rpl] + strng[rpl + len(replaceText):]
    return strng


lessThanPos = -1
count = 0
listOf = []

try:
    #write File
    writeto = open(sys.argv[2],'w')

    #read file and store it in list
    f = open(sys.argv[1],'r')
    for readLine in f.readlines():
        listOf.append(readLine)         
    f.close()

    #remove all tags  
    for line in listOf:
        count = 0;  
        lessThanPos = -1  
        lineTemp =  line

            for char in lineTemp:

            if char == "<":
                lessThanPos = count
            if char == ">":
                if lessThanPos > -1:
                    if line[lessThanPos:count + 1] != '<>':
                        lineTemp = replace(lineTemp,line[lessThanPos:count + 1])
                        lessThanPos = -1
            count = count + 1
        lineTemp = lineTemp.replace("&lt","<")
        lineTemp = lineTemp.replace("&gt",">")                  
        writeto.write(lineTemp)  
    writeto.close() 
    print "Write To --- >" , sys.argv[2]
except:
    print "Help: invalid arguments or exception"
    print "Usage : ",sys.argv[0]," inputfile outputfile"

- PirateApp · Answer 9

2020更新

使用 Mozilla漂白库，它确实可以让您自定义保留哪些标签和哪些属性，并根据值过滤掉属性。

以下是两个示例：

1）不允许任何HTML标签或属性

取样本原始文本

raw_text = """
<p><img width="696" height="392" src="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg" class="attachment-medium_large size-medium_large wp-post-image" alt="Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC" style="float:left; margin:0 15px 15px 0;" srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w" sizes="(max-width: 696px) 100vw, 696px" />Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://news.bitcoin.com/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc/">Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC</a> appeared first on <a rel="nofollow" href="https://news.bitcoin.com">Bitcoin News</a>.</p> 
"""

2) 从原始文本中删除所有HTML标签和属性

# DO NOT ALLOW any tags or any attributes
from bleach.sanitizer import Cleaner
cleaner = Cleaner(tags=[], attributes={}, styles=[], protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))

输出

Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News.

3 只允许带有srcset属性的img标签

from bleach.sanitizer import Cleaner
# ALLOW ONLY img tags with src attribute
cleaner = Cleaner(tags=['img'], attributes={'img': ['srcset']}, styles=[], protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))

输出

<img srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w">Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News.

- CpILL · Answer 10

一种 Python 3 的适配版本，基于 søren-løvborg 的答案。保留 HTML 格式，不添加解释。

from html.parser import HTMLParser
from html.entities import html5

class HTMLTextExtractor(HTMLParser):
    """ Adaption of https://dev59.com/onRA5IYBdhLWcg3w_DLF#7778368 """
    def __init__(self):
        super().__init__()
        self.result = []

    def handle_data(self, d):
        self.result.append(d)

    def handle_charref(self, number):
        codepoint = int(number[1:], 16) if number[0] in (u'x', u'X') else int(number)
        self.result.append(unichr(codepoint))

    def handle_entityref(self, name):
        if name in html5:
            self.result.append(unichr(html5[name]))

    def get_text(self):
        return u''.join(self.result)

def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()