How to strip HTML tags from a string in Python

363
from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
  print(line)
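When printing a line from an HTML file, I'm trying to find a way to show only the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.example">some text</a>', it should print only 'some text'; '<b>hello</b>' should print 'hello', and so on. How would one go about doing this?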

18
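An important consideration is how to handle HTML entities (e.g. &amp;). You can either 1) remove them along with the tags (often undesirable, since they are equivalent to plain text), 2) leave them unchanged (a suitable solution if the stripped text is going right back into an HTML context), or 3) decode them to plain text (if the stripped text is going into a database or some other non-HTML context, or if your web framework automatically HTML-escapes text for you). - Søren Løvborg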
2
Regarding @SørenLøvborg's point 2), see: https://dev59.com/onRA5IYBdhLWcg3w_DLF - Robert
5
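The top answer here was used by the Django project until March 2014, and it has been found to be vulnerable to cross-site scripting; see the link for an example that gets through. I recommend using Bleach.clean(), Markupsafe's striptags, or a RECENT Django's strip_tags. - rescdsk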
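The three entity-handling options from Søren Løvborg's comment can be sketched with the standard library alone. This is only an illustration: the function names and the naive TAG_RE pattern are assumptions made for this example, and a regex is not a robust way to remove tags from untrusted input.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # naive tag pattern, for illustration only

def strip_drop_entities(markup):
    """Option 1: remove tags and entities alike (usually undesirable)."""
    return re.sub(r"&#?\w+;", "", TAG_RE.sub("", markup))

def strip_keep_entities(markup):
    """Option 2: remove tags, leave entities such as &amp; untouched."""
    return TAG_RE.sub("", markup)

def strip_decode_entities(markup):
    """Option 3: remove tags, then decode entities to plain text."""
    return html.unescape(TAG_RE.sub("", markup))

sample = "<b>Fish &amp; chips</b>"
print(strip_drop_entities(sample))    # Fish  chips
print(strip_keep_entities(sample))    # Fish &amp; chips
print(strip_decode_entities(sample))  # Fish & chips
```

Option 2 is the one to pick when the result goes back into an HTML page, and option 3 when it goes to a database or a plain-text file.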
28 Answers

8
The Beautiful Soup package does this for you out of the box.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
text = soup.get_text()
print(text)

4
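May I ask you to add some more context to your answer? Code-only answers are difficult to understand. Adding more information to your post will help both the asker and future readers. - help-info.de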

3
Here is my solution for Python 3.
import html
import re

def html_to_txt(html_text):
    # Strip tags first, then decode entities, so that escaped text such as
    # "&lt;b&gt;" is not mistaken for a real tag and removed.
    txt = re.sub(r"<[^>]+>", "", html_text)
    return html.unescape(txt)

I'm not sure it is perfect, but it solved my use case and seems simple.


3
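Here is a solution similar to the currently accepted answer (https://dev59.com/onRA5IYBdhLWcg3w_DLF#925630), except that it uses the internal HTMLParser class directly (i.e. no subclassing), which makes it considerably more terse: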
from html.parser import HTMLParser

def strip_html(text):
    parts = []
    parser = HTMLParser()
    # Route every text node straight into the list
    parser.handle_data = parts.append
    parser.feed(text)
    return ''.join(parts)

2

For one project I needed to strip not only HTML but also CSS and JS, so I made a variation of Eloff's answer:

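from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Let the base class set up the parser state (required on Python 3.2+)
        super().__init__(convert_charrefs=True)
        self.fed = []
        self.css = False  # True while inside a <style> or <script> element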
    def handle_starttag(self, tag, attrs):
        if tag == "style" or tag=="script":
            self.css = True
    def handle_endtag(self, tag):
        if tag=="style" or tag=="script":
            self.css=False
    def handle_data(self, d):
        if not self.css:
            self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

2
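You can use a different HTML parser (such as lxml or Beautiful Soup), one that offers functions to extract just the text. Alternatively, you can run a regular expression on your line string to strip out the tags. See the Python docs for more.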
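The regular-expression route mentioned above might look roughly like this. It is a minimal sketch: the name strip_tags_regex and the pattern are assumptions for illustration, not something taken from the Python docs.

```python
import re

# Naive pattern: "<", then one or more characters that are not ">", then ">".
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags_regex(line):
    """Remove anything that looks like a tag from one line of markup."""
    return TAG_RE.sub("", line)

lines = ['<a href="whatever.example">some text</a>', "<b>hello</b>"]
for line in lines:
    print(strip_tags_regex(line))
```

This handles the two cases from the question, but it is easy to break: a ">" inside an attribute value, as in '<a title="a>b">x</a>', ends the match early and leaves 'b">x' behind. A real parser is the safer choice for anything beyond simple, trusted markup.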

1
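The AMK link is dead. Is there an alternative? - user1228
2
The Python website has good how-tos now; here is the regex how-to: http://docs.python.org/howto/regex - Jason Coon
7
In lxml, lxml.html.fromstring(s).text_content() parses the HTML string s into a tree and extracts all of its text content. - Bluu
2
Bluu's lxml example decodes HTML entities (e.g. &amp;) to text. - Søren Løvborg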

1

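I have used Eloff's answer successfully with Python 3.1 [many thanks!].

I upgraded to Python 3.2.3 and ran into errors.

The fix, kindly provided by the responder Thomas K, is to insert super().__init__() into the following code: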

def __init__(self):
    self.reset()
    self.fed = []

... so that it looks like this:

def __init__(self):
    super().__init__()
    self.reset()
    self.fed = []

... and it will work on Python 3.2.3.

Again, thanks to Thomas K for the fix and to Eloff for the original code provided above!
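Eloff's original answer is not reproduced in this thread excerpt, so the following is only a sketch of what a complete Python 3 version with the fix applied might look like. The handle_data, get_data and strip_tags pieces are assumed from the variants quoted in other answers here.

```python
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # sets up the parser state that feed() relies on
        self.fed = []

    def handle_data(self, d):
        # Called for every run of text between tags
        self.fed.append(d)

    def get_data(self):
        return "".join(self.fed)

def strip_tags(markup):
    s = MLStripper()
    s.feed(markup)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()

print(strip_tags('<a href="whatever.example">some text</a>'))  # some text
```

Because super().__init__() already calls reset(), the explicit self.reset() line becomes redundant, although keeping it is harmless.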


1
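The HTMLParser-based solutions are all breakable if they run only once:
html_to_text('<<b>script>alert("hacked")<</b>/script>')

results in:
<script>alert("hacked")</script>

which is exactly what you want to prevent. If you use an HTML parser, keep stripping and counting the tags until none are left to replace: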

from html.parser import HTMLParser  # Python 3 location of the module

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fed = []
        self.containstags = False

    def handle_starttag(self, tag, attrs):
        self.containstags = True

    def handle_data(self, d):
        self.fed.append(d)

    def has_tags(self):
        return self.containstags

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    must_filtered = True
    while ( must_filtered ):
        s = MLStripper()
        s.feed(html)
        html = s.get_data()
        must_filtered = s.has_tags()
    return html

1
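If you call a function named html_to_text and embed the text it returns into HTML without escaping it, then the missing escaping is the security vulnerability, not the html_to_text function. The html_to_text function never promised you that the output would be text. Inserting text into HTML without escaping it is a potential security vulnerability regardless of whether you got the text from html_to_text or from some other source. - kasperd
You are right in that case, if the escaping is missing, but the question was about stripping HTML from a given string, not about escaping a given string. If the earlier answers build new HTML as a result of removing some HTML, then using those solutions is dangerous. - Falk Nisius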
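kasperd's point can be illustrated with a short sketch. The html_to_text below is a hypothetical single-pass stripper written for this example (it is not the function from any particular answer); html.escape is the standard-library call for escaping text before it is embedded in HTML.

```python
import html
from html.parser import HTMLParser

def html_to_text(markup):
    """Hypothetical single-pass stripper: collects only the text nodes."""
    parts = []
    parser = HTMLParser()
    parser.handle_data = parts.append
    parser.feed(markup)
    parser.close()
    return "".join(parts)

stripped = html_to_text('<<b>script>alert("hacked")<</b>/script>')
print(stripped)               # may still look like markup after one pass
print(html.escape(stripped))  # escaped form, safe to embed in an HTML page
```

Whatever the stripper returns, escaping at the point where the text is written into a page means any leftover "<" or ">" is displayed as text instead of being interpreted as a tag.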

1

This is a quick fix and could be optimized further, but it works. The code replaces all non-empty tags with "" and strips all HTML tags from the given input text. You can run it with ./file.py input output.

#!/usr/bin/python
import sys

def replace(strng, replaceText):
    # Remove every occurrence of replaceText from strng
    rpl = 0
    while rpl > -1:
        rpl = strng.find(replaceText)
        if rpl != -1:
            strng = strng[0:rpl] + strng[rpl + len(replaceText):]
    return strng


lessThanPos = -1
count = 0
listOf = []

try:
    # write file
    writeto = open(sys.argv[2], 'w')

    # read file and store it in list
    f = open(sys.argv[1], 'r')
    for readLine in f.readlines():
        listOf.append(readLine)
    f.close()

    # remove all tags
    for line in listOf:
        count = 0
        lessThanPos = -1
        lineTemp = line

        # count indexes into the original line; lineTemp holds the result
        for char in line:
            if char == "<":
                lessThanPos = count
            if char == ">":
                if lessThanPos > -1:
                    if line[lessThanPos:count + 1] != '<>':
                        lineTemp = replace(lineTemp, line[lessThanPos:count + 1])
                        lessThanPos = -1
            count = count + 1
        # include the trailing ';' so no stray semicolon is left behind
        lineTemp = lineTemp.replace("&lt;", "<")
        lineTemp = lineTemp.replace("&gt;", ">")
        writeto.write(lineTemp)
    writeto.close()
    print("Write To --- >", sys.argv[2])
except Exception:
    print("Help: invalid arguments or exception")
    print("Usage : ", sys.argv[0], " inputfile outputfile")

1

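2020 Update

Use the Mozilla Bleach library. It lets you customize which tags and which attributes to keep, and also filter out attributes based on their values.

Here are 2 cases to illustrate:

1) Do not allow any HTML tags or attributes

Take some sample raw text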

raw_text = """
<p><img width="696" height="392" src="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg" class="attachment-medium_large size-medium_large wp-post-image" alt="Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC" style="float:left; margin:0 15px 15px 0;" srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, 
https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w" sizes="(max-width: 696px) 100vw, 696px" />Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://news.bitcoin.com/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc/">Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC</a> appeared first on <a rel="nofollow" href="https://news.bitcoin.com">Bitcoin News</a>.</p> 
"""

2) Remove all HTML tags and attributes from the raw text

# DO NOT ALLOW any tags or any attributes
# (the styles argument is omitted: it was removed in Bleach 5.0)
from bleach.sanitizer import Cleaner
cleaner = Cleaner(tags=[], attributes={}, protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))

Output

Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News. 

3) Allow only the img tag with the srcset attribute

from bleach.sanitizer import Cleaner
# ALLOW ONLY img tags with the srcset attribute
# (the styles argument is omitted: it was removed in Bleach 5.0)
cleaner = Cleaner(tags=['img'], attributes={'img': ['srcset']}, protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))

Output

<img srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w">Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. 
Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News. 
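Note that the Bleach output above still contains numeric character references such as &#8217;. If the cleaned text is headed for a non-HTML context, one possible follow-up step (an addition for illustration, not part of the original answer) is to decode them with the standard library:

```python
import html

# Fragment copied from the Bleach output above
cleaned = "the platform&#8217;s users. Also as part [&#8230;]"

# html.unescape turns character references into the characters they stand for
print(html.unescape(cleaned))
```

Skip this step if the text is going back into an HTML page, where the references should stay escaped.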

1
A Python 3 adaptation of søren-løvborg's answer.
from html.parser import HTMLParser
from html.entities import html5

class HTMLTextExtractor(HTMLParser):
    """ Adaption of https://dev59.com/onRA5IYBdhLWcg3w_DLF#7778368 """
    def __init__(self):
        # convert_charrefs=False so that the two reference handlers below
        # are actually called instead of the parser decoding by itself
        super().__init__(convert_charrefs=False)
        self.result = []

    def handle_data(self, d):
        self.result.append(d)

    def handle_charref(self, number):
        # Numeric reference, decimal (&#65;) or hexadecimal (&#x41;)
        codepoint = int(number[1:], 16) if number[0] in ('x', 'X') else int(number)
        self.result.append(chr(codepoint))

    def handle_entityref(self, name):
        # Keys in html5 end with ';' (e.g. 'amp;') and map to str values
        key = name + ';'
        if key in html5:
            self.result.append(html5[key])

    def get_text(self):
        return ''.join(self.result)

def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    s.close()  # flush anything still buffered
    return s.get_text()

Content provided by Stack Overflow.