How do I perform HTML decoding/encoding using Python/Django?

Question

How do I perform HTML decoding/encoding using Python/Django?

170

I have a string that is HTML-encoded:

'''&lt;img class=&quot;size-medium wp-image-113&quot;\
 style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot;\
 src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot;\
 alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'''

I want to change that to:

<img class="size-medium wp-image-113" style="margin-left: 15px;" 
  title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" 
  alt="" width="300" height="194" />

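I want this to register as HTML so that the browser renders it as an image instead of displaying it as text.

The string is stored like that because I am using a web-scraping tool called BeautifulSoup. It "scans" a web page, extracts certain content from it, and returns the string in that format.

I've found how to do this in C# but not in Python. Can someone help me out?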

Related

Convert XML/HTML Entities into Unicode String in Python

- rksprst

15 Answers

142

Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;')
                     .replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))
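
For readers without Django at hand, here is a minimal plain-Python sketch of the same replacement chain. The name escape_sketch is made up for illustration, and the Django-specific mark_safe and force_unicode calls are deliberately left out, so this only approximates the function above:

```python
def escape_sketch(text):
    """Plain-Python approximation of the escape() shown above (no Django)."""
    # '&' has to be replaced first; otherwise the '&' characters introduced
    # by the later replacements would themselves be escaped a second time.
    return (
        str(text)
        .replace('&', '&amp;')
        .replace('<', '&lt;')
        .replace('>', '&gt;')
        .replace('"', '&quot;')
        .replace("'", '&#39;')
    )


print(escape_sketch('<a href="x">Tom & Jerry\'s</a>'))
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#39;s&lt;/a&gt;
```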

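To reverse this, the Cheetah function described in Jake's answer should work, but it is missing the single quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetric problems: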

def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
            ("'", '&#39;'),
            ('"', '&quot;'),
            ('>', '&gt;'),
            ('<', '&lt;'),
            ('&', '&amp;')
        )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s

unescaped = html_decode(my_string)
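
To see why the replacement order matters, here is a small sketch. The helper decode_with and the PAIRS tuple are illustrative names, not part of Django or Cheetah; the pairs are the same ones used in html_decode above:

```python
# Same (character, entity) pairs as in html_decode above, '&amp;' last.
PAIRS = (
    ("'", '&#39;'),
    ('"', '&quot;'),
    ('>', '&gt;'),
    ('<', '&lt;'),
    ('&', '&amp;'),
)


def decode_with(pairs, s):
    """Apply the replacements in the given order."""
    for char, entity in pairs:
        s = s.replace(entity, char)
    return s


# This is the escaped form of the literal text "&lt;b&gt;".
text = '&amp;lt;b&amp;gt;'

print(decode_with(PAIRS, text))                   # &lt;b&gt;  ('&amp;' decoded last)
print(decode_with(tuple(reversed(PAIRS)), text))  # <b>        ('&amp;' decoded first: one level too many)
```

Decoding '&amp;' first produces fresh '&lt;' and '&gt;' sequences that the later replacements then decode again, which is the symmetric problem mentioned above.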

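This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library: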

# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.0-3.8 (HTMLParser.unescape is deprecated since 3.4, removed in 3.9):
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.4+:
from html import unescape
unescaped = unescape(my_string)
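
As a quick check of the last variant, here is a sketch that runs html.unescape on a shortened version of the string from the question (some attributes are dropped for brevity; Python 3.4+ is assumed):

```python
from html import escape, unescape

# Shortened form of the encoded string from the question.
encoded = (
    '&lt;img class=&quot;size-medium wp-image-113&quot; '
    'alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'
)

decoded = unescape(encoded)
print(decoded)
# <img class="size-medium wp-image-113" alt="" width="300" height="194" />

# Escaping the result again gives back the original encoded string.
print(escape(decoded) == encoded)  # True
```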

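As a suggestion: it may make more sense to store the HTML unescaped in your database. It would be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.

With Django, escaping only occurs during template rendering; so to prevent escaping, you just tell the templating engine not to escape your string. To do that, use one of these options in your template: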

{{ context_var|safe }}
{% autoescape off %}
    {{ context_var }}
{% endautoescape %}

- Daniel

1

Why not use the Django or Cheetah function? - Mat

4

Is there no opposite of django.utils.html.escape? - Mat

13

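I think escaping only occurs in Django during template rendering. Therefore, there's no need for an unescape - you just tell the templating engine not to escape. Use either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}. - Daniel Naab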

3

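@Daniel: Please change your comment to an answer so that I can vote it up! "safe" is exactly what I (and, I'm sure, others) was looking for in answer to this question. - Wayne Koorts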

2

html.parser.HTMLParser().unescape() is deprecated in 3.5. Use html.unescape() instead. - pjvandehaar

80

For HTML encoding, there's cgi.escape from the standard library:

>>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
    Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.
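
On Python 3, the counterpart is html.escape, whose quote parameter defaults to True. A short sketch of the difference (the sample string is made up for illustration):

```python
import html

sample = '<a href="x">Tom & Jerry\'s</a>'

# Default quote=True: double and single quotes are escaped as well.
print(html.escape(sample))
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#x27;s&lt;/a&gt;

# quote=False: only &, < and > are escaped, like the old cgi.escape default.
print(html.escape(sample, quote=False))
# &lt;a href="x"&gt;Tom &amp; Jerry's&lt;/a&gt;
```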

For HTML decoding, I use the following:

import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39

def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
              lambda m: unichr(name2codepoint[m.group(1)]), s)
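
The snippet above targets Python 2 (htmlentitydefs, unichr). A rough Python 3 port might look like the sketch below. The name unescape_named is hypothetical, and the sketch only handles named entities, so numeric references such as &#39; are left untouched rather than patched into the table:

```python
import re
from html.entities import name2codepoint

# Build the alternation once. re.escape is only defensive here, since
# entity names consist of letters and digits.
_NAMED_ENTITY = re.compile('&(%s);' % '|'.join(map(re.escape, name2codepoint)))


def unescape_named(s):
    """Replace named entities such as &eacute; with their characters."""
    return _NAMED_ENTITY.sub(lambda m: chr(name2codepoint[m.group(1)]), s)


print(unescape_named('Sacr&eacute; bleu &amp; &lt;b&gt;'))
# Sacré bleu & <b>
```

Because re.sub makes a single left-to-right pass, '&amp;lt;' becomes '&lt;' and is not decoded a second time.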

For anything more complicated, I use BeautifulSoup.

- user26294

1

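From the Python docs: "Deprecated since version 3.2: This function is unsafe because quote is false by default, and therefore deprecated. Use html.escape() instead." As of 3.9, and possibly earlier, it no longer exists. - Mike Gleen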

20

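If the set of encoded characters is relatively restricted, use Daniel's solution. Otherwise, use one of the numerous HTML-parsing libraries.

I like BeautifulSoup because it can handle malformed XML/HTML: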

http://www.crummy.com/software/BeautifulSoup/

As for your question, there is an example in their documentation:

from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!", 
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'

- vincent

BeautifulSoup doesn't convert hex entities (such as &#x65;). https://dev59.com/MHVD5IYBdhLWcg3wL4mM#57745 - jfs

2

For BeautifulSoup4, the equivalent would be: from bs4 import BeautifulSoup; BeautifulSoup("Sacr&eacute; bl&#101;u!").contents[0] - radicand

17

With Python 3.4+:

import html

html.unescape(your_string)
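
As a small sketch of what that call covers: html.unescape handles named, decimal, and hexadecimal references alike. The sample strings below are chosen for illustration:

```python
import html

# Named and decimal references in one string.
print(html.unescape('Sacr&eacute; bl&#101;u!'))  # Sacré bleu!

# A hexadecimal reference for the same letter "e".
print(html.unescape('&#x65;'))  # e

# Three spellings of the non-breaking space decode to the same character.
forms = ['&nbsp;', '&#160;', '&#xA0;']
print({html.unescape(f) for f in forms} == {'\xa0'})  # True
```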

- Collin Anderson

1

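You saved my day. I spent hours looking for that answer. I had saved text with German umlauts to a file and did not know how to convert them back. Now everything works perfectly. Import the html module and unescape like this: import html; html.unescape('Klima&auml;nderungen') turns 'Klima&auml;nderungen' into 'Klimaänderungen'. - Дмитро Олександрович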

1

Jiangge Zhang already gave this answer in 2011. - mike rodent

7

At the bottom of the page on the Python wiki, there are at least two options for "unescaping" HTML.

- zgoda

7

In case anyone is looking for a simple way to do this via Django templates, you can use a filter like this:

<html>
{{ node.description|safe }}
</html>

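I had some data coming from a vendor, and everything I posted showed the HTML tags written out on the rendered page, as if you were looking at the source.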

- Chris Harty

Thank God. This solution works with Flask too! - Tyler Xue

6

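Daniel's comment, as an answer:

"Escaping only occurs in Django during template rendering. Therefore, there's no need for an unescape - you just tell the templating engine not to escape. Use either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}."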

- dfrankow

Works, except that my version of Django doesn't have 'safe'. I use 'escape' instead. I assume they are the same thing. - willem

1

@willem: they are the opposite! - Asherah

5

I found a nice function here: http://snippets.dzone.com/posts/show/4569

def decodeHtmlentities(string):
    import re
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")

    def substitute_entity(match):
        from htmlentitydefs import name2codepoint as n2cp
        ent = match.group(2)
        if match.group(1) == "#":
            return unichr(int(ent))
        else:
            cp = n2cp.get(ent)

            if cp:
                return unichr(cp)
            else:
                return match.group()

    return entity_re.subn(substitute_entity, string)[0]
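
The function above is Python 2 code and does not cover hexadecimal references. A possible Python 3 adaptation is sketched below; the name decode_html_entities and the widened pattern are my own assumptions, not part of the linked snippet:

```python
import re
from html.entities import name2codepoint

# Hexadecimal (&#xA0;), decimal (&#160;) and named (&nbsp;) references.
_ENTITY = re.compile(r'&(?:#[xX]([0-9a-fA-F]{1,6})|#(\d{1,7})|(\w{1,8}));')


def decode_html_entities(text):
    """Decode named, decimal and hexadecimal references (sketch)."""
    def substitute(match):
        hex_digits, dec_digits, name = match.groups()
        try:
            if hex_digits is not None:
                return chr(int(hex_digits, 16))
            if dec_digits is not None:
                return chr(int(dec_digits))
        except ValueError:
            # Code point outside the Unicode range: leave the text as is.
            return match.group()
        codepoint = name2codepoint.get(name)
        # Unknown names are returned unchanged, as in the original.
        return chr(codepoint) if codepoint is not None else match.group()

    return _ENTITY.sub(substitute, text)


print(decode_html_entities('&lt;b&gt;It&#039;s&#x20;fine&lt;/b&gt;'))
# <b>It's fine</b>
```

On Python 3.4+, html.unescape from the standard library already does this and more, so a hand-written version like this is mainly of interest for understanding the approach.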

- slowkvant

The benefit of using re is that you can match both &#039; and &#39; with the same search. - Neal Stublen

This doesn't handle &#xA0;, which should decode to the same thing as &#160; and &nbsp;. - Mike Samuel

3

Even though this is a really old question, this may work.

Django 1.5.5

In [1]: from django.utils.text import unescape_entities
In [2]: unescape_entities('&lt;img class=&quot;size-medium wp-image-113&quot; style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;')
Out[2]: u'<img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />'

- James

1

This is the only one that was able to decode surrogate pairs encoded as HTML entities. After one more pass through "result.encode('utf-16', 'surrogatepass').decode('utf-16')", I finally had the original data back. - rescdsk
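
The re-encoding step from that comment can be sketched in Python 3 as follows. The two code points are a made-up example of a surrogate pair, standing in for whatever a decoder produced from two separate numeric references:

```python
# Two separate surrogate code points (hypothetical example values), as a
# decoder might produce when it turns each half of a pair into its own character.
split = chr(0xD83D) + chr(0xDE00)

# Round-tripping through UTF-16 with 'surrogatepass' joins the two halves
# into the single code point they represent.
joined = split.encode('utf-16', 'surrogatepass').decode('utf-16')

print(len(split), len(joined))   # 2 1
print(joined == '\U0001F600')    # True
```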

- Jiangge Zhang · Accepted Answer

With the standard library:

HTML escape

try:
    from html import escape  # python 3.x
except ImportError:
    from cgi import escape  # python 2.x

print(escape("<"))

HTML unescape

try:
    from html import unescape  # python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # python 3.x (<3.4)
    except ImportError:
        from HTMLParser import HTMLParser  # python 2.x
    unescape = HTMLParser().unescape

print(unescape("&gt;"))