Escaping special HTML characters in Python

28

I have a string that contains special characters such as ' or " or & (...). In the string:

string = """ Hello "XYZ" this 'is' a test & so on """

How can I automatically escape every special character, so that I get this:

string = " Hello &quot;XYZ&quot; this &#39;is&#39; a test &amp; so on "

4 Answers

53
In Python 3.2 and later, you can use the html.escape function, e.g.:
>>> string = """ Hello "XYZ" this 'is' a test & so on """
>>> import html
>>> html.escape(string)
' Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on '

For earlier versions of Python, see http://wiki.python.org/moin/EscapingHtml
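
As a minimal sketch of how html.escape behaves with and without its quote parameter (the variable names are illustrative and not part of the answer):

```python
import html

s = """ Hello "XYZ" this 'is' a test & so on """

# Default (quote=True): &, <, > and both kinds of quotes are escaped.
escaped = html.escape(s)
print(escaped)

# quote=False: only &, < and > are escaped; quotes are left alone.
text_only = html.escape(s, quote=False)
print(text_only)

# html.unescape (Python 3.4+) reverses either form.
round_trip = html.unescape(escaped)
print(round_trip == s)
```

The quote=False form is enough for text placed between tags; the default is the safer choice when the value may end up inside an attribute.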

The cgi module that comes with Python has an escape() function:

import cgi

s = cgi.escape( """& < >""" )   # s = "&amp; &lt; &gt;"

However, it doesn't escape characters beyond &, <, and >. If it is used as cgi.escape(string_to_escape, quote=True), it also escapes ".


Here's a small snippet that will let you escape quotes and apostrophes as well:

 html_escape_table = {
     "&": "&amp;",
     '"': "&quot;",
     "'": "&apos;",
     ">": "&gt;",
     "<": "&lt;",
     }

 def html_escape(text):
     """Produce entities within text."""
     return "".join(html_escape_table.get(c,c) for c in text)

You can also use escape() from xml.sax.saxutils to escape HTML. This function should execute faster. The unescape() function of the same module can be passed the same arguments to decode a string.

from xml.sax.saxutils import escape, unescape
# escape() and unescape() take care of &, < and >.
html_escape_table = {
    '"': "&quot;",
    "'": "&apos;"
}
html_unescape_table = {v:k for k, v in html_escape_table.items()}

def html_escape(text):
    return escape(text, html_escape_table)

def html_unescape(text):
    return unescape(text, html_unescape_table)

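Thanks for mentioning quote=True in cgi.escape. - sidx
Note that some of your replacements are not HTML-compliant. For example: https://www.w3.org/TR/xhtml1/#C_16 Use &#39; instead of &apos;. I guess a few others were added in the HTML4 standard, but that one wasn't. - leetNightshade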
I came here looking for a way to unescape special characters, and I found that the html module has an unescape() method :) html.unescape('Some &#39;single quotes&#39;') - Дмитро Олександрович

5

The cgi.escape method converts special characters to HTML entities.

 import cgi
 original_string = 'Hello "XYZ" this \'is\' a test & so on '
 escaped_string = cgi.escape(original_string, True)
 print original_string
 print escaped_string

will result in
Hello "XYZ" this 'is' a test & so on 
Hello &quot;XYZ&quot; this 'is' a test &amp; so on 

The second parameter of cgi.escape is optional and controls whether double quotes are escaped. By default, they are not escaped.


1
I don't understand why cgi.escape is so squeamish about converting quotes, and then ignores single quotes altogether. - Ned Batchelder
1
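Because quotes don't need to be escaped in PCDATA, but they do need to be escaped in attributes (which more often use double quotes as delimiters), and the former case is much more common than the latter. In general, it's a textbook 90% solution (more like >99%). If you have to save every last byte and want it to figure out dynamically which kind of quoting does that, use xml.sax.saxutils.quoteattr(). - Mike DeSimone
Just a heads-up: cgi is deprecated as of Python 3.11 and will be removed in Python 3.13. See PEP 594. - nigh_anxiety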
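
To illustrate the quoteattr() suggestion in the comment above, here is a small sketch. The attr() helper is hypothetical and exists only for this example; quoteattr itself comes from the standard library and picks the delimiter that needs the least escaping.

```python
from xml.sax.saxutils import quoteattr

def attr(name, value):
    """Hypothetical helper: render name=value with a safely quoted value."""
    return "%s=%s" % (name, quoteattr(value))

# No quotes in the value: wrapped in double quotes, & is still escaped.
print(attr("title", "Tom & Jerry"))

# Only double quotes in the value: wrapped in single quotes instead.
print(attr("title", 'say "hi"'))

# Only single quotes in the value: wrapped in double quotes.
print(attr("title", "it's"))

# Both kinds: wrapped in double quotes, inner double quotes become &quot;.
print(attr("title", '''both ' and "'''))
```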

4
A simple string function will do it:
def escape(t):
    """HTML-escape the text in `t`."""
    return (t
        .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
        .replace("'", "&#39;").replace('"', "&quot;")
        )

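Other answers in this thread have minor problems: the cgi.escape method for some reason ignores single quotes, and you need to ask it explicitly to handle double quotes. The linked wiki page handles all five characters, but uses the XML entity &apos;, which isn't an HTML entity.

This function always handles all five characters, using HTML-standard entities.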
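
One detail worth spelling out is that the ampersand has to be replaced first. The sketch below repeats the function from this answer and contrasts it with a deliberately wrong variant (escape_wrong_order is a hypothetical counter-example, not code from the thread):

```python
import html

def escape(t):
    """HTML-escape the text in `t`, replacing & before anything else."""
    return (t
        .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
        .replace("'", "&#39;").replace('"', "&quot;")
        )

def escape_wrong_order(t):
    # Hypothetical counter-example: & is replaced last, so the ampersand
    # introduced by "&lt;" is escaped a second time.
    return t.replace("<", "&lt;").replace("&", "&amp;")

sample = "a < b & 'c'"
good = escape(sample)
bad = escape_wrong_order(sample)
print(good)   # a &lt; b &amp; &#39;c&#39;
print(bad)    # a &amp;lt; b &amp; 'c'

# The correctly ordered version survives a round trip through html.unescape.
print(html.unescape(good) == sample)
```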


2
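The other answers here will help with the characters you listed and a few others. However, if you also want to convert everything else to entity names, you'll have to do something more. For instance, if á needs to be converted to &aacute;, neither cgi.escape nor html.escape will help you. You'll want to do something like the following, which uses html.entities.entitydefs, which is just a dictionary. (The following code is made for Python 3.x, but there's a partial attempt at making it compatible with 2.x to give you an idea):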
# -*- coding: utf-8 -*-

import sys

if sys.version_info[0]>2:
    from html.entities import entitydefs
else:
    from htmlentitydefs import entitydefs

text=";\"áèïøæỳ" #This is your string variable containing the stuff you want to convert
text=text.replace(";", "$ஸ$") #$ஸ$ is just something random the user isn't likely to have in the document. We're converting it so it doesn't convert the semi-colons in the entity name into entity names.
text=text.replace("$ஸ$", "&semi;") #Converting semi-colons to entity names

if sys.version_info[0]>2: #Using appropriate code for each Python version.
    for k,v in entitydefs.items():
        if k not in {"semi", "amp"}:
            text=text.replace(v, "&"+k+";") #You have to add the & and ; manually.
else:
    for k,v in entitydefs.iteritems():
        if k not in {"semi", "amp"}:
            text=text.replace(v, "&"+k+";") #You have to add the & and ; manually.

#The above code doesn't cover every single entity name, although I believe it covers everything in the Latin-1 character set. So, I'm manually doing some common ones I like hereafter:
text=text.replace("ŷ", "&ycirc;")
text=text.replace("Ŷ", "&Ycirc;")
text=text.replace("ŵ", "&wcirc;")
text=text.replace("Ŵ", "&Wcirc;")
text=text.replace("ỳ", "&#7923;")
text=text.replace("Ỳ", "&#7922;")
text=text.replace("ẃ", "&wacute;")
text=text.replace("Ẃ", "&Wacute;")
text=text.replace("ẁ", "&#7809;")
text=text.replace("Ẁ", "&#7808;")

print(text)
#Python 3.x outputs: &semi;&quot;&aacute;&egrave;&iuml;&oslash;&aelig;&#7923;
#The Python 2.x version outputs the wrong stuff. So, clearly you'll have to adjust the code somehow for it.
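
As an alternative sketch for Python 3 only, the same idea can be written as a single pass using html.entities.codepoint2name, with numeric references as a fallback. This is not the code from the answer above: named_entities is a hypothetical helper, and unlike that code it leaves semicolons untouched, on the assumption that a literal ; needs no escaping in HTML.

```python
import html
from html.entities import codepoint2name

def named_entities(text):
    """Escape markup characters, then turn non-ASCII characters into
    named references where HTML 4 defines one, numeric ones otherwise."""
    out = []
    for ch in html.escape(text, quote=True):
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                           # plain ASCII stays as-is
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])  # e.g. &aacute;
        else:
            out.append("&#%d;" % cp)                 # e.g. &#7923;
    return "".join(out)

# Escape sequences keep the sample independent of the file encoding:
# \u00e1 is á, \u00e8 is è, \u1ef3 is ỳ.
sample = ';"\u00e1\u00e8\u1ef3'
result = named_entities(sample)
print(result)   # ;&quot;&aacute;&egrave;&#7923;
```

Because each character is visited exactly once, the entities already produced are never re-escaped, which avoids the placeholder trick for semicolons and ampersands.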
