How do I unescape apostrophes and similar characters in Python?

Question

How do I unescape apostrophes and similar characters in Python?

8

I have a string that contains symbols like this:

&#39;

That is clearly an apostrophe.

I tried decoding it with saxutils.unescape() and urllib.unquote(), but neither worked.

How can I decode this? Thanks!

- rick
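
For context, here is a minimal Python 3 sketch of why the two functions tried in the question leave `&#39;` untouched. By default `xml.sax.saxutils.unescape` only handles `&amp;`, `&lt;` and `&gt;`, and `urllib.parse.unquote` (the Python 3 home of `urllib.unquote`) decodes URL percent-escapes such as `%27`, not HTML entities. The sample string `s` is made up for illustration.

```python
from urllib.parse import unquote
from xml.sax.saxutils import unescape

s = "l&#39;eau"  # illustrative sample string

# saxutils.unescape only knows &amp;, &lt; and &gt; by default
print(unescape(s))                   # l&#39;eau

# an explicit mapping makes it handle this one entity
print(unescape(s, {"&#39;": "'"}))   # l'eau

# unquote decodes percent-escapes, so the entity is left alone
print(unquote(s))                    # l&#39;eau
print(unquote("l%27eau"))            # l'eau
```

Passing a mapping to `unescape` works for a known, small set of entities, but it does not generalize to arbitrary numeric or named entities.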

3 Answers

2

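Check out this question. What you're looking for is "HTML entity decoding". Typically you'll find a function named something like "htmldecode" that does what you want. Django, Cheetah, and BeautifulSoup all provide such a function.

If you don't want to use a library and all the entities are numeric, the other answer will work just fine too.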

- easel

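Thanks. What is the function in Django? I can't find anything about it in the docs... - rick

It's called django.utils.html.escape. See the other Stack Overflow question I linked for more details. - easel

It looks like django.utils.html.escape only encodes and cannot decode. I ended up using BeautifulSoup. Thanks. - rick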
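
The encode/decode distinction raised in these comments can be illustrated with the standard library alone. This is a minimal sketch assuming Python 3.4 or later, where `html.unescape` was added; it is not the Django or BeautifulSoup API discussed above, and the sample text is made up.

```python
import html

text = "l'eau < 5"  # illustrative sample text

# Encoding: special characters become entities
encoded = html.escape(text)
print(encoded)   # l&#x27;eau &lt; 5

# Decoding: entities (numeric or named) become characters again
decoded = html.unescape(encoded)
print(decoded)   # l'eau < 5

# The entity from the question decodes the same way
print(html.unescape("&#39;"))   # '
```

An "escape" function goes from text to entities, so a separate "unescape" function is what the question needs.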

1

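The most robust solution seems to be this function by Python luminary Fredrik Lundh. It isn't the shortest solution, but it handles named entities as well as hex and decimal codes.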

- John Y

- Adrian Mester · Accepted Answer

Try this (found here):

from htmlentitydefs import name2codepoint as n2cp
import re

def decode_htmlentities(string):
    """
    Decode HTML entities (hex, decimal, or named) in a string
    @see http://snippets.dzone.com/posts/show/4569

    >>> u = u'E tu vivrai nel terrore - L&#x27;aldil&#xE0; (1981)'
    >>> print decode_htmlentities(u).encode('UTF-8')
    E tu vivrai nel terrore - L'aldilà (1981)
    >>> print decode_htmlentities("l&#39;eau")
    l'eau
    >>> print decode_htmlentities("foo &lt; bar")                
    foo < bar
    """
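    def substitute_entity(match):
        ent = match.group(3)
        if match.group(1) == "#":
            # decoding by number
            try:
                if match.group(2) == '':
                    # number is in decimal
                    return unichr(int(ent))
                else:
                    # number is in hex
                    return unichr(int(ent, 16))
            except ValueError:
                # not a valid number or code point; leave the text as-is
                return match.group()
        else:
            # they were using a name; group(2) may have swallowed a leading
            # "x" (as in &xi;), so put it back before the lookup
            cp = n2cp.get(match.group(2) + ent)
            if cp: return unichr(cp)
            else: return match.group()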

    entity_re = re.compile(r'&(#?)(x?)(\w+);')
    return entity_re.subn(substitute_entity, string)[0]
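
The function above is Python 2 code (`htmlentitydefs`, `unichr`, the `print` statement). Below is a rough Python 3 sketch of the same regex-based technique, not the original author's code. The name `decode_htmlentities_py3` and the reworked pattern are assumptions made for illustration; on Python 3.4+ the standard library's `html.unescape` covers the same ground.

```python
import re
from html.entities import name2codepoint

# Hex (&#x27;), decimal (&#39;) and named (&lt;) forms, one group each
_ENTITY_RE = re.compile(r"&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|(\w+));")


def decode_htmlentities_py3(text):
    """Decode hex, decimal, or named HTML entities in a string."""

    def substitute_entity(match):
        hex_digits, dec_digits, name = match.groups()
        try:
            if hex_digits:
                return chr(int(hex_digits, 16))
            if dec_digits:
                return chr(int(dec_digits))
        except (ValueError, OverflowError):
            # number is outside the valid code point range; leave as-is
            return match.group()
        cp = name2codepoint.get(name)
        # unknown names are left untouched
        return chr(cp) if cp is not None else match.group()

    return _ENTITY_RE.sub(substitute_entity, text)


print(decode_htmlentities_py3("l&#39;eau"))      # l'eau
print(decode_htmlentities_py3("foo &lt; bar"))   # foo < bar
```

Giving each entity form its own capture group keeps a name that starts with "x" (such as `&xi;`) from being mistaken for a hex prefix.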