How do I convert HTML entities to Unicode and vice versa in Python?
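As for the "vice versa" part (which I needed myself; that is how I found this question, but it did not help, and I later found the answer on another site):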
u'some string'.encode('ascii', 'xmlcharrefreplace')
returns a plain string in which every non-ASCII character has been converted to an XML (HTML) numeric character reference.
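The call above is written for Python 2. A minimal sketch of the same idea in Python 3, where `encode` returns `bytes` and a final `decode` is needed to get a `str` back (the helper name `to_char_refs` is made up for illustration):

```python
def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    # 'xmlcharrefreplace' is a standard codec error handler: characters that
    # cannot be encoded as ASCII are emitted as &#NNNN; instead of raising.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_char_refs("It\u2019s caf\u00e9"))  # It&#8217;s caf&#233;
```

Note that this does not touch `&`, `<` or `>`; those need a separate escaping step.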
>>> u'\u2019'.encode('utf-8').decode('utf-8').encode('ascii', 'xmlcharrefreplace')
returns '&#8217;'. - Piotr Dobrogost
You need to have BeautifulSoup installed.
from BeautifulSoup import BeautifulStoneSoup
import cgi
def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode. For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities. For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"
uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)
print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
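The helpers above depend on Python 2 and BeautifulSoup 3. A rough Python 3 counterpart can be sketched with only the standard-library `html` module; the snake_case function names are invented here, and `quote=False` is an assumption chosen to mirror the default behaviour of `cgi.escape`:

```python
import html


def html_entities_to_unicode(text: str) -> str:
    """Turn named and numeric character references into plain characters."""
    return html.unescape(text)


def unicode_to_html_entities(text: str) -> str:
    """Escape &, < and >, then replace non-ASCII characters with &#NNNN;."""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"
uni = html_entities_to_unicode(text)
htmlent = unicode_to_html_entities(uni)
print(uni)
print(htmlent)
```

The round trip is not an identity: named references such as `&reg;` come back in numeric form (`&#174;`), which is equivalent for an HTML parser.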
Update for Python 2.7 and BeautifulSoup4
Unescape: Unicode HTML to Unicode with htmlparser (Python 2.7 standard library):
>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Unescape: Unicode HTML to Unicode with bs4 (BeautifulSoup4):
>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Escape: Unicode to Unicode HTML with bs4 (BeautifulSoup4):
>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
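As hekevintran's answer suggests, you can use cgi.escape(s) to encode strings, but note that quote encoding is off by default in that function, so it is a good idea to pass the quote=True keyword argument along with your string. Even with quote=True, the function does not escape single quotes ("'"). (Because of these issues, the function has been deprecated since version 3.2.)
The suggested replacement for cgi.escape(s) is html.escape(s), available since version 3.2.
In addition, html.unescape(s) was introduced in version 3.4.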
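To make the quoting difference concrete, here is a small sketch of `html.escape` with and without `quote`. In `html.escape` the `quote` parameter defaults to `True` and covers both quote characters; the sample string is made up:

```python
import html

sample = 'He said "it\'s <b>"'

# Default (quote=True): &, <, >, double quotes and single quotes are escaped.
with_quotes = html.escape(sample)

# quote=False: only &, < and > are escaped; quotes are left untouched.
without_quotes = html.escape(sample, quote=False)

print(with_quotes)     # He said &quot;it&#x27;s &lt;b&gt;&quot;
print(without_quotes)  # He said "it's &lt;b&gt;"
```

Escaping quotes matters mainly when the result is placed inside an HTML attribute value.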
So, in Python 3.4, html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() converts special characters to HTML entities, and html.unescape(text) converts HTML entities back to their plain-text representation.
$ python3 -c "
> import html
> print(
>     html.unescape('&amp;&copy;&mdash;')
> )"
&©—
$ python3 -c "
> import html
> print(
> html.escape('&©—')
> )"
&amp;©—
$ python2 -c "
> from HTMLParser import HTMLParser
> print(
>     HTMLParser().unescape('&amp;&copy;&mdash;')
> )"
&©—
$ python2 -c "
> import cgi
> print(
> cgi.escape('&©—')
> )"
&amp;©—
HTML strictly requires only & (ampersand) and < (left angle bracket / less-than sign) to be escaped: https://html.spec.whatwg.org/multipage/parsing.html#data-state
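A minimal sketch of that "escape only what is strictly needed" approach for text content follows. The function name `escape_minimal` is hypothetical, and the sketch assumes the output goes into element content, not into an attribute value, where quotes would also need escaping:

```python
def escape_minimal(text: str) -> str:
    """Escape only the two characters that are significant in text content."""
    # The ampersand must be replaced first; otherwise the '&' introduced by
    # '&lt;' would itself be escaped a second time.
    return text.replace("&", "&amp;").replace("<", "&lt;")


print(escape_minimal("a < b && c > d"))  # a &lt; b &amp;&amp; c > d
```

In practice `html.escape` is the safer default, since it also handles `>` and quotes.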
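™ (trademark symbol) and € (euro symbol) are not encoded correctly, because ISO-8859-1 does not define them. (ISO-8859-1 is often conflated with Windows-1252, but Windows-1252 is a distinct superset that does define both.)
Also note that the default character set changed between HTML4 and HTML5, which uses UTF-8.
So we have to find a workaround (find and replace these characters first).
The reference I started from is Mozilla's documentation:
https://developer.mozilla.org/en-US/docs/Web/Guide/Localizations_and_character_encodings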
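The encoding behaviour behind this can be checked directly in Python 3. The sketch below tries both symbols against ISO-8859-1 and Windows-1252 (codec name `cp1252`) and then shows the character-reference fallback; the variable names are illustrative only:

```python
symbols = "\u2122\u20ac"  # trademark sign and euro sign

# ISO-8859-1 has no code point for either symbol, so encoding fails.
try:
    symbols.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Windows-1252 maps them to single bytes (0x99 and 0x80).
cp1252_bytes = symbols.encode("cp1252")

# Numeric character references work regardless of the page encoding.
refs = symbols.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(latin1_ok, cp1252_bytes, refs)
```

Writing numeric references sidesteps the charset question entirely, which is the approach the function below takes.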
import html

def html_wr(f, dat):
    ''' write dat to file f as html
        . file is assumed to be opened in binary format
        . if dat is empty it is replaced with a non-breaking space
        . non-ascii characters are translated to xml character references
    '''
    if not dat:
        dat = '&nbsp;'
    try:
        # pure-ASCII data is written as-is
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        # otherwise escape &, <, > and quotes, then emit &#NNNN; references
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))

#!/usr/bin/env python3
import fileinput
import html
for line in fileinput.input():
    print(html.unescape(line.rstrip('\n')))