Displaying special characters in Python

Question

python encoding special-characters python-unicode

5

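I know there are plenty of posts about this problem, but I haven't found one that solves mine.

I'm trying to print a string, but when printed it doesn't show the special characters (e.g. æ, ø, å, ö and ü). When I print the string using repr(), this is what I get:

u'Von D\xc3\xbc' and u'\xc3\x96berg'

Does anyone know how to convert these to Von Dü and Öberg? It's important to me that the characters are not dropped, as happens with myStr.encode("ascii", "ignore").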

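Edit

This is the code I use to scrape the site. The content of each cell (<td>) in a table (<table>) is put into the variable name. That is the variable containing the special characters that won't print.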

import urllib2                 # Python 2 standard library
from bs4 import BeautifulSoup  # assumed: find_all() is the Beautiful Soup 4 API

web = urllib2.urlopen(url)
soup = BeautifulSoup(web)
tables = soup.find_all("table")
scene_tables = [2, 3, 6, 7, 10]
scene_index = 0
# Iterate over the <table>s we want to work with
for scene_table in scene_tables:
    i = 0
    # Iterate over the <td>s to find time and name
    for td in tables[scene_table].find_all("td"):
        if i % 2 == 0:  # td contains the time
            time = remove_whitespace(td.get_text())
        else:           # td contains the name
            name = remove_whitespace(td.get_text()) # This is the variable containing "nonsense"
            print "%s: %s" % (time, name,)
        i += 1
    scene_index += 1

- simonbs

Is your console set to UTF-8? - Fabian

I'm using the default Terminal in Mac OS X, with UTF-8 enabled. - simonbs

3 Answers

3

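Unicode support is confusing in many languages, so the mistake here is understandable. Those strings are UTF-8 bytes, and they work properly once you drop the leading u: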

>>> err = u'\xc3\x96berg'
>>> print err
Ã?berg
>>> x = '\xc3\x96berg'
>>> print x
Öberg
>>> u = x.decode('utf-8')
>>> u
u'\xd6berg'
>>> print u
Öberg
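
The transcript above is Python 2. As a minimal Python 3 sketch of the same mechanism, assuming the text started out as correct UTF-8 bytes (all variable names here are illustrative, not from the question): decoding those bytes one byte per character yields exactly the two-character garbage shown in the question, while decoding them as UTF-8 yields the intended letter.

```python
# Python 3 sketch; names are illustrative and not taken from the question.
raw = "Öberg".encode("utf-8")   # the UTF-8 bytes: b'\xc3\x96berg'

# Treating each byte as one character (Latin-1) reproduces the broken string.
wrong = raw.decode("latin-1")
# Decoding the same bytes as UTF-8 gives the intended text.
right = raw.decode("utf-8")

print(ascii(wrong))             # '\xc3\x96berg'  (6 characters)
print(ascii(right))             # '\xd6berg'      (5 characters)
```

The broken form is one character longer because the two bytes that encode Ö were each turned into a separate character.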

For more information, see these links:

http://www.joelonsoftware.com/articles/Unicode.html

http://docs.python.org/howto/unicode.html

You should read and understand those links before going any further. If you badly need something right now, you can use this embarrassingly horrible hack:

def convert_fake_unicode_to_real_unicode(string):
    return ''.join(map(chr, map(ord, string))).decode('utf-8')
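
For readers on Python 3, the helper above might be ported roughly as follows. This is a sketch under assumptions, not the answerer's code: in Python 2, chr() on values below 256 builds a byte string, which Python 3 expresses with bytes(). The second function, via_latin1 (a hypothetical name), performs the same round trip through the Latin-1 codec, because Latin-1 maps code points 0 to 255 onto the bytes with the same values.

```python
def convert_fake_unicode_to_real_unicode(text):
    # Turn each code point back into the byte with the same value, then decode.
    # bytes() raises ValueError if any code point is above 255.
    return bytes(map(ord, text)).decode("utf-8")


def via_latin1(text):
    # Hypothetical equivalent: the same round trip through the Latin-1 codec.
    return text.encode("latin-1").decode("utf-8")


broken = "Von D\xc3\xbc"
print(ascii(convert_fake_unicode_to_real_unicode(broken)))  # 'Von D\xfc'
print(convert_fake_unicode_to_real_unicode(broken) == via_latin1(broken))  # True
```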

- A B

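When I print the string without repr(), I get Ãberg, but I want Öberg. If I use decode('utf-8'), I get a UnicodeEncodeError. If these strings are UTF-8, shouldn't it write Ö instead of Ã? - simonbs

1

You need to figure out how those variables ended up as type unicode in the first place. They are really UTF-8 encoded ASCII characters, so they should properly be of type str. - A B

-1 for (1) the join/map/chr/map/ord mess and (2) "UTF-8 encoded ascii" - John Machin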

1

Those strings are not Unicode; they are UTF-8 encoded.

>>> print u'Von D\xc3\xbc'
Von DÃ¼
>>> print 'Von D\xc3\xbc'
Von Dü

>>> print unicode('Von D\xc3\xbc', 'utf-8')
Von Dü
>>>

Edit:

>>> print '\xc3\x96berg' # no unicode identifier, works as expected because it's an UTF-8 encoded string
Öberg
>>> print u'\xc3\x96berg' # has unicode identifier, means print uses the unicode charset now, outputs weird stuff
Ãberg

# Look at the differing object types:
>>> type('\xc3\x96berg')
<type 'str'>
>>> type(u'\xc3\x96berg')
<type 'unicode'>

>>> '\xc3\x96berg'.decode('utf-8') # this command converts from UTF-8 to unicode, look at the unicode identifier in the output
u'\xd6berg'
>>> unicode('\xc3\x96berg', 'utf-8') # this does the same thing
u'\xd6berg'
>>> unicode(u'foo bar', 'utf-8') # trying to convert a unicode string to unicode will fail as expected
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoding Unicode is not supported

- Fabian

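When I print the string without repr(), I get Ãberg, but I want Öberg. If these strings are UTF-8 encoded, shouldn't it write Ö instead of Ã? If I use unicode, I get the following error: TypeError: decoding Unicode is not supported. - simonbs

You're still using the unicode identifier (u'foo'). It is a UTF-8 encoded string, and by using the unicode identifier you declare it to be unicode when it actually isn't. That's why you get Ã instead of Ö. Remove the identifier and the problem is solved. I'll update my answer to make this clearer. - Fabian

@SimonBS I've updated my answer. You should still read this link: http://docs.python.org/howto/unicode.html - Fabian

I've just read the link, but I'm still somewhat confused. I have a string myStr of type unicode, which means it has the Unicode identifier. I want to remove that identifier and get a UTF-8 encoded string. How do I do that? I thought myStr.encode("utf-8") would do it, returning an object of type str, but that throws a UnicodeDecodeError. - simonbs

“Those strings are not unicode” -- there is a u at the front of repr(those_strings); they are actually botched unicode rather than proper unicode. He has DATA, not a source-code literal. The u was put there by repr(); he can't "remove the identifier". - John Machin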


Content provided by Stack Overflow.

- John Machin · Accepted Answer

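Prevention is better than cure. You need to find out how that rubbish is being created. Please edit your question to show the code that creates it, and then we can help you fix it. It looks as though somebody has done this:

your_unicode_string = original_utf8_encoded_bytestring.decode('latin1')

The cure is simply to reverse that step and then decode correctly: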

correct_unicode_string = your_unicode_string.encode('latin1').decode('utf8')
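
A hedged Python 3 sketch of that round trip, wrapped in a guard. The function name repair_mojibake and the fall-back behaviour are assumptions added for illustration; the answer itself only gives the one-line expression. The guard is there because the round trip raises an error on text that was never broken in this way.

```python
def repair_mojibake(text):
    """Undo a wrong Latin-1 decode of UTF-8 bytes; return text unchanged if that does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # UnicodeEncodeError: a character above U+00FF, so this is not Latin-1 mojibake.
        # UnicodeDecodeError: the resulting bytes are not valid UTF-8, so leave the text alone.
        return text


for sample in ("Von D\xc3\xbc", "\xc3\x96berg", "\xd6berg"):
    print(ascii(repair_mojibake(sample)))
# 'Von D\xfc'
# '\xd6berg'
# '\xd6berg'
```

The third sample is already correct and passes through untouched. The guard cannot tell apart genuine text that happens to form valid UTF-8 when encoded as Latin-1, so fixing the decoding at the source remains the better option.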

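Update: Based on the code you supplied, the probable cause is that the website declares its encoding as ISO-8859-1 (also known as latin1) but is actually encoded in UTF-8. Please update your question to show us the URL.

If you can't show the URL, read the BS docs; it looks like you need to use: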

BeautifulSoup(web, from_encoding='utf8')
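
If the raw bytes are decoded by hand instead, one possible approach, sketched here in Python 3 with only the standard library, is to try UTF-8 first and fall back to the charset the site declares. The helper decode_page and this fallback order are assumptions for illustration; this is not a description of how BeautifulSoup picks an encoding.

```python
def decode_page(raw, declared_encoding):
    """Decode page bytes, preferring UTF-8 over the (possibly wrong) declared charset."""
    try:
        # Strict UTF-8 decoding rarely succeeds by accident on non-ASCII Latin-1 data.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(declared_encoding, errors="replace")


# A page that claims ISO-8859-1 but really contains UTF-8 bytes.
mislabelled = "Öberg".encode("utf-8")
# A page that claims ISO-8859-1 and really is ISO-8859-1.
honest = "Öberg".encode("iso-8859-1")

print(ascii(decode_page(mislabelled, "iso-8859-1")))  # '\xd6berg'
print(ascii(decode_page(honest, "iso-8859-1")))       # '\xd6berg'
```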