UTF-8 problem in Python when reading characters

Question

8

I'm using Python 2.5. What is going on here? What have I misunderstood? How can I fix it?

in.txt:

Stäckövérfløw

code.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
print """Content-Type: text/plain; charset="UTF-8"\n"""
f = open('in.txt','r')
for line in f:
    print line
    for i in line:
        print i,
f.close()

Output:

Stäckövérfløw

S t � � c k � � v � � r f l � � w

- jacob

5 Answers

2

Use codecs.open instead; it works for me.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
print """Content-Type: text/plain; charset="UTF-8"\n"""
# codecs.open decodes the file as UTF-8, so each line is a unicode object
f = codecs.open('in.txt','r','utf8')
for line in f:
    print line
    for i in line:
        print i,
f.close()
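
As a self-contained sketch of the codecs.open approach, the snippet below creates a throwaway UTF-8 file and reads it back. The temporary file is a hypothetical stand-in for in.txt, and the code is written so that it also runs on Python 3, which is an assumption beyond the Python 2.5 setup in the question.

```python
import codecs
import os
import tempfile

# Create a throwaway UTF-8 file standing in for in.txt (hypothetical path).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(u"St\u00e4ck\u00f6v\u00e9rfl\u00f8w\n".encode("utf-8"))

# codecs.open decodes while reading, so every line is a Unicode string.
with codecs.open(path, "r", "utf8") as f:
    lines = [line for line in f]

os.remove(path)

# Iterating over a decoded line yields whole characters, not single bytes.
chars = list(lines[0].rstrip(u"\n"))
print(len(chars))
```

Because the decoding happens inside the reader, the inner loop from the question would see one item per character here.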

- mhawke

1

Take a look at this:

# -*- coding: utf-8 -*-
import pprint
f = open('unicode.txt','r')
for line in f:
    print line
    pprint.pprint(line)
    for i in line:
        print i,
f.close()

It returns the following:

Stäckövérfløw
'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'
S t ? ? c k ? ? v ? ? r f l ? ? w

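The problem is that the file is being read as a plain string of bytes. Iterating over it splits each multibyte character into individual byte values that are meaningless on their own.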
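
The difference between bytes and characters can be checked directly. This is a minimal sketch, assuming Python 3 or a recent Python 2; the sample text is built with escape sequences so the snippet does not depend on the source file's encoding.

```python
# The same text seen as raw bytes and as decoded characters.
text = u"St\u00e4ck\u00f6v\u00e9rfl\u00f8w"
raw = text.encode("utf-8")   # what a byte-mode read of the file returns

# Each of the four non-ASCII letters takes two bytes in UTF-8.
byte_count = len(raw)
char_count = len(text)

# Iterating over the bytes yields single byte values, not characters.
byte_values = [hex(b) for b in bytearray(raw)]

# Decoding first restores one item per character.
chars = list(raw.decode("utf-8"))

print(byte_count, char_count)
print(byte_values[:4])
```

The third and fourth byte values, 0xc3 and 0xa4, together encode the single character ä, which is why printing them separately produces garbage.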

- mikl

1

print c,

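adds a whitespace character after each item, which breaks valid UTF-8 sequences into invalid ones. So this will not work unless you write each byte to the output on its own, with no separator: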

sys.stdout.write(i)
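
The effect of the separator can be sketched without touching stdout. The helper is_valid_utf8 below is a hypothetical name introduced only for this illustration, and joining with spaces is used as an approximation of what `print i,` writes.

```python
raw = u"St\u00e4ck".encode("utf-8")

# Split into one-byte pieces, as the inner loop in the question does.
pieces = [raw[i:i + 1] for i in range(len(raw))]

# Joining with spaces separates each lead byte from its continuation byte.
spaced = b" ".join(pieces)
# Writing the pieces back to back keeps every multibyte sequence intact.
joined = b"".join(pieces)


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(spaced))
print(is_valid_utf8(joined))
```

Only the back-to-back version decodes, which matches the advice to write the bytes with sys.stdout.write rather than `print i,`.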

- Artyom

0

Sometimes you may simply want to decode each line yourself:

f = open('in.txt','r')
for line in f:
    print line
    for i in line.decode('utf-8'):
        print i,
f.close()

- j1k00

- Miles · Accepted Answer

for i in line:
    print i,

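When you read the file, the string you get back is a string of bytes, and the for loop iterates over it one byte at a time. That causes problems with a UTF-8 encoded string, because non-ASCII characters are represented by more than one byte. If you want to work with unicode objects, where the character is the basic unit, you should use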

import codecs
f = codecs.open('in.txt', 'r', 'utf8')

If sys.stdout doesn't have the appropriate encoding set, you may need to wrap it:

sys.stdout = codecs.getwriter('utf8')(sys.stdout)
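
To see what the wrapper does, here is a minimal sketch that wraps an in-memory byte stream instead of sys.stdout. The io.BytesIO buffer is a stand-in chosen for this illustration so the encoded output can be inspected.

```python
import codecs
import io

# Stand-in for a byte-oriented stdout, used so the result can be examined.
byte_stream = io.BytesIO()

# codecs.getwriter('utf8') returns a StreamWriter class; calling it wraps the stream.
writer = codecs.getwriter("utf8")(byte_stream)

# Text written to the wrapper is encoded to UTF-8 before it reaches the stream.
writer.write(u"St\u00e4ck\u00f6v\u00e9rfl\u00f8w")

encoded = byte_stream.getvalue()
print(repr(encoded))
```

On Python 3, sys.stdout is already a text stream, so this kind of wrapping applies to a byte stream such as sys.stdout.buffer rather than to sys.stdout itself.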