Python: Unicode problem

Question

13

I'm trying to decode a string that I read from a file:

file = open("./Downloads/lamp-post.csv", 'r')
data = file.readlines()
data[0]

(The output is a byte string of non-English characters mixed with some English words and dates. It appears to be monthly search data for January 2010 through August 2010, extracted from a web page, together with an estimated average CPC, i.e. cost per click.)

Adding "ignore" when decoding the string does not help.

- Oleg Tarasenko

My answer works, but it depends on whether you want to ignore or replace the characters that can't be decoded. - orlp

3 Answers

11

The file is encoded as UTF-16-LE and starts with a BOM.

import codecs

# "utf-16" uses the BOM to pick the byte order and drops it from the decoded text
fp = codecs.open("a", "r", "utf-16")
lines = fp.readlines()

- tzot

-1 Nonsense.
>>> raw = '\xff\xfeK\x00e\x00y\x00w\x00o\x00r\x00d\x00'
>>> raw.decode('utf_16le')
u'\ufeffKeyword'
>>> raw.decode('utf_16')
u'Keyword'

- John Machin

3

Edit

Since you've said you are on 2.7, here is a solution for 2.7 that replaces characters that can't be decoded:

file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "replace") for line in file]

To ignore characters that can't be decoded instead:

file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "ignore") for line in file]

- orlp
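To make the difference between the two error handlers concrete, here is a minimal sketch. The byte string is a made-up sample (a BOM, the letter "K", then one stray byte standing in for a cut-off character), not data from the asker's file. It is written to run on Python 3.

```python
# Hypothetical sample: BOM, "K" in UTF-16-LE, then one stray byte (a cut-off character)
raw = b"\xff\xfeK\x00e"

try:
    raw.decode("utf-16")  # default error handling is "strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = raw.decode("utf-16", "replace")  # undecodable input becomes U+FFFD
ignored = raw.decode("utf-16", "ignore")    # undecodable input is silently dropped
```

"replace" keeps a visible marker where data was lost, which is usually easier to debug than "ignore". Neither handler repairs a line that was split in the middle of a two-byte character.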

In [21]: file = open("./Downloads/lamp-post.csv", 'r')
In [22]: data = [line.decode() for line in file]
<type 'exceptions.UnicodeDecodeError'> Traceback (most recent call last)
/Users/oleg/<ipython console> in <module>()
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
In [23]: data = [line.decode() for line in file]
- Oleg Tarasenko

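Oh, do you want to ignore those invalid characters or replace them? I've edited my answer on the assumption that you want to replace them. - orlp

In Python 3, files are opened in unicode (text) mode by default, so the lines you read won't have a decode method. - Thomas K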

1

I've retracted my downvote, but in Python 3 there is a better way: use the encoding parameter of the open function. open("Downloads/lamp-post.csv", encoding="utf-16"). - Thomas K

@Oleg: Are you sure it's 2.7? /opt/local/lib/python2.5/? - Thomas K
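A minimal sketch of the Python 3 approach suggested in the comment above. It assumes a small made-up sample written to a temporary file in place of lamp-post.csv. The file name and contents are hypothetical.

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for lamp-post.csv: a BOM followed by UTF-16-LE text
sample_text = "Keyword\tJan 2010\nlamp post\t10\n"
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + sample_text.encode("utf-16-le"))

# Python 3: open() decodes while reading, so each line is already a str
with open(path, encoding="utf-16") as f:
    lines = f.readlines()
```

Because decoding happens before the text is split into lines, the two-byte newlines stay intact and no separate decode call is needed.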


- Sven Marnach · Accepted Answer

This looks like UTF-16 encoded data. So try:

data[0].rstrip("\n").decode("utf-16")

Edit (in response to your update): try decoding the whole file at once, i.e.

data = open(...).read()
data.decode("utf-16")

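The problem is that a newline in UTF-16 (little-endian, as here) is the two bytes "\n\x00", but readlines() splits right after the "\n", so the "\x00" byte ends up at the start of the next line. Every line after the first then begins in the middle of a character and no longer decodes correctly.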
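The split described above can be reproduced without the original file. This is a minimal sketch using a made-up two-line sample (hypothetical data, not the asker's CSV), written for Python 3:

```python
import codecs

# Hypothetical sample standing in for the CSV file: BOM + UTF-16-LE text
text = "Keyword\tJan 2010\nlamp post\t10\n"
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Splitting the raw bytes at "\n" cuts every two-byte newline in half
byte_lines = raw.splitlines(True)  # True keeps the line endings
first_ends_with_newline = byte_lines[0].endswith(b"\n")
second_starts_with_nul = byte_lines[1].startswith(b"\x00")

# Decoding the whole buffer first and splitting afterwards keeps lines intact
decoded_lines = raw.decode("utf-16").splitlines()
```

The second byte line starts with the leftover "\x00", so it is offset by one byte and can't be decoded on its own. Decoding first and splitting second avoids this.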