Reading non-ASCII characters from a text file

Question

6

I'm using Python 2.7. I've tried many approaches, such as codecs, but none of them worked. How can I fix this?

myfile.txt

sözcük

My code

f = open('myfile.txt','r')
for line in f:
    print line
f.close()

Output

s\xc3\xb6zc\xc3\xbck

The output is the same in Eclipse and in the command window. I'm using Win7. When I don't read from a file, I have no problem with any characters.

- Rckt

3

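What result are you expecting? Technically speaking, Python has read the file successfully. - srgerg

Why are you printing the line character by character? Why not just use for line in f: print line? When I do that, it prints "söcük" as expected. - srgerg

I tried that, but it doesn't work. It prints s\xc3\xb6zc\xc3\xbck. - Rckt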

2

Python is working fine; the problem is the encoding of your terminal window / console. - Hamish

1

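Are you sure that you are printing in the Windows 7 "Command Prompt" (the black window) and that you actually see s\xc3\xb6zc\xc3\xbck printed exactly like that, including the backslash, x, c, 3 and so on? Are you really sure that you are executing print line and not print repr(line)? - John Machin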
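
To make the distinction raised in this comment concrete, here is a minimal sketch written to run under both Python 2.7 and Python 3. It assumes the file holds the UTF-8 bytes shown in the question's output; the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# The raw bytes a UTF-8 encoded file holds for the word in the question.
raw = b's\xc3\xb6zc\xc3\xbck'

# Decoding turns the 8 bytes into 6 characters of text.
text = raw.decode('utf-8')

# repr() yields an ASCII-only form in which the non-ASCII bytes
# appear as literal backslash escapes.
escaped = repr(raw)

print(escaped)                 # shows the \xc3\xb6 style escapes
print(len(raw), len(text))     # byte count versus character count
```

If the escaped form is what appears on screen, something is most likely printing the repr of the byte string (for example by printing a list that contains it) rather than the string itself.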

3 Answers

7

First, detect the encoding of the file:


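  from chardet import detect  # third-party package, not in the standard library
  # detect() returns a dict; its 'encoding' entry is only a guess
  encoding = lambda x: detect(x)['encoding']
  print encoding(line)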

Then convert it to unicode, or to a str in your default encoding:


  n_line=unicode(line,encoding(line),errors='ignore')
  print n_line
  print n_line.encode('utf8')
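
When chardet is not installed, a cruder alternative is to try a short list of candidate encodings in order. The sketch below is not part of the answer above: decode_with_fallback is a hypothetical helper, and the candidate list ('utf-8', then 'cp1254', the Windows Turkish code page) is an assumption chosen to suit the word in the question. It runs on Python 2.7 and Python 3.

```python
def decode_with_fallback(data, candidates=('utf-8', 'cp1254')):
    """Return (text, encoding_name) for the first candidate that decodes cleanly.

    Hypothetical helper; the default candidate list is an assumption.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never
    # fails, but the resulting text may be wrong.
    return data.decode('latin-1'), 'latin-1'


# UTF-8 bytes decode with the first candidate.
text, used = decode_with_fallback(b's\xc3\xb6zc\xc3\xbck')

# The same word saved in a single-byte code page is not valid UTF-8,
# so the helper falls through to the second candidate.
legacy_text, legacy_used = decode_with_fallback(b's\xf6zc\xfck')
```

Trying strict UTF-8 first is a reasonable default because text in a single-byte code page rarely happens to be valid UTF-8, whereas the reverse order would accept almost anything.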

- lavrton

1

This is a terminal encoding problem. Try configuring your terminal to use the same encoding as the file. I recommend UTF-8.
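
One way to see which encoding Python believes the terminal uses is to inspect sys.stdout.encoding. This is only a small sketch: the values it prints depend entirely on the machine, and on a Windows command prompt the result is often a legacy code page rather than UTF-8.

```python
import locale
import sys

# Encoding Python associates with standard output; this can be None
# when the output is piped or redirected instead of going to a console.
stdout_encoding = sys.stdout.encoding

# Encoding the operating system locale prefers for text data.
preferred = locale.getpreferredencoding()

print("stdout encoding: %s" % stdout_encoding)
print("locale preferred encoding: %s" % preferred)
```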

By the way, to avoid problems it is good practice to decode all input and encode all output:

f = open('test.txt', 'r')
for line in f:
    l = unicode(line, encoding='utf-8')  # decode the input
    print l.encode('utf-8')  # encode the output
f.close()
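
The same decode-on-input, encode-on-output habit can also be sketched with io.open, which exists in Python 2.6+ and Python 3 and decodes while reading. The temporary file below is an assumption that stands in for the asker's file so the example is self-contained.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Create a stand-in for the asker's file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'myfile.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u's\xf6zc\xfck\n')

# Decode at the boundary: every line read here is already text.
with io.open(path, 'r', encoding='utf-8') as f:
    lines = [line.rstrip(u'\n') for line in f]

# Encode at the boundary: convert back to bytes only when writing out.
encoded = [line.encode('utf-8') for line in lines]
```

Keeping text decoded inside the program and touching bytes only at the edges means the encoding has to be chosen in exactly two places.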

- jgomo3

Now I see why they made UTF-8 the standard in 3.0. (PEP 3120) - mgold

2

@mgold: PEP 3120 is mainly about the encoding of source (.py) files and has nothing to do with the input and/or output encoding problems the OP is having. - John Machin

- Biruk Demelash · Accepted Answer

import codecs
# open the file, decoding it as UTF-8
f = codecs.open("myfile.txt", "r", encoding='utf-8')
# read the whole file into a unicode string
sfile = f.read()
f.close()

# check the type of the decoded text
print type(sfile)  # <type 'unicode'>

# a unicode string should be encoded to a byte string to display it properly
print sfile.encode('utf-8')

# check the type of the encoded string
print type(sfile.encode('utf-8'))  # <type 'str'>