How to read a UTF-8 encoded text file using Python

Question

11

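I need to analyze a Tamil text file (UTF-8 encoded). I am using the nltk package in Python's IDLE shell. When I try to read the text file in the shell, I get the following error. How can I avoid it?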

corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
  File "C:\Users\Customer\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33: character maps to <undefined>

- Ramprashanth

I haven't read your question in full, but... if you have a bunch of bytes, you can decode them into a string with your_bytes.decode("UTF-8"). - byxor

1

Which Python version? - Antonis Christofides

Judging from the traceback, I infer it is Python 3. - Robᵩ

1 Answer

- Antonis Christofides · Accepted Answer

Since you are using Python 3, just add the encoding parameter to open():

corpus = open(
    r"C:\Users\Customer\Desktop\DISSERTATION\ettuthokai.txt", encoding="utf-8"
).read()
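
For context: when no encoding is given, Python 3 decodes text files with the locale's preferred encoding, which is cp1252 in the traceback above, and byte 0x8d has no mapping there. The following is a minimal sketch of the same fix using a `with` block so the file is closed afterwards, alongside the manual `bytes.decode` route mentioned in the comments. The helper name `read_utf8_text`, the sample string, and the temporary file are assumptions made for illustration; they are not part of the original question.

```python
import os
import tempfile


def read_utf8_text(path):
    """Return the whole file at `path`, decoded as UTF-8 (hypothetical helper)."""
    # Passing encoding explicitly avoids the platform default (e.g. cp1252).
    with open(path, encoding="utf-8") as f:
        return f.read()


# Hypothetical sample text standing in for the real corpus file.
sample = "தமிழ் இலக்கியம்"

# Create a temporary UTF-8 file so the sketch runs anywhere.
fd, tmp_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Route 1: let open() decode while reading.
    corpus = read_utf8_text(tmp_path)

    # Route 2: read raw bytes, then decode them manually.
    with open(tmp_path, "rb") as f:
        raw = f.read()
    decoded = raw.decode("utf-8")
finally:
    os.remove(tmp_path)
```

Both routes should yield the same string. Opening in text mode with `encoding="utf-8"` is usually the simpler choice when the file is known to be UTF-8.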