Writing a UTF-8 file in Python

Question

python utf-8 character-encoding byte-order-mark

248

I'm really confused by the codecs.open function. When I do:

file = codecs.open("temp", "w", "utf-8")
file.write(codecs.BOM_UTF8)
file.close()

It gives me the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

If I do:

file = open("temp", "w")
file.write(codecs.BOM_UTF8)
file.close()

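It works fine.

The question is: why does the first method fail, and how do I insert the BOM?

If the second method is the correct way of doing it, what is the point of using codecs.open(filename, "w", "utf-8")?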

- John Jiang

64

Please don't use a BOM in UTF-8. Really, please don't. - tchrist

10

@tchrist Huh? Why not? - salmatron

12

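@SalmanPK, UTF-8 does not need a BOM, and a BOM only adds complexity (for example, you can't simply concatenate BOM-prefixed files and get valid text). See this [Q&A](https://dev59.com/enE95IYBdhLWcg3wn_f2); don't miss the long comment below the question. - Alois Mahdal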
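
The concatenation problem mentioned in the comment above can be illustrated with a small sketch. The payloads are made up for illustration; the point is only that "utf-8-sig" removes a BOM at the very start of the data and nowhere else.

```python
import codecs

# Two UTF-8 payloads, each starting with its own BOM (EF BB BF).
part_a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
part_b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Naive concatenation, as `cat a.txt b.txt` would do.
combined = part_a + part_b

# Only the leading BOM is stripped, so the second one survives
# as a stray U+FEFF character in the middle of the text.
text = combined.decode("utf-8-sig")
print(repr(text))
```

The stray character is invisible in most editors, which is part of what makes it awkward to track down.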

8 Answers

200

Read the following: http://docs.python.org/library/codecs.html#module-encodings.utf_8_sig

Do this:

with codecs.open("test_output", "w", "utf-8-sig") as temp:
    temp.write("hi mom\n")
    temp.write(u"This has ♭")

The resulting file is UTF-8 with the expected BOM.
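
For Python 3, roughly the same thing can be done with the built-in open. This is a minimal sketch, assuming a scratch file in a temporary directory (the path is illustrative); it also checks the bytes on disk and reads the text back.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "test_output")

# "utf-8-sig" makes the encoder emit the BOM before the first character.
with open(path, "w", encoding="utf-8-sig") as temp:
    temp.write("hi mom\n")
    temp.write("This has ♭")

# Inspect the raw bytes to confirm the file starts with EF BB BF.
with open(path, "rb") as raw:
    data = raw.read()
has_bom = data.startswith(codecs.BOM_UTF8)

# Decoding with "utf-8-sig" drops the BOM again; plain "utf-8"
# would keep it as a leading U+FEFF character.
with open(path, encoding="utf-8-sig") as temp:
    text = temp.read()
```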

- S.Lott

2

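Thanks. This worked (Windows 7 x64, Python 2.7.5 x64). This solution works well when you open the file in "a" (append) mode. - Mohamad Fakih

This did not work for me on Windows with Python 3. I had to use with open(file_name, 'wb') as bomfile: bomfile.write(codecs.BOM_UTF8) instead, and then reopen the file for appending. - Dustin Andrews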
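
The two comments above report different experiences with append mode. One way to sidestep the question is to choose the encoding based on whether the file already has content. This is only a sketch for Python 3; the helper name append_text and the file name are hypothetical.

```python
import codecs
import os
import tempfile

def append_text(path, text):
    """Append text as UTF-8, writing a BOM only if the file is new or empty."""
    is_empty = not os.path.exists(path) or os.path.getsize(path) == 0
    # "utf-8-sig" emits the BOM; plain "utf-8" never does.
    encoding = "utf-8-sig" if is_empty else "utf-8"
    with open(path, "a", encoding=encoding) as f:
        f.write(text)

path = os.path.join(tempfile.mkdtemp(), "log.txt")
append_text(path, "first line\n")
append_text(path, "second line\n")

# Count BOMs in the raw bytes: there should be exactly one, at the start.
with open(path, "rb") as raw:
    data = raw.read()
bom_count = data.count(codecs.BOM_UTF8)
```

The check is not atomic, so it is not suitable when several processes append to the same file at once.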

Maybe add temp.close()? - user2905353

2

@user2905353: Not needed; the context manager returned by open takes care of it. - matheburg

This solved my problem. I successfully copied a Python script from Mac OS to a Windows machine. - Zeus

61

It is very simple: just use this. No library is needed.

with open('text.txt', 'w', encoding='utf-8') as f:
    f.write(text)

- Kamran Gasimov

12

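@S-Lott gives the right procedure, but to expand on the Unicode issues, the Python interpreter can provide more insight.

Jon Skeet is right (unusual) about the codecs module: it contains byte strings: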

>>> import codecs
>>> codecs.BOM
'\xff\xfe'
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
>>>

Another nitpick: the BOM has a standard Unicode name, and it can be entered as:

>>> bom= u"\N{ZERO WIDTH NO-BREAK SPACE}"
>>> bom
u'\ufeff'

It is also accessible via unicodedata:

>>> import unicodedata
>>> unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')
u'\ufeff'
>>>

- gimel
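
The interpreter session above is Python 2. In Python 3 the same relationship between the BOM character and the constants in codecs can be checked as follows; this is a small sketch, and the variable names are illustrative.

```python
import codecs
import unicodedata

bom_char = "\ufeff"                    # the BOM as a text character
bom_bytes = bom_char.encode("utf-8")   # the same character as UTF-8 bytes

print(unicodedata.name(bom_char))      # ZERO WIDTH NO-BREAK SPACE
print(bom_bytes == codecs.BOM_UTF8)    # True
print(bom_char.encode("utf-16-le") == codecs.BOM_UTF16_LE)  # True
```

In other words, the codecs constants are what you get when U+FEFF is encoded with the corresponding codec.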

10

I use the *nix file command to convert a file of unknown charset to a UTF-8 file:

# -*- encoding: utf-8 -*-

# convert a file of unknown encoding to utf-8

import codecs
import commands

file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)

file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')

for l in file_stream:
    file_output.write(l)

file_stream.close()
file_output.close()
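
The commands module used above exists only in Python 2. A rough Python 3 equivalent, assuming Python 3.7+ and the *nix file utility on PATH, might look like the sketch below. The function names detect_encoding and convert_to_utf8 are hypothetical, and file can report values such as "binary" or "unknown-8bit" that are not Python codec names, so real code would need to handle those.

```python
import subprocess

def detect_encoding(path):
    """Ask the *nix `file` utility for its guess at the encoding of path."""
    result = subprocess.run(
        ["file", "-b", "--mime-encoding", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def convert_to_utf8(src, dst, src_encoding):
    """Re-encode src (read as src_encoding) into dst as UTF-8, line by line."""
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding=src_encoding, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)

# Usage (needs the `file` command):
# convert_to_utf8("jumper.sub", "jumper.subb", detect_encoding("jumper.sub"))
```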

- Ricardo

1

Use # coding: utf8 instead of # -*- coding: utf-8 -*-; it is much easier to remember. - show0k

I would be very interested to see something like this running on Windows. - paradox

1

With Python 3.4 and later, using pathlib (Path.write_text itself was added in Python 3.5):

import pathlib
pathlib.Path("text.txt").write_text(text, encoding='utf-8') #or utf-8-sig for BOM

- celsowm

0

    def read_files(file_path):
    
        with open(file_path, encoding='utf8') as f:
            text = f.read()
            return text

**OR (AND)**

    def write_files(text, file_path):
    
        # 'wb', not 'rb': the file must be opened for writing bytes
        with open(file_path, 'wb') as f:
            f.write(text.encode('utf8', 'ignore'))

 **OR**

    # assumes Document comes from python-docx and file_path is a pathlib.Path
    document = Document()
    document.add_heading(file_path.name, 0)
    file_content = file_path.read_text(encoding='UTF-8')
    document.add_paragraph(file_content)

**OR**

    def read_text_from_file(cale_fisier):
        # cale_fisier ("file path") must be a pathlib.Path here
        text = cale_fisier.read_text(encoding='UTF-8')
        print("what I read: ", text)
        return text # return the text that was read
    
    def save_text_into_file(cale_fisier, text):
        with open(cale_fisier, "w", encoding='utf-8') as f: # open file; closed on exit
            print("what I wrote: ", text)
            f.write(text) # write the content to the file

**OR**

    def read_text_from_file(file_path):
        with open(file_path, encoding='utf8', errors='ignore') as f:
            text = f.read()
            return text # return the text that was read

**OR**

    def write_to_file(text, file_path):
        with open(file_path, 'wb') as f:
            f.write(text.encode('utf8', 'ignore')) # write the content to the file

- Just Me

-3

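If you are using pandas I/O methods such as DataFrame.to_excel(), add an encoding parameter, e.g.:

df.to_excel("somefile.xlsx", sheet_name="export", encoding='utf-8')  # df is a pandas DataFrame

I think this works for most international characters.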

- RogerZ

- Jon Skeet · Accepted Answer

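I believe the problem is that codecs.BOM_UTF8 is a byte string, not a Unicode string. I suspect the file handler is trying to guess what you really mean, along the lines of "I'm meant to be writing Unicode as UTF-8-encoded text, but you've given me a byte string!"

Try writing the Unicode string for the byte order mark (i.e. Unicode U+FEFF) directly, so that the file just encodes it as UTF-8: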

import codecs

file = codecs.open("lol", "w", "utf-8")
file.write(u'\ufeff')
file.close()

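(That seems to give the right answer: a file with the bytes EF BB BF.)

EDIT: S. Lott's suggestion of using "utf-8-sig" as the encoding is better than explicitly writing the BOM yourself, but I'll leave this answer here because it explains what was going wrong before.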
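
In Python 2 the byte string is implicitly decoded as ASCII before being encoded, which is where the UnicodeDecodeError in the question comes from. Python 3 makes the bytes/text split explicit instead. The sketch below assumes Python 3 and a scratch file in a temporary directory; it shows a text-mode file rejecting the raw BOM bytes and accepting the U+FEFF character.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lol")

# A text-mode file accepts str only; handing it the raw BOM bytes fails.
with open(path, "w", encoding="utf-8") as f:
    try:
        f.write(codecs.BOM_UTF8)
        error = None
    except TypeError as exc:
        error = exc

# Writing the BOM as the character U+FEFF lets the encoder produce the bytes.
with open(path, "w", encoding="utf-8") as f:
    f.write("\ufeff")

with open(path, "rb") as raw:
    data = raw.read()
print(data == codecs.BOM_UTF8)  # True
```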