Python encoding UTF-8

Question

python unicode encoding utf-8

52

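I am writing some scripts in Python. I build a string and save it to a file. The string holds a lot of data taken from the directory tree and file names of a directory.

According to convmv, my whole directory tree is encoded in UTF-8.

I want to keep everything in UTF-8 because I will store it in MySQL afterwards. For now, though, some characters come out wrong in MySQL (such as é or è; I am French).

I want Python to always treat strings as UTF-8. I read some information online and did the following.

My script begins like this: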

 #!/usr/bin/python
 # -*- coding: utf-8 -*-
 def createIndex():
     import codecs
     toUtf8=codecs.getencoder('UTF8')
     #lot of operations & building indexSTR the string who matter
     findex=open('config/index/music_vibration_'+date+'.index','a')
     findex.write(codecs.BOM_UTF8)
     findex.write(toUtf8(indexSTR)) #this bugs!

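When I run it, this is the result: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2171: ordinal not in range(128)

Edit: I see that the accents are written correctly in my file. After creating the file, I read it back and write its contents to MySQL. But I don't understand why I get encoding problems there. My MySQL database is in utf8, or at least seems to be: the SQL query SHOW variables LIKE 'char%' returns only utf8 or binary.

My function looks like this: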

#!/usr/bin/python
# -*- coding: utf-8 -*-

def saveIndex(index,date):
    import MySQLdb as mdb
    import codecs

    sql = mdb.connect('localhost','admin','*******','music_vibration')
    sql.charset="utf8"
    findex=open('config/index/'+index,'r')
    lines=findex.readlines()
    for line in lines:
        if line.find('#artiste') != -1:
            artiste=line.split('[:::]')
            artiste=artiste[1].replace('\n','')

            c=sql.cursor()
            c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom="'+artiste+'"')
            nbr=c.fetchone()
            if nbr[0]==0:
                c=sql.cursor()
                iArt+=1
                c.execute('INSERT INTO artistes(nom,status,path) VALUES("'+artiste+'",99,"'+artiste+'/")'.encode('utf8')

An artist that is displayed correctly in the file is written incorrectly to the database. What is the problem?

- vekah

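Your Python sample code is not valid; there are syntax errors in at least two places. Could you fix those first, please? - Martijn Pieters

Did you save the file as UTF-8 rather than as an ASCII file? - QuentinUK

2 Answers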

3

Unfortunately, the string.encode() method is not always reliable. Check out this thread for more information.

- Ev Haus

- Martijn Pieters · Accepted Answer

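You don't need to encode data that is already encoded. When you try, Python 2 first has to decode the byte string to unicode, using the default ASCII codec, before it can encode it back to UTF-8. That implicit decode is what fails here: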

>>> data = u'\u00c3'            # Unicode data
>>> data = data.encode('utf8')  # encoded to UTF-8
>>> data
'\xc3\x83'
>>> data.encode('utf8')         # Try to *re*-encode it
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Just write your data directly to the file; there is no need to encode data that is already encoded.
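The hidden step can be made visible by performing the ASCII decode by hand. This is a minimal sketch rather than part of the original answer: the helper name `explain_reencode` is hypothetical, and the decode is called explicitly so the snippet behaves the same on Python 2 and Python 3 instead of relying on Python 2's implicit conversion.

```python
# Sketch: reproduce by hand the decode that Python 2 performs implicitly
# when .encode() is called on an already-encoded byte string.

def explain_reencode(encoded):
    """Return the ASCII decode error message, or None if the decode succeeds."""
    try:
        encoded.decode('ascii')
    except UnicodeDecodeError as exc:
        return str(exc)
    return None

encoded = u'\u00c3'.encode('utf8')      # UTF-8 bytes for U+00C3
message = explain_reencode(encoded)     # non-ASCII bytes: decode fails
pure_ascii = explain_reencode(b'abc')   # ASCII-only bytes: decode succeeds
```

Byte strings that contain only ASCII survive the implicit decode, which is why this kind of bug tends to appear only once accented characters show up in the data.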

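If you build up unicode values instead, you do have to encode them before they can be written to a file. In that case use codecs.open(), which returns a file object that encodes unicode values to UTF-8 for you.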
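As a rough illustration of the codecs.open() approach, the sketch below writes a unicode value and then reads the raw bytes back. The temporary path and the sample text are assumptions made for the example; the question writes to a file under config/index/ instead.

```python
import codecs
import os
import tempfile

text = u'Beyonc\u00e9 / C\u00e9line'  # unicode text, not bytes

# Hypothetical throwaway location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), 'example.index')

# codecs.open() returns a file object that encodes unicode text on write.
with codecs.open(path, 'w', encoding='utf8') as findex:
    findex.write(text)

# Reading in binary mode shows the bytes that actually landed on disk.
with open(path, 'rb') as raw:
    data = raw.read()
```

On Python 3 the built-in open(path, 'w', encoding='utf8') gives the same result.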

You also really don't want to write out the UTF-8 BOM, unless you have to support Microsoft tools that cannot read UTF-8 otherwise (such as MS Notepad).
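For the cases where a BOM is unavoidable, Python ships a 'utf-8-sig' codec that adds the BOM when encoding and strips it when decoding. The sketch below only illustrates that behaviour with a made-up sample string; it is not taken from the original answer.

```python
import codecs

text = u'\u00e9t\u00e9'

plain = text.encode('utf-8')         # no BOM
with_bom = text.encode('utf-8-sig')  # BOM prepended

# 'utf-8-sig' strips a leading BOM when decoding; plain 'utf-8' keeps it
# as the character U+FEFF at the start of the text.
decoded_sig = with_bom.decode('utf-8-sig')
decoded_plain = with_bom.decode('utf-8')
```

A BOM that is decoded with plain 'utf-8' ends up inside the data as U+FEFF, which is one way stray characters reach a database.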

For your MySQL insert problem, you need to do two things:

Add charset='utf8' to your MySQLdb.connect() call.

Use unicode objects, not str objects when querying or inserting, but use sql parameters so the MySQL connector can do the right thing for you:

artiste = artiste.decode('utf8')  # it is already UTF8, decode to unicode

c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))

# ...

c.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
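The same pattern can be tried without a MySQL server. The sketch below is an assumption-laden stand-in: it uses the standard-library sqlite3 module in place of MySQLdb, so the placeholder is `?` rather than `%s`, and the table definition is guessed from the column names in the question's queries.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL one (assumption).
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE artistes '
    '(id INTEGER PRIMARY KEY, nom TEXT, status INTEGER, path TEXT)'
)

# UTF-8 bytes as read from a file, decoded to unicode before querying.
artiste = b'C\xc3\xa9line "la voix"'.decode('utf8')

cur = conn.cursor()
cur.execute('SELECT COUNT(id) FROM artistes WHERE nom=?', (artiste,))
if not cur.fetchone()[0]:
    # Parameters are passed separately, so the quotes and the accent in the
    # name need no manual escaping or encoding.
    cur.execute(
        'INSERT INTO artistes(nom, status, path) VALUES(?, 99, ?)',
        (artiste, artiste + u'/'),
    )
conn.commit()

cur.execute('SELECT nom, path FROM artistes')
row = cur.fetchone()
```

Besides handling the encoding, passing parameters this way avoids the quoting problems and SQL injection risk of building the query by string concatenation.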

It may work better if you use codecs.open() to decode the contents automatically instead:

import codecs
import MySQLdb as mdb

sql = mdb.connect('localhost','admin','*******','music_vibration', charset='utf8')
artists_inserted = 0

with codecs.open('config/index/'+index, 'r', 'utf8') as findex:
    for line in findex:
        if u'#artiste' not in line:
            continue

        artiste=line.split(u'[:::]')[1].strip()

        # Query and insert once per artist line, inside the loop.
        cursor = sql.cursor()
        cursor.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
        if not cursor.fetchone()[0]:
            cursor = sql.cursor()
            cursor.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
            artists_inserted += 1
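The line-parsing part of the loop can be checked on its own, without a file or a database. This is a sketch under assumptions: the helper name `iter_artists` is hypothetical, and the sample index content is invented, since the real file layout is not shown in the question beyond the `#artiste` marker and the `[:::]` separator.

```python
import io

def iter_artists(lines):
    """Yield the artist name from each line that carries the '#artiste' marker."""
    for line in lines:
        if u'#artiste' not in line:
            continue
        yield line.split(u'[:::]')[1].strip()

# Invented sample content; io.StringIO yields unicode lines, much like a file
# opened with codecs.open(..., 'r', 'utf8').
sample = io.StringIO(
    u'#artiste[:::]C\u00e9line\n'
    u'#album[:::]D\'eux\n'
    u'#artiste[:::]Zaz\n'
)
artists = list(iter_artists(sample))
```

Keeping the parsing separate from the database calls makes it easier to tell whether a wrong character comes from reading the file or from the insert.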

You may want to read up on Unicode, UTF-8, and encodings. I can recommend the following articles:

Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky