UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)

3

I am trying to save the contents of a dictionary to a file, but when I try to write it I get the following error:

Traceback (most recent call last):
  File "P4.py", line 83, in <module>
    outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)

Here is the code:

from collections import Counter

with open("corpus.txt") as inf:
    wordtagcount = Counter(line.decode("latin_1").rstrip() for line in inf)

with open("lexic.txt", "w") as outf:
    outf.write('Palabra\tTag\tApariciones\n'.encode("utf-8"))
    for word,count in wordtagcount.iteritems():
        outf.write(u"{}\t{}\n".format(word, count).encode("utf-8"))
"""
2) TAGGING USING THE MODEL
Given the test files, assign each word the most probable tag
according to the model. Save the result in files that use
this format for each line: Word  Prediction
"""
file=open("lexic.txt", "r") # open the lexic file (our model) (try with this one)
data=file.readlines()
file.close()
diccionario = {}

"""
In this portion of code we iterate the lines of the .txt document and we create a dictionary with a word as a key and a List as a value
Key: word
Value: List ([tag, #ocurrencesWithTheTag])
"""
for linea in data:
    aux = linea.decode('latin_1').encode('utf-8')
    sintagma = aux.split('\t')  # Here we separate the String in a list: [word, tag, ocurrences], word=sintagma[0], tag=sintagma[1], ocurrences=sintagma[2]
    if (sintagma[0] != "Palabra" and sintagma[1] != "Tag"): #We are not interested in the first line of the file, this is the filter
        if (diccionario.has_key(sintagma[0])): #Here we check if the word was included before in the dictionary
            aux_list = diccionario.get(sintagma[0]) #We know the name already exists in the dic, so we create a List for every value
            aux_list.append([sintagma[1], sintagma[2]]) #We add to the list the tag and th ocurrences for this concrete word
            diccionario.update({sintagma[0]:aux_list}) #Update the value with the new list (new list = previous list + new appended element to the list)
        else: #If in the dic do not exist the key, que add the values to the empty list (no need to append)
            aux_list_else = ([sintagma[1],sintagma[2]])
            diccionario.update({sintagma[0]:aux_list_else})

"""
Here we create a new dictionary based on the dictionary created before, in this new dictionary (diccionario2) we want to keep the next
information:
Key: word
Value: List ([suggestedTag, #ocurrencesOfTheWordInTheDocument, probability])

For retrieve the information from diccionario, we have to keep in mind:

In case we have more than 1 Tag associated to a word (keyword ), we access to the first tag with keyword[0], and for ocurrencesWithTheTag with keyword[1],
from the second case and forward, we access to the information by this way:

diccionario.get(keyword)[2][0] -> with this we access to the second tag
diccionario.get(keyword)[2][1] -> with this we access to the second ocurrencesWithTheTag
diccionario.get(keyword)[3][0] -> with this we access to the third tag
...
..
.
etc.
"""
diccionario2 = dict.fromkeys(diccionario.keys())#We create a dictionary with the keys from diccionario and we set all the values to None
with open("estimation.txt", "w") as outfile:
    for keyword in diccionario:
        tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8')) #tagSugerido is the tag with more ocurrences for a concrete keyword
        maximo = float(diccionario.get(keyword)[1]) #maximo is a variable for the maximum number of ocurrences in a keyword
        if ((len(diccionario.get(keyword))) > 2): #in case we have > 2 tags for a concrete word
            suma = float(diccionario.get(keyword)[1])
            for i in range (2, len(diccionario.get(keyword))):
                suma += float(diccionario.get(keyword)[i][1])
                if (diccionario.get(keyword)[i][1] > maximo):
                    tagSugerido = unicode(diccionario.get(keyword)[i][0]).decode('utf-8'))
                    maximo = float(diccionario.get(keyword)[i][1])
            probabilidad = float(maximo/suma);
            diccionario2.update({keyword:([tagSugerido, suma, probabilidad])})

        else:
            diccionario2.update({keyword:([diccionario.get(keyword)[0],diccionario.get(keyword)[1], 1])})

        outfile.write(u"{}\t{}\n".format(keyword, tagSugerido).encode("utf-8"))

The expected output would look like this:
keyword(String)  tagSugerido(String):
Hello    NC
Friend   N
Run      V
...etc

The offending line is:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))

Thanks.
2 Answers

2

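Since you did not provide a short, clear piece of code that illustrates your problem, I can only give you general advice about what the error is likely to be:

If you get a decode error, it is because tagSugerido is being treated as ASCII rather than Unicode. To fix this, you should do the following: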

tagSugerido = diccionario.get(keyword)[0].decode('utf-8')

so that it is stored as Unicode.
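The failure in the traceback can be reproduced in isolation. The sketch below assumes the data contains an accented letter whose UTF-8 encoding starts with the byte 0xc3; the sample word is made up for illustration. It should run on both Python 2 and Python 3.

```python
# Hypothetical sample: "cafe" with an e-acute, encoded as UTF-8 bytes.
# The e-acute becomes the two bytes 0xc3 0xa9.
raw = b"caf\xc3\xa9"

# Decoding with the ascii codec fails on the first byte >= 128. Python 2
# performs this decode implicitly when a byte string meets a unicode string.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the codec the bytes were actually written in succeeds.
text = raw.decode("utf-8")

print(ascii_error)
print(len(raw), len(text))
```

The byte string is five bytes long while the decoded text is four characters long, which is why positions reported in the error refer to bytes rather than characters.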

Then you may get an encoding error at the write() stage, and you should fix your write as follows:

outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))

should be:

outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))

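I just answered a very similar question, which you may want to look at. When working with Unicode strings, switch to Python 3; it will make your life easier!
If you cannot switch to Python 3 right now, you can make Python 2 behave almost like Python 3 with the following __future__ import statement: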
from __future__ import absolute_import, division, print_function, unicode_literals

N.B.: do not do this:
file=open("lexic.txt", "r") # open the lexic file (our model) (try with this one)
data=file.readlines()
file.close()

That version will not close the file descriptor properly if readlines fails. You should do this instead:

with open("lexic.txt", "r") as f:
    data=f.readlines()

This ensures the file is always closed, even on failure.

N.B.2: avoid naming a variable file, since that shadows a built-in Python type; use f or lexic_file instead.
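A quick way to see the difference is to force a failure inside the with block and check that the file is closed afterwards. The sketch below uses a throwaway temporary file in place of lexic.txt, and the RuntimeError is raised only for illustration.

```python
import io
import os
import tempfile

# Create a throwaway file standing in for lexic.txt (hypothetical content).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as out:
    out.write(u"Palabra\tTag\tApariciones\n")

handle = None
try:
    with io.open(path, "r", encoding="utf-8") as lexic_file:
        handle = lexic_file
        data = lexic_file.readlines()
        # Simulate something going wrong while the file is still open.
        raise RuntimeError("simulated failure while processing")
except RuntimeError:
    pass

# The with block closed the file even though an exception was raised.
closed_after_failure = handle.closed
os.remove(path)
print(closed_after_failure)
```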


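Thanks for your help. I updated the code following your tips, but now I get this error: File "P4.py", line 75 outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8"))) ^ SyntaxError: invalid syntax - Cheknov
@Gerard That line does not give me invalid syntax. Maybe the problem is on another line? - Dan Getz
@gerard Please create an SSCCE. An error pasted into a comment (in a way that makes it impossible to know which character the ^ points at), plus code that parses two files we do not have, is not helpful. Have a look at my other answer, which shows how to reproduce the error with a few lines of code and a couple of one-line files. - zmo
And, @Gerard, please consider switching to Python 3, as all your Unicode problems will vanish. The few changes needed to make your code compatible are definitely worth it! - zmo
If you learn to take proper advantage of Python 3 and show your teacher that teaching outdated tools makes no sense, he can only accept your code. And if you use the future or six packages to make the code compatible with both Python 2 and Python 3, you will have worked around his poor choice for the assignment. - zmo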

2

As zmo suggested:

outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))

should be:

outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))

Notes on Unicode in Python 2

Your software should only work with unicode strings internally, converting to a particular encoding on output.
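A minimal sketch of that principle applied to the write from the question, assuming keyword and tagSugerido have already been decoded to unicode. In Python 2, mixing a byte string into a unicode format string triggers an implicit ascii decode, so the sketch formats text only and encodes the whole line once. The sample word, the tag, and the BytesIO object standing in for the output file are made up for illustration.

```python
import io

# Hypothetical values, already decoded to text (unicode in Python 2).
keyword = u"canci\u00f3n"
tagSugerido = u"NC"

# Format with text only, then encode the finished line exactly once.
line = u"{}\t{}\n".format(keyword, tagSugerido)
encoded = line.encode("utf-8")

# BytesIO stands in for a file opened in binary mode.
outfile = io.BytesIO()
outfile.write(encoded)
print(repr(outfile.getvalue()))
```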

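To avoid making the same mistake over and over, make sure you understand the difference between the ascii and utf-8 encodings, and between str and unicode objects in Python.

The difference between the ASCII and UTF-8 encodings:

ASCII needs only one byte to represent every character in the ASCII character set, while UTF-8 needs up to four bytes to represent the full Unicode character set.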

ascii (default)
1    If the code point is < 128, each byte is the same as the value of the code point.
2    If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

utf-8 (unicode transformation format)
1    If the code point is <128, it’s represented by the corresponding byte value.
2    If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3    Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
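These rules can be checked directly. The sketch below encodes four sample code points (chosen here for illustration) and records how many bytes each takes in UTF-8, then checks which of them the ascii codec accepts.

```python
# Sample code points, one per UTF-8 length class (chosen for illustration).
samples = [
    u"A",           # U+0041, below 128
    u"\u00e9",      # U+00E9, between 128 and 0x7ff
    u"\u20ac",      # U+20AC, above 0x7ff
    u"\U0001F600",  # U+1F600, outside the Basic Multilingual Plane
]

# Number of bytes each sample needs when encoded as UTF-8.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]


def ascii_ok(text):
    """Return True if text can be encoded with the ascii codec."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


ascii_results = [ascii_ok(ch) for ch in samples]
print(utf8_lengths)
print(ascii_results)
```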

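The difference between str and unicode objects:

Roughly speaking, str is a byte string and unicode is a Unicode string. A str holds bytes in some encoding such as ascii or utf-8; a unicode object holds code points and only acquires an encoding when it is encoded to bytes.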

str vs. unicode
1   str     = byte string (8-bit) - uses \x and two digits
2   unicode = unicode string      - uses \u and four digits
3   basestring
        /\
       /  \
    str    unicode

If you follow a few simple rules, you should be fine handling str/unicode objects in different encodings such as ascii, utf-8, or whatever encoding you have to use:

Rules
1    encode(): Gets you from Unicode -> bytes
     encode([encoding], [errors='strict']), returns an 8-bit string version of the Unicode string,
2    decode(): Gets you from bytes -> Unicode
     decode([encoding], [errors]) method that interprets the 8-bit string using the given encoding
3    codecs.open(encoding="utf-8"): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
4    u'': Makes your string literals into Unicode objects rather than byte sequences.
5    unicode(string[, encoding, errors]) 
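A short sketch of rules 1 to 3. It uses io.open rather than codecs.open, because io.open also accepts an encoding argument and behaves the same way on Python 2 and Python 3; that substitution, the temporary file, and the sample text are assumptions made for illustration.

```python
import io
import os
import tempfile

text = u"ni\u00f1o\tNC"           # text (unicode) with one non-ASCII letter

data = text.encode("latin_1")     # rule 1: text -> bytes
back = data.decode("latin_1")     # rule 2: bytes -> text

# Rule 3: let the file object do the encoding and decoding.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as outf:
    outf.write(text + u"\n")
with io.open(path, "r", encoding="utf-8") as inf:
    first_line = inf.readline().rstrip(u"\n")
os.remove(path)

print(back == text, first_line == text)
```

With the encoding handled by the file object, the rest of the program never sees byte strings and no explicit encode() or decode() calls are needed around each write.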

Warning: do not use encode() on bytes or decode() on Unicode objects.
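One way to respect this warning is to convert at the boundary with a small helper that decodes only when it is actually handed bytes. The function name to_text and its default encoding are hypothetical, introduced here for illustration.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only if it is a byte string.

    Hypothetical helper: on Python 2, bytes is an alias of str, so this
    decodes str objects and leaves unicode objects untouched.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


from_bytes = to_text(b"a\xc3\xb1o")   # bytes are decoded
from_text = to_text(u"a\u00f1o")      # text passes through unchanged
print(from_bytes == from_text)
```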

Once again: software should only work with Unicode strings internally, converting to a particular encoding on output.

