Iterating over Unicode strings in Python and comparing them against a Unicode dictionary

Question

I have two Python dictionaries containing information about Japanese words and characters:

vocabDic : contains vocabulary, key: word, value: dictionary with information about it

kanjiDic : contains kanji (single Japanese characters), key: kanji, value: dictionary with information about it

Now I would like to iterate through each character of each word in vocabDic and look that character up in the kanji dictionary. My goal is to create a CSV file that I can then import into a database as a join table for vocabulary and kanji.
My Python version is 2.6.
My code is as follows:

kanjiVocabJoinWriter = csv.writer(open('kanjiVocabJoin.csv', 'wb'), delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
kanjiVocabJoinCount = 1

#loop through dictionary
for key, val in vocabDic.iteritems():
    if val['lang'] is 'jpn': # only check japanese words
        vocab = val['text']
        print vocab
        # loop through vocab string
        for v in vocab:
             test = kanjiDic.get(v)
             print v
             print test
             if test is not None:
                print str(kanjiVocabJoinCount)+','+str(test['id'])+','+str(val['id'])
                kanjiVocabJoinWriter([str(kanjiVocabJoinCount),str(test['id']),str(val['id'])])
                kanjiVocabJoinCount = kanjiVocabJoinCount+1

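If I print the variables to the command line, I get:
vocab: works, printed in Japanese
v (one character of vocab in the for loop): �
test (the character looked up in the kanji dictionary): None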

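It seems to me that the for loop messes up the encoding.
I have tried various functions (decode, encode, and so on) but have had no luck so far.
Any ideas on how I could get this to work?
Help would be very much appreciated.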

- daniela

Can you use Python 3? Its Unicode support is better. - mmmmmm

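Or, from __future__ import unicode_literals? - utdemir

Thanks a lot! Updating to Python 3 solved the problem :D - daniela

There is an extensive debate about whether Python 3's "everything is Unicode" approach is better, so I would not make a statement as lightly as Mark did. In my view it is not better, and unutbu's answer below is the better approach. Python 2.x combined with the default encoding set to utf8 is a superior solution for managing Unicode strings in Python. - thomdask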

1 Answer

- unutbu · Accepted Answer

From the problem you describe, it looks like vocab is an encoded str object, not a unicode object.

For concreteness, suppose vocab equals u'債務の天井' encoded in utf-8:

In [42]: v = u'債務の天井'
In [43]: vocab = v.encode('utf-8')   # stands in for val['text']
In [44]: vocab
Out[44]: '\xe5\x82\xb5\xe5\x8b\x99\xe3\x81\xae\xe5\xa4\xa9\xe4\xba\x95'

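If you loop over the encoded str object, you get one byte at a time: \xe5, \x82, \xb5, and so on. If you loop over the unicode object instead, you get one unicode character at a time: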

In [45]: for v in u'債務の天井':
   ....:     print(v)    
債
務
の
天
井

Note that the first unicode character, encoded in utf-8, takes 3 bytes:

In [49]: u'債'.encode('utf-8')
Out[49]: '\xe5\x82\xb5'

That is why looping over the bytes and printing them one at a time (for example, print '\xe5') does not produce recognizable characters.
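The mismatch can be checked directly by comparing lengths and lookups. The following is a minimal sketch, assuming the kanji dictionary is keyed by unicode characters; `kanji_dic` and its ids are hypothetical stand-ins for kanjiDic, not data from the question.

```python
# -*- coding: utf-8 -*-
# Sketch: why looking up one byte at a time never matches a kanji key.
# kanji_dic is a hypothetical stand-in for kanjiDic (assumed to have unicode keys).

text = u'債務の天井'
encoded = text.encode('utf-8')   # what val['text'] appears to hold

kanji_dic = {u'債': {'id': 1}, u'務': {'id': 2}}

char_count = len(text)       # number of unicode characters
byte_count = len(encoded)    # number of bytes after UTF-8 encoding

# One-element slices behave the same on Python 2 and Python 3.
byte_hits = [encoded[i:i + 1] for i in range(len(encoded))
             if encoded[i:i + 1] in kanji_dic]

# Iterating the unicode object yields whole characters, which can match keys.
char_hits = [ch for ch in text if ch in kanji_dic]

print(char_count, byte_count, byte_hits, char_hits)
```

Each of the five characters here occupies three bytes in UTF-8, so the byte string is three times as long as the text, and none of its single bytes equals a dictionary key.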

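So it looks like you need to decode the str objects and work with unicode objects. You did not mention which encoding your str objects use. If it is utf-8, you would decode like this: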

vocab=val['text'].decode('utf-8')
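Putting the decode step into the loop from the question might look like the sketch below. It is written for Python 3, where bytes plays the role of Python 2's str, and it assumes UTF-8 input and a kanji dictionary keyed by unicode characters. The helper name `build_join_rows` and the sample data are hypothetical. It also compares the language code with `!=` rather than `is`, and writes rows through the writer's `writerows` method, since a csv writer object is not callable.

```python
# -*- coding: utf-8 -*-
import csv
import io


def build_join_rows(vocab_dic, kanji_dic, encoding='utf-8'):
    """Return [join_id, kanji_id, vocab_id] rows for each kanji found in Japanese words."""
    rows = []
    join_id = 1
    for word, info in vocab_dic.items():
        if info['lang'] != 'jpn':       # compare values with ==/!=, not `is`
            continue
        text = info['text']
        if isinstance(text, bytes):     # decode byte strings before iterating
            text = text.decode(encoding)
        for ch in text:                 # now one unicode character per step
            kanji = kanji_dic.get(ch)
            if kanji is not None:
                rows.append([join_id, kanji['id'], info['id']])
                join_id += 1
    return rows


# Hypothetical sample data shaped like the dictionaries in the question.
vocab_dic = {
    u'債務の天井': {'id': 10, 'lang': 'jpn', 'text': u'債務の天井'.encode('utf-8')},
    u'ceiling': {'id': 11, 'lang': 'eng', 'text': b'ceiling'},
}
kanji_dic = {u'債': {'id': 1}, u'天': {'id': 2}}

rows = build_join_rows(vocab_dic, kanji_dic)

# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Collecting the rows in a list before writing keeps the lookup logic separate from file handling, which makes it easier to check the result before anything is imported into the database.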

If you are not sure which encoding val['text'] uses, post the output of

print(repr(vocab))

and maybe we can guess the encoding.
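One rough way to narrow down the encoding yourself is to try a few likely candidates and see which ones decode without error. The sketch below is only a heuristic: the helper `candidate_decodings` and the list of candidate codecs are assumptions chosen for Japanese text, and a decode that succeeds does not prove the encoding is the right one, so the decoded text still needs to be checked by eye.

```python
# -*- coding: utf-8 -*-

def candidate_decodings(raw, encodings=('utf-8', 'shift_jis', 'euc_jp', 'iso2022_jp')):
    """Map each encoding that decodes `raw` without error to the decoded text."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            continue   # this codec cannot interpret the bytes; try the next one
    return results


text = u'債務の天井'

raw_utf8 = text.encode('utf-8')
print(repr(raw_utf8))                   # the kind of output the answer asks for
from_utf8 = candidate_decodings(raw_utf8)

raw_eucjp = text.encode('euc_jp')
from_eucjp = candidate_decodings(raw_eucjp)

for name, decoded in sorted(from_utf8.items()):
    print(name, decoded)
```

If more than one candidate decodes cleanly, the one whose output reads as sensible Japanese is the likely answer.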