Python: Open and read a file containing German umlauts as Unicode

Question

4

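I have written a program that reads words from a text file, inserts them into an SQLite database, and also handles them as strings. However, I need to insert some words that contain German umlauts: ä, ö, ü, ß.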

Here is a prepared piece of code:

I tried both # -*- coding: iso-8859-15 -*- and # -*- coding: utf-8 -*-, but it made no difference!

    # -*- coding: iso-8859-15 -*-
    import sqlite3
    
    dbname = 'sampledb.db'
    filename ='text.txt'


    con = sqlite3.connect(dbname)
    cur = con.cursor()
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,name)''')    

    #f=open(filename)
    #text = f.readlines()
    #f.close()

    text = u'süß'

    print (text)
    cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,))       

    con.commit()

    sentence = "The name is: %s" %(text,)

    print (sentence)
    con.close()

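The code above runs fine. But I need to read the text from a file that contains the word 'süß'. So when I uncomment the three lines (f=open(filename) ...) and comment out text = u'süß', it raises this error: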

    sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.

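I tried using the codecs module to read a utf-8 or iso-8859-15 encoded file. But I could not decode it into the string 'süß', which I need to complete the sentence at the end of the code.

At one point I tried decoding to utf-8 before inserting into the database. That worked, but I could not use the result as a string.

Is there a way to import 'süß' from a file and use it both for inserting into sqlite and as a string?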

More details:

Here I add more details for clarification. The text file containing the word 'süß' is saved as utf-8. Using codecs.open(filename, 'r', 'utf-8') and text=f.read(), I read the file as the unicode string u'\ufeffs\xfc\xdf'. Inserting this unicode string into sqlite3 works smoothly: cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,)).

The problem is here: sentence = "The name is: %s" %(text,) gives u'The name is: \ufeffs\xfc\xdf'. I also need print(text) to output süß, but print(text) raises UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>.

Thank you.

- Amin

1

The coding declaration should have made a big difference for your text. - Mark Ransom

2

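To clarify: the encoding declaration at the top of the module affects text = u'süß' as written in the source code. It has no effect on text read from a file. You can use codecs.open() for the latter. - jfs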

readlines returns a list. Use f.read().strip() to get the file's text as a string. Then you can start worrying about encodings. - alexis

2 Answers

5

I was able to solve the problem. Thanks for your help.

Here it is:

    # -*- coding: iso-8859-1 -*-

    import sys
    import codecs
    import sqlite3

    f = codecs.open("suess_sweet.txt", "r", "utf-8")    # suess_sweet.txt file contains two
    text_in_unicode = f.read()                          # comma-separated words: süß, sweet
    f.close()

    stdout_encoding = sys.stdout.encoding or sys.getfilesystemencoding()

    con = sqlite3.connect('dict1.db')
    cur = con.cursor()
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,German,English)''')

    [ger,eng] = text_in_unicode.split(',')

    cur.execute('''insert into table1 (id,German,English) VALUES (NULL,?,?)''',(ger,eng))

    con.commit()

    sentence = "The German word is: %s" %(ger,)

    print sentence.encode(stdout_encoding)

    con.close()

I got some help from this page (it is in German).

The output is:

    The German word is: ?süß

A small problem remains: the "?". I thought that, after encoding, the Unicode u' prefix was being replaced by ?. The value of sentence is:

    >>> sentence
    u'The German word is: \ufeffs\xfc\xdf '

The encoded sentence is:

    >>> sentence.encode(stdout_encoding)
    'The German word is: ?s\xfc\xdf '

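So it is not what I thought: the ? stands in for the leading u'\ufeff' character, which the output encoding cannot represent. A simple solution that came to mind for getting rid of the question mark is to use the replace function:

    sentence = "The German word is: %s" %(ger,)
    to_print = sentence.encode(stdout_encoding)
    to_print = to_print.replace('?','')

    >>> print(to_print)
    The German word is: süß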
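
One caveat with replace('?','') is that it would also delete any genuine question marks in the text. A possible alternative, sketched here for Python 3 and assuming the only unwanted character is a single leading u'\ufeff', is to strip that character from the decoded text before splitting and formatting. The helper name strip_bom, the sample data, and the in-memory database are illustrative assumptions, not part of the code above.

```python
import sqlite3

BOM = "\ufeff"

def strip_bom(text):
    """Return text without a single leading byte order mark, if present."""
    return text[len(BOM):] if text.startswith(BOM) else text

# Stands in for the decoded contents of suess_sweet.txt (assumed sample data).
text_in_unicode = "\ufeffsüß,sweet"

# Remove the BOM first, then split and trim surrounding whitespace.
ger, eng = (part.strip() for part in strip_bom(text_in_unicode).split(","))

con = sqlite3.connect(":memory:")  # in-memory database, for illustration only
cur = con.cursor()
cur.execute("create table table1 (id INTEGER PRIMARY KEY, German, English)")
cur.execute("insert into table1 (id, German, English) VALUES (NULL, ?, ?)", (ger, eng))
con.commit()
stored = cur.execute("select German, English from table1").fetchone()
con.close()

sentence = "The German word is: %s" % (ger,)
print(sentence)
```

Because the BOM is removed before anything else touches the text, both the database row and the printed sentence use the same clean string, and no characters are dropped after encoding.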

Thank you SO :)

- Amin

- Mark Ransom · Accepted Answer

When you open and read a file, you get 8-bit strings, not Unicode. In Python 2, if you want Unicode strings, you need to open the file with codecs.open:

    f=codecs.open(filename, 'r', 'utf-8')

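Hopefully you have moved to Python 3, where the encoding is built into the regular open call. In addition, unless you open the file in binary mode with the 'b' flag, you always get Unicode strings rather than 8-bit byte strings, and if you do not specify an encoding, a platform-dependent default is used.

    f=open(filename, 'r', encoding='utf-8')

Of course, depending on how the file was written, you may need to use 'iso-8859-15' instead.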
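
To make the effect of the encoding argument concrete, here is a small Python 3 sketch. It writes the same word once as UTF-8 and once as ISO-8859-15, then reads each file back. The temporary file names are made up for the example.

```python
import os
import tempfile

word = "süß"

# Write the same word twice, once per encoding (temporary files, for illustration).
tmpdir = tempfile.mkdtemp()
utf8_path = os.path.join(tmpdir, "utf8.txt")
latin9_path = os.path.join(tmpdir, "latin9.txt")
with open(utf8_path, "w", encoding="utf-8") as f:
    f.write(word)
with open(latin9_path, "w", encoding="iso-8859-15") as f:
    f.write(word)

# Reading with the matching encoding returns the original text in both cases.
with open(utf8_path, "r", encoding="utf-8") as f:
    from_utf8 = f.read()
with open(latin9_path, "r", encoding="iso-8859-15") as f:
    from_latin9 = f.read()

# Reading the ISO-8859-15 bytes as UTF-8 fails: the byte for 'ü' (0xFC)
# is not valid UTF-8.
try:
    with open(latin9_path, "r", encoding="utf-8") as f:
        f.read()
    mismatch_error = None
except UnicodeDecodeError as exc:
    mismatch_error = exc
```

The encoding passed to open has to match how the file was actually saved. A mismatch in this direction fails loudly, while reading UTF-8 bytes as ISO-8859-15 would instead silently produce wrong characters.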

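Edit: One big difference between your test code and the commented-out code is that reading from the file yields a list, while the test uses a single string. Maybe your problem has nothing to do with Unicode. Try making this substitution in your test code and see whether it produces the same error:

    text = [u'süß']

Unfortunately, I don't have enough experience with SQL in Python to help you further.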
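
The list-versus-string point can be checked with a short Python 3 sketch, assuming an in-memory database and a one-line input. The exact exception class and message may differ between Python versions, so the sketch catches the base class sqlite3.Error rather than a specific subclass.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory database, for illustration only
cur = con.cursor()
cur.execute("create table table1 (id INTEGER PRIMARY KEY, name)")

lines = ["süß\n"]  # what f.readlines() would return for a one-line file

# Binding the whole list as one parameter fails: sqlite3 cannot store a list.
try:
    cur.execute("insert into table1 (id, name) VALUES (NULL, ?)", (lines,))
    list_error = None
except sqlite3.Error as exc:
    list_error = exc

# Binding each stripped line as its own string works.
cur.executemany(
    "insert into table1 (id, name) VALUES (NULL, ?)",
    [(line.strip(),) for line in lines],
)
con.commit()
names = [row[0] for row in cur.execute("select name from table1")]
con.close()
```

If this reproduces the binding error, the cause is the parameter's type, not its encoding.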

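Also, when you print a list rather than a single string, the Unicode characters are replaced with their equivalent escape sequences. To see what the strings really look like, print them one at a time. If you're curious, this is the difference between __str__ and __repr__.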
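
A brief Python 3 sketch of that distinction follows. Python 3's repr keeps printable non-ASCII characters, so the built-in ascii() is used here to approximate the escaped form that Python 2 shows for unicode strings inside a list.

```python
words = ["süß"]

# str() of a list uses repr() on each element, so the quotes are included.
as_list = str(words)

# ascii() escapes non-ASCII characters, similar to Python 2's repr of unicode.
escaped = ascii(words)

# Printing the elements one at a time uses str(), which shows the text itself.
for word in words:
    print(word)
```

Here as_list is "['süß']" and escaped contains the \xfc and \xdf escape sequences. The string's content is the same in both cases, and only the displayed representation differs.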

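Edit 2: The character u'\ufeff' is called a byte order mark, or BOM, and some editors insert it to indicate that the file really is UTF-8 encoded. You should remove it before using the string. There should be only one BOM character, at the very start of the file. For an example, see Reading Unicode file data with BOM chars in Python.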
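
One way to drop the BOM at read time, sketched here for Python 3, is the standard 'utf-8-sig' codec, which skips a leading BOM when decoding. The temporary file and its contents are made up for the example.

```python
import os
import tempfile

# Create a UTF-8 file that starts with a BOM, as some editors do (illustrative data).
path = os.path.join(tempfile.mkdtemp(), "suess.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbf" + "süß".encode("utf-8"))

# Plain 'utf-8' keeps the BOM as the first character of the decoded text.
with open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# 'utf-8-sig' drops a leading BOM if there is one and is harmless otherwise.
with open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()
```

The same codec name also works with codecs.open in Python 2, so the approach is not limited to Python 3.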