UnicodeDecodeError: 'utf8' codec can't decode byte 0xba in position 1266: invalid start byte

Question

4

I am trying to train on some text data with scikit-learn. The same code runs without errors on other machines, but on my computer it fails with:

File "/root/Desktop/karim/svn/questo-anso/v5/trials/classify/domain_detection_final/test_classifier_temp.py", line 130, in trainClassifier
    X_train = self.vectorizer.fit_transform(self.data_train.data)
  File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 1270, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 808, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 741, in _count_vocab
    for feature in analyze(doc):
  File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 233, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 111, in decode
    doc = doc.decode(self.encoding, self.decode_error)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xba in position 1266: invalid start byte

I have already looked at similar posts, but they did not help.

Update:

self.data_train = self.fetch_data(cache, subset='train')
if not os.path.exists(self.root_dir + "/autocreated/vectorizer.txt"):
    self.vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                      stop_words='english')
    start_time = time()
    print("Transforming the dataset")
    X_train = self.vectorizer.fit_transform(self.data_train.data)  # Error is here
    joblib.dump(self.vectorizer, self.root_dir + "/autocreated/vectorizer.txt")

- user123

1

0xba is indeed an invalid start byte. What is the problem? - n. m.
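
For context on why 0xba is rejected: in UTF-8, a byte whose top two bits are 10 can only continue a multi-byte sequence and can never begin a character. The following is a minimal sketch in Python 3 (the traceback in the question comes from Python 2.7, where the same bytes would be a str); the variable names are illustrative only.

```python
# 0xba is 0b10111010: the leading bits "10" mark a UTF-8 continuation byte,
# which may only follow a lead byte and can never start a character.
byte = 0xBA
is_continuation = (byte & 0b11000000) == 0b10000000

try:
    bytes([byte]).decode("utf-8")
    error_reason = None
except UnicodeDecodeError as exc:
    error_reason = exc.reason  # the codec's own explanation of the failure

print(format(byte, "08b"), is_continuation, error_reason)
```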

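@n.m.: I don't know either. The code looks fine, but I can't tell why it raises an encoding error. - user123

1

This is probably a problem with the input text rather than with the code. Is the text encoded as UTF-8? Did the same code work with the same text before? (You did not mention the text.) - Fumu 7

@user123, yes, text is a variable. - MaNKuR

@Fumu7: I guess that may be the problem, but since the text can contain anything, how should I handle such cases? - user123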
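
One way to answer the "is the text really UTF-8?" question is to try decoding every document and report where it fails. The sketch below assumes the training data is a list of byte strings, as the traceback suggests; find_undecodable and sample_docs are hypothetical names, and the sample data stands in for self.data_train.data. It is written for Python 3.

```python
def find_undecodable(docs, encoding="utf-8"):
    """Return (index, byte offset, offending byte) for each doc that fails to decode."""
    problems = []
    for index, doc in enumerate(docs):
        try:
            doc.decode(encoding)
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first byte the codec rejected.
            problems.append((index, exc.start, doc[exc.start:exc.start + 1]))
    return problems


# Hypothetical sample standing in for the real list of training documents.
sample_docs = [b"plain ascii", "caf\u00e9".encode("utf-8"), b"N\xba 5"]
print(find_undecodable(sample_docs))
```

Knowing which documents fail, and on which byte, makes it easier to decide whether the data is in a different encoding or merely contains a few corrupt bytes.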

2 Answers

3

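The problem occurred while processing the training data. One solution that worked for me was to ignore the errors with decode_error='ignore'; there may be other solutions.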

self.vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                  stop_words='english', decode_error='ignore')
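
The traceback in the question shows the vectorizer calling doc.decode(self.encoding, self.decode_error), so the effect of this setting can be previewed with a plain bytes.decode call. A small Python 3 sketch, using a made-up document that contains the byte 0xba:

```python
raw = b"Room N\xba 5"  # hypothetical document containing the byte 0xba

ignored = raw.decode("utf-8", "ignore")    # drops the offending byte silently
replaced = raw.decode("utf-8", "replace")  # marks the loss with U+FFFD

print(repr(ignored))
print(repr(replaced))
```

With 'ignore' the character disappears without a trace, which is why the setting hides encoding problems rather than fixing them; 'replace' at least leaves a visible marker.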

- user123

7

That is a bad solution. Now you are just hiding the fact that you cannot produce a correct input file. - Karol S

- Burhan Khalid · Accepted Answer

7

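Your file is actually encoded in ISO-8859-1, not UTF-8. You need to decode it correctly before re-encoding it.

0xBA is the masculine ordinal indicator (º) in ISO-8859-1, a character often used as a numero sign, as in "Nº".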
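
The decode-then-re-encode step could look like the following Python 3 sketch. The sample bytes are invented for illustration, and whether the real data is ISO-8859-1 is the assumption made in this answer, not something the code can verify.

```python
import unicodedata

raw = b"N\xba 5"  # hypothetical bytes as they might appear in an ISO-8859-1 file

text = raw.decode("iso-8859-1")    # 0xba becomes the code point U+00BA
utf8_bytes = text.encode("utf-8")  # U+00BA is written as the two bytes c2 ba

print(text, utf8_bytes, unicodedata.name(text[1]))
```

If every document shares that encoding, another route would be to pass encoding='latin-1' to the vectorizer instead of converting the files, since the traceback shows it decoding each document with self.encoding.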

- Burhan Khalid

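Thanks, man. Do you mean decoding the data I use for training? I am using the 20 Newsgroups data: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html. I also tried copying text from Wikipedia, but it gives the same error. - user123

I also checked the character encoding of the text, and it is UTF-8. - user123

In a text editor, the "Save As" dialog shows the current encoding. I even tried to convert the data, for example with self.data_train.data = unicode(self.data_train.data, "utf-8"), but that raises "TypeError: coercing to Unicode: need string or buffer, list found". - user123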
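
The TypeError arises because the data is a list of documents, so each element has to be decoded on its own rather than the list as a whole. A hedged Python 3 sketch follows; decode_documents is a hypothetical helper, and falling back to ISO-8859-1 is an assumption taken from this answer, not a property of the dataset.

```python
def decode_documents(docs, primary="utf-8", fallback="iso-8859-1"):
    """Decode each byte string, trying `primary` first and `fallback` on failure."""
    decoded = []
    for doc in docs:
        try:
            decoded.append(doc.decode(primary))
        except UnicodeDecodeError:
            # ISO-8859-1 maps every byte to a character, so this never raises.
            decoded.append(doc.decode(fallback))
    return decoded


docs = ["caf\u00e9".encode("utf-8"), b"N\xba 5"]  # made-up sample documents
print(decode_documents(docs))
```

In the Python 2.7 setup from the question, the equivalent per-item form would be a list comprehension over self.data_train.data calling unicode(doc, "utf-8") on each element.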

2

The editor shows its own internal setting, not the actual file encoding, because it has no way of knowing the actual encoding. - n. m.
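
To illustrate why an editor can only guess: the same bytes often decode without error under more than one encoding and simply yield different text. A minimal Python 3 sketch with an invented two-byte sample:

```python
data = "\u00ba".encode("utf-8")  # the two bytes c2 ba

as_utf8 = data.decode("utf-8")         # one character: U+00BA
as_latin1 = data.decode("iso-8859-1")  # two characters: U+00C2 U+00BA, no error

print(repr(as_utf8), repr(as_latin1))
```

Both calls succeed, so nothing in the bytes themselves says which reading was intended.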

Hey, thanks everyone, the problem has been solved by using decode_error='ignore'. Thanks for your efforts. I added an answer that may help others. - user123
