'utf-8' codec can't decode byte 0xa3 in position 28: invalid start byte

Question

5

I am trying to read a CSV file from Google Drive with pandas, but I get the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 28: invalid start byte".

My code:

df = pd.read_csv("/content/gdrive/My Drive/data/OnlineRetail.csv")

Output:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 28: invalid start byte

During handling of the above exception, another exception occurred:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-6-65a06557fa8d> in <module>()
----> 1 df = pd.read_csv("/content/gdrive/My Drive/data/OnlineRetail.csv")

3 frames
/usr/local/lib/python3.7/dist-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    686     )
    687 
--> 688     return _read(filepath_or_buffer, kwds)
    689 
    690 

/usr/local/lib/python3.7/dist-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    458 
    459     try:
--> 460         data = parser.read(nrows)
    461     finally:
    462         parser.close()

/usr/local/lib/python3.7/dist-packages/pandas/io/parsers.py in read(self, nrows)
   1196     def read(self, nrows=None):
   1197         nrows = _validate_integer("nrows", nrows)
-> 1198         ret = self._engine.read(nrows)
   1199 
   1200         # May alter columns / col_dict

/usr/local/lib/python3.7/dist-packages/pandas/io/parsers.py in read(self, nrows)
   2155     def read(self, nrows=None):
   2156         try:
-> 2157             data = self._reader.read(nrows)
   2158         except StopIteration:
   2159             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 28: invalid start byte

- Shadic Mersal

Try pd.read_csv('...', encoding='utf-8'). - PCM

After trying your suggestion, I got the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 79780: invalid start byte". - Shadic Mersal

Or try encoding=ascii. See the different encodings here: https://docs.python.org/2.4/lib/standard-encodings.html - PCM

"TypeError: expected string or bytes-like object." Still not working. :( - Shadic Mersal

1

Or you can try read_csv(filename, encoding='unicode_escape'). - Alexandra Dudkina
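
A small sketch of what the unicode_escape codec does to such bytes, using only the standard library. The sample bytes are made up for illustration. The codec decodes high bytes as Latin-1, which is why it gets past 0xa3, but it also interprets backslash sequences, so it may silently change data that contains backslashes.

```python
# Hypothetical sample: a pound sign stored as the single byte 0xa3,
# plus a literal backslash followed by the letter n.
raw = b"Gift \xa35, path C:\\new"

# unicode_escape decodes 0xa3 like Latin-1 does, but also turns the
# two characters backslash + n into a real newline.
escaped = raw.decode("unicode_escape")

# Plain Latin-1 maps every byte to one character and leaves the backslash alone.
latin = raw.decode("latin-1")

print(repr(escaped))
print(repr(latin))
```

If the file can contain backslashes, encoding='latin-1' or encoding='cp1252' is probably the safer choice.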

3

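The obvious answer is that the file is not encoded as UTF-8. In ISO-8859-1 (also known as Latin-1) and cp1252 (also known as Windows-1252), the byte 0xA3 is the pound sign (£), so if that character appears at position 28 of the file, the file was probably written in one of those two encodings. Note that although ISO-8859-1 can decode anything, you still need to look at the result to make sure it is correct. If you post a hex dump of the first few lines of the CSV file, we can reproduce the problem. - Mark Tolonen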
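
A minimal sketch of how one might locate the offending byte and produce the kind of hex dump asked for above. The helper names first_bad_byte and hex_context and the sample bytes are assumptions for illustration. For a real file, the bytes would come from open(path, "rb").read().

```python
def first_bad_byte(raw: bytes):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def hex_context(raw: bytes, pos: int, width: int = 8) -> str:
    """Hex dump of the bytes around offset pos."""
    chunk = raw[max(0, pos - width): pos + width + 1]
    return " ".join(f"{b:02x}" for b in chunk)


# Hypothetical stand-in for the first lines of the CSV file.
sample = b"StockCode,Description\n1,Gift \xa35 voucher\n"

pos = first_bad_byte(sample)
print(pos, hex_context(sample, pos))

# The same single byte read as Latin-1 and as cp1252.
print(sample[pos:pos + 1].decode("latin-1"), sample[pos:pos + 1].decode("cp1252"))
```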

3 Answers

2

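I had the same problem. It is probably because the file is not UTF-8 encoded. Try to find out what the encoding is. You can do that by opening the file in Notepad++: there is an Encoding menu at the top, which shows which encoding is selected.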
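
If Notepad++ is not at hand, a rough check can be done in Python by trying a few candidate encodings in order. This is only a sketch: the function name and the candidate list are assumptions, and a successful decode does not prove the guess is right. Latin-1 accepts every byte, so it goes last and its result still needs a visual check.

```python
def first_working_encoding(raw: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical sample: a pound sign written by a Windows-1252 program.
raw = "Description,Price\nVoucher £5,5\n".encode("cp1252")
print(first_working_encoding(raw))
```

For a real file, read the bytes with open(path, "rb").read() and pass the returned name to read_csv as encoding=...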

- Михаил Полещук

1

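Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. - Community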

0

I have run into this (the 0xa3 problem) before, and I believe it is an encoding issue.
If your encoding is set to 'utf-8' or 'gbk', try encoding='ISO-8859-1' instead.
Good luck!
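
A short sketch of that suggestion with pandas. The in-memory bytes and the helper name read_csv_bytes are stand-ins invented for illustration, since the real file is not available here.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV on disk: cp1252 bytes with a pound sign (0xa3).
raw = "Description,UnitPrice\nGift voucher £5,5.0\n".encode("cp1252")


def read_csv_bytes(data: bytes, encoding: str = "ISO-8859-1") -> pd.DataFrame:
    """Parse CSV bytes with an explicit encoding instead of the UTF-8 default."""
    return pd.read_csv(io.BytesIO(data), encoding=encoding)


df = read_csv_bytes(raw)
print(df)
```

With the file from the question, the equivalent call would be pd.read_csv("/content/gdrive/My Drive/data/OnlineRetail.csv", encoding="ISO-8859-1"). It is worth checking afterwards that the £ signs look right.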

- Gelzone

- jwhoakley · Accepted Answer

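I just ran into the same problem. The CSV file was generated by an online service, and when opened in the Atom editor it was reported as UTF-8. But when you count along to the character, it is rendered as "�" where it should be "£". After a find-and-replace of every occurrence of that character, the problem went away.

Good luck.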
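
The same repair can be sketched in Python for a file that is mostly UTF-8 with a few stray single-byte characters. This assumes the stray bytes are cp1252, which is a guess that has to be checked against the data. The handler name cp1252_fallback and the sample bytes are made up for illustration.

```python
import codecs


def _cp1252_fallback(err):
    """Error handler: decode bytes that are not valid UTF-8 as cp1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Bytes undefined in cp1252 become U+FFFD instead of raising.
    return bad.decode("cp1252", errors="replace"), err.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def repair_text(raw: bytes) -> str:
    """Decode as UTF-8, falling back to cp1252 only for the invalid bytes."""
    return raw.decode("utf-8", errors="cp1252_fallback")


# Hypothetical mixed file: one valid UTF-8 pound sign, one stray 0xa3 byte.
mixed = "Price,£5\n".encode("utf-8") + b"Cost,\xa37\n"

print(mixed.decode("utf-8", errors="replace"))  # what the editor shows
print(repair_text(mixed))                       # after the repair
```

Writing the repaired text back with open(path, "w", encoding="utf-8") should give a file that read_csv accepts with its default encoding.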