Pandas:保存到Excel时出现编码问题。

12
我有一个类似于这里提到的问题,但是建议的方法都对我无效。
我有一个中等大小的utf-8 .csv文件,其中包含许多非ascii字符。我正在通过某个列的特定值来分割该文件,然后我想将每个获得的数据框保存为保留字符的.xlsx文件。
但是这不起作用,因为我遇到了一个错误:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 7: ordinal not in range(128)

以下是我尝试的方法:

  1. Using xlsxwriter engine explicitly. This doesn't seem to change anything.
  2. Defining a function (below) to change encoding and throw away bad characters. This also doesn't change anything.

    def changeencode(data):
    cols = data.columns
    for col in cols:
    if data[col].dtype == 'O':
        data[col] = data[col].str.decode('utf-8').str.encode('ascii', 'ignore')
    return data   
    
  3. Changing by hand all the offensive chars to some others. Still no effect (the quoted error was obtained after this change).

  4. Encoding the file as utf-16 (which, I believe, is the correct encoding since I want to be able to manipulate the file from within the excel afterwards) doesn't help either.

我认为问题在于文件本身(因为2和3),但我不知道如何解决。非常感谢任何帮助。以下是文件的开头。

"Submitted","your-name","youremail","phone","miasto","cityCF","innemiasto","languagesCF","morelanguages","wiek","partnerCF","messageCF","acceptance-795","Submitted Login","Submitted From","2015-12-25 14:07:58 +00:00","Zózia kryś","test@tes.pl","4444444","Wrocław","","testujemy polskie znaki","Polski","testujemy polskie znaki","44","test","test","1","Justyna","99.111.155.132",

编辑

一些代码(其中一个版本,没有拆分部分):

import pandas as pd
import string
import xlsxwriter

df = pd.read_csv('path-to-file.csv')

with pd.ExcelWriter ('test.xlsx') as writer:
                df.to_excel(writer, sheet_name = 'sheet1',engine='xlsxwriter')

你已经尝试过 df.to_excel(path, encoding='utf8') 了吗? - Stefan
@Stefan 我已经做了,谢谢关心。为了确保,我刚刚又试了一次。仍然没有任何改变。 - jjj
3个回答

12

据说这是我当时使用的pandas版本中的一个错误。 现在,在pandas版本0.19.2中,下面的代码可以毫无问题地保存来自问题的CSV文件(并且具有正确的编码)。
注意:openpyxl模块必须在您的系统上安装。

import pandas as pd
df = pd.read_csv('Desktop/test.csv')
df.to_excel('Desktop/test.xlsx', encoding='utf8')

@greghor 嗯,奇怪。我刚刚安装了相同的版本,它对我有效。你是否已经安装了 openpyxl - jjj
谢谢您的回复,我已经安装了openpyxl 2-5-0。在苦苦挣扎一段时间后,我注意到如果在加载数据时指定编码df=pd.read_csv("test.csv", encoding="utf-8"),那么它就可以正常工作。 - greg hor

2

尝试将包含非ASCII字符的列进行编码为

df['col'] = df['col'].apply(lambda x: unicode(x))

然后使用编码为'utf8',将文件保存为xlsx格式。


谢谢您的建议,不幸的是这并没有起作用。相同的错误被返回,但现在是由.apply行触发的。 - jjj
你能在这里附上CSV文件的片段吗? - Siva Arasu
抱歉回复晚了。我在问题中发布了足以触发错误的文件部分。您需要更多的东西吗? - jjj

0
如果您使用pandas保存csv文件,然后使用win32com将其转换为Excel,会是怎样的呢?大概是这个样子...
import win32com.client
excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = 0

for x in range(10): 
    f = path + str(x)
    # not showing the pandas dataframe creation
    df.to_csv(f+'.csv')
    wb = excel.Workbooks.Open(f+'.csv')
    wb.SaveAs(f+'.xlsx', 51) #xlOpenXMLWorkbook=51

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接