Decode error when combining pandas and shutil

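I am trying to copy a bunch of directories using shutil.copytree and pandas (the apply function). While checking the logs, I noticed that some files could not be copied because of the following error: [Errno 2] No such file or directory: PATH, even though the path name is valid. On closer inspection, the Í character had been changed to \xb4, which explains why the file could not be found.
I tried following the advice in this post and converting the columns to Unicode. However, that led to the following error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 20: ordinal not in range(128)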
import pandas as pd
import shutil

def copy_files(row):
    try:
        shutil.copy(row['1'], row['2'])
        return 'DONE'
    except Exception as e:
        return str(e)

df = pd.DataFrame({'1':['Y:\project\Test\1\RAÍ.pdf'],'2': 
['Y:\project\Test\2\RAÍ.pdf']})

df['errors'] = df.apply(copy_files, axis=1)

print(df['errors'][0])

I expected the string 'DONE' to be printed, but instead I got the following error message:
[Errno 2] No such file or directory: 'Y:\project\Test\x01\RAI\xcc\x81.pdf'
Edit:
With raw string literals, the code looks like this:
df = pd.DataFrame({r'1':[r'Y:\project\Test\1'],
              '2':[r'Y:\project\Test\2']})

def copy_files(row):
    try:
        shutil.copytree(row['1'], row['2'])
        return 'DONE'
    except Exception as e:
        return str(e)

df['errors'] = df.apply(copy_files, axis=1)

print(df['errors'][0])

I still get the following result:

[('Y:\project\Test\1\RAI\xb4i.pdf', 'Y:\project\Test\2\RAI\xb4i.pdf', "[Errno 2] No such file or directory: 'Y:\\project\\Test\\1\\RAI\xb4i.pdf'")]


I wonder why the error output shows \x01 rather than "1". - Will
1 Answer

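Where do you get these characters from? It looks like your name contains a LATIN CAPITAL LETTER I WITH ACUTE. The problem is that, independently of any encoding, Unicode allows more than one representation for it. It can be U+00CD, or '\xcd' (Normalization Form C, canonical composition), or it can be U+0049 followed by U+0301, or 'I\u0301' (Normalization Form D, canonical decomposition). The NFD form reads as LATIN CAPITAL LETTER I followed by COMBINING ACUTE ACCENT.
The two forms cannot be told apart when printed or displayed, but unfortunately they are different strings for Python and for the file system...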
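To make the difference concrete, here is a minimal sketch using the standard library's unicodedata module. It assumes Python 3, where every string literal is Unicode; the file name is the one from the question.

```python
import unicodedata

nfc = "RA\u00cd.pdf"    # composed: LATIN CAPITAL LETTER I WITH ACUTE
nfd = "RAI\u0301.pdf"   # decomposed: 'I' followed by COMBINING ACUTE ACCENT

# The two strings render identically but compare unequal.
print(nfc == nfd)              # False
print(len(nfc), len(nfd))      # 7 8

# Normalizing converts one form into the other.
print(unicodedata.normalize("NFC", nfd) == nfc)   # True
print(unicodedata.normalize("NFD", nfc) == nfd)   # True

# The UTF-8 bytes of the NFD form contain the \xcc\x81 pair
# that appears in the error message in the question.
print(nfd.encode("utf-8"))     # b'RAI\xcc\x81.pdf'
```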
How to fix: avoid non-ASCII characters in file names. Now you know why...
Workarounds:
  1. Your source contains the NFD form. It is likely that the filesystem contains the NFC form, so you could try:

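    # Raw strings keep the backslashes literal ('\1' would otherwise be read as '\x01');
    # '\xcd' is the composed (NFC) Í, replacing the decomposed 'I' + accent.
    df = pd.DataFrame({'1': [r'Y:\project\Test\1\RA' + '\xcd.pdf'],
                       '2': [r'Y:\project\Test\2\RA' + '\xcd.pdf']})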
    
  2. The bulletproof way is to ask the filesystem what string actually is the filename:

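    import glob
    # Raw string: '\1' is otherwise read as '\x01'; 'RA*' matches both the NFC and NFD spellings.
    l = glob.glob(r'Y:\project\Test\1\RA*.pdf')
    for name in l:
        print(name, [hex(ord(i)) for i in name])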
    

    (Notice the * and not a ?, because in NFD form a single glyph can correspond to more than one character.) That dumps the Unicode code points of all characters as the file system knows them. Provided you later use exactly the same representation, things should go fine.
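The two workarounds can be combined into a small helper that tries each normalization form and keeps whichever one the file system recognizes. This is only a sketch under assumptions: resolve_existing_path is a hypothetical name, not part of pandas or shutil, it assumes Python 3, and it normalizes the whole path at once, so it will not find a path whose components use different forms.

```python
import os
import unicodedata

def resolve_existing_path(path):
    """Return the first normalization variant of `path` that exists, else None.

    Hypothetical helper: tries the path as given, then its NFC form,
    then its NFD form.
    """
    candidates = [
        path,
        unicodedata.normalize("NFC", path),
        unicodedata.normalize("NFD", path),
    ]
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    return None
```

Applied to the source column before copying, for instance with df['1'].map(resolve_existing_path), this leaves None wherever no variant exists, so those rows can be reported instead of failing inside shutil.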

