Pandas read_csv does not honor a regex separator

3

Data:

from io import StringIO
import pandas as pd

s = '''ID,Level,QID,Text,ResponseID,responseText,date_key,last
375280046,S,D3M,Which is your favorite?,D5M0,option 1,2012-08-08 00:00:00,ynot
375280046,S,D3M,How often? (at home, at work, other),D3M0,Work,2010-03-31 00:00:00,okkk
375280046,M,A78,Do you prefer a, b, or c?,A78C,a,2010-03-31 00:00:00,abc
376918925,M,A78,Which ONE (select only one),A78E,Milk,2004-02-02 00:00:00,launch Wed., '''

df = pd.read_csv(StringIO(s), sep=r',(?!\s)')

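Problem: I asked a question here. Now I have run into a new problem. Note that the last line ends with a comma followed by a space. The regex in sep=r',(?!\s)' should ignore any comma that is followed by a space.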

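Question: Is there a way to use pd.read_csv to read the last column literally as "launch Wed.,", where the comma is not a separator/delimiter but an actual comma in the text of the last column?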

Error:

ValueError: Expected 8 fields in line 5, saw 9. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

Expected output:

          ID Level  QID                                  Text ResponseID  \
0  375280046     S  D3M               Which is your favorite?       D5M0   
1  375280046     S  D3M  How often? (at home, at work, other)       D3M0   
2  375280046     M  A78             Do you prefer a, b, or c?       A78C   
3  376918925     M  A78           Which ONE (select only one)       A78E   

  responseText             date_key           last  
0     option 1  2012-08-08 00:00:00           ynot  
1         Work  2010-03-31 00:00:00           okkk  
2            a  2010-03-31 00:00:00            abc  
3         Milk  2004-02-02 00:00:00  launch Wed.,   
2 Answers

7

Let's take a look at this SO post.

Use the regex r',(?=\S)' explained there:

from io import StringIO
import pandas as pd

s = '''ID,Level,QID,Text,ResponseID,responseText,date_key,last
375280046,S,D3M,Which is your favorite?,D5M0,option 1,2012-08-08 00:00:00,ynot
375280046,S,D3M,How often? (at home, at work, other),D3M0,Work,2010-03-31 00:00:00,okkk
375280046,M,A78,Do you prefer a, b, or c?,A78C,a,2010-03-31 00:00:00,abc
376918925,M,A78,Which ONE (select only one),A78E,Milk,2004-02-02 00:00:00,launch Wed., '''

df = pd.read_csv(StringIO(s), sep=r',(?=\S)')

Output:

              ID                                 Level   QID      Text  \
375280046 S  D3M               Which is your favorite?  D5M0  option 1   
          S  D3M  How often? (at home, at work, other)  D3M0      Work   
          M  A78             Do you prefer a, b, or c?  A78C         a   
376918925 M  A78           Which ONE (select only one)  A78E      Milk   

                ResponseID  responseText  date_key          last  
375280046 S  2012-08-08 00             0         0          ynot  
          S  2010-03-31 00             0         0          okkk  
          M  2010-03-31 00             0         0           abc  
376918925 M  2004-02-02 00             0         0  launch Wed.,  
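
As a rough illustration of why a positive lookahead behaves differently from a negative one at the end of a line, the sketch below applies both patterns to the last data row with re.split. It assumes the trailing space has already been removed before the split happens, which is a guess about the parser and not something confirmed here. It only models the regex with the standard re module, not read_csv itself.

```python
import re

# Last data row from the question, with the trailing space removed
# (assumption: the parser sees the line without its trailing whitespace).
row = "376918925,M,A78,Which ONE (select only one),A78E,Milk,2004-02-02 00:00:00,launch Wed.,"

# Negative lookahead: split on a comma NOT followed by whitespace.
# At the end of the string nothing follows, so the final comma still splits.
negative = re.split(r",(?!\s)", row)

# Positive lookahead: split on a comma followed by a non-whitespace character.
# At the end of the string there is no character, so the final comma is kept.
positive = re.split(r",(?=\S)", row)

print(len(negative), repr(negative[-1]))  # 9 ''
print(len(positive), repr(positive[-1]))  # 8 'launch Wed.,'
```

With the negative lookahead the row yields a ninth, empty field; with the positive lookahead the trailing comma stays inside the last field.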

2

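It looks like read_csv strips the whitespace from the end of the string before it tries to identify separators. You can work around this by changing the regex so that a comma sitting at the very end of the string is not treated as a separator either: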

pd.read_csv(StringIO(s), sep=r',(?!\s|\Z)', engine='python')
Out[347]: 
          ID Level  QID                                  Text ResponseID  \
0  375280046     S  D3M               Which is your favorite?       D5M0   
1  375280046     S  D3M  How often? (at home, at work, other)       D3M0   
2  375280046     M  A78             Do you prefer a, b, or c?       A78C   
3  376918925     M  A78           Which ONE (select only one)       A78E   

  responseText             date_key          last  
0     option 1  2012-08-08 00:00:00          ynot  
1         Work  2010-03-31 00:00:00          okkk  
2            a  2010-03-31 00:00:00           abc  
3         Milk  2004-02-02 00:00:00  launch Wed.,  
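
To see which rows a given separator regex would break, one option is to count fields per line with the standard re module. This is only a sketch: the helper name field_counts is made up for illustration, and right-stripping each line before splitting is an assumption that mimics the behavior guessed at above, not pandas' actual parser.

```python
import re

# Sample data from the question (note the trailing ", " on the last row).
s = '''ID,Level,QID,Text,ResponseID,responseText,date_key,last
375280046,S,D3M,Which is your favorite?,D5M0,option 1,2012-08-08 00:00:00,ynot
375280046,S,D3M,How often? (at home, at work, other),D3M0,Work,2010-03-31 00:00:00,okkk
375280046,M,A78,Do you prefer a, b, or c?,A78C,a,2010-03-31 00:00:00,abc
376918925,M,A78,Which ONE (select only one),A78E,Milk,2004-02-02 00:00:00,launch Wed., '''


def field_counts(text, pattern):
    """Return the number of fields on each line after splitting on `pattern`.

    Assumption: every line is right-stripped before it is split.
    """
    regex = re.compile(pattern)
    return [len(regex.split(line.rstrip())) for line in text.splitlines()]


original = field_counts(s, r',(?!\s)')    # separator from the question
fixed = field_counts(s, r',(?!\s|\Z)')    # separator from this answer

print(original)  # [8, 8, 8, 8, 9]
print(fixed)     # [8, 8, 8, 8, 8]
```

Under that assumption, the fifth line is the only one with nine fields for the original separator, which lines up with the "Expected 8 fields in line 5, saw 9" message in the question.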

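Interesting! The docs say: "Matches only at the end of the string." I guess that means pandas reads line by line, which is why \Z works. - Jarad
@Jarad I was only looking at it in the context of the string above when I wrote that, where the comma is the last character, but it does seem to hold up. $ works as well, and \Z still works when the problem line is not at the end (and that line still causes the problem even when it is not at the end). - EFT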
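
The point made in the comments can be checked on the regex level with a small sketch. The two-row list below is a hypothetical reordering of the question's data so that the row ending in a comma comes first, and splitting each stripped line on its own is an assumption about how the parser works, not confirmed behavior.

```python
import re

# Hypothetical reordering: the row ending in ", " is no longer the last one.
lines = [
    "376918925,M,A78,Which ONE (select only one),A78E,Milk,2004-02-02 00:00:00,launch Wed., ",
    "375280046,S,D3M,Which is your favorite?,D5M0,option 1,2012-08-08 00:00:00,ynot",
]


def split_rows(pattern):
    # Assumption: each line is right-stripped and then split independently.
    return [re.split(pattern, line.rstrip()) for line in lines]


with_z = split_rows(r",(?!\s|\Z)")
with_dollar = split_rows(r",(?!\s|$)")

print(with_z == with_dollar)          # True
print([len(row) for row in with_z])   # [8, 8]
print(repr(with_z[0][-1]))            # 'launch Wed.,'
```

When lines are split one at a time, both \Z and $ anchor at the end of each individual line, so the position of the problem row in the file makes no difference in this model.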
