How to preserve leading whitespace in a Python Pandas Series?

Question

4

I am trying to read a text file with Pandas' read_csv function in Python. My text file looks like this (all the values are numeric):

 35 61  7 1 0              # with leading white spaces
  0 1 1 1 1 1              # with leading white spaces
33 221 22 0 1              # without leading white spaces
233   2                    # without leading white spaces
1(01-02),2(02-03),3(03-04) # this line causes 'Error tokenizing data. C error: Expected 1 fields in line 5, saw 3'

My Python code is as follows:

import pandas as pd
df = pd.read_csv('example.txt', header=None)
df

The output is as follows:

CParserError: 'Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

Before dealing with the leading whitespace, I first had to deal with the "Error tokenizing data." problem, so I changed my code as follows:

import pandas as pd
df = pd.read_csv('example.txt', header=None, error_bad_lines=False)
df

I got the data with the leading whitespace as I wanted, but the fifth line has disappeared. The output is as follows:

b'Skipping line 5: expected 1 fields, saw 3\n
 35 61  7 1 0              # with leading white spaces as intended
  0 1 1 1 1 1              # with leading white spaces as intended
33 221 22 0 1              # without leading white spaces
233   2                    # without leading white spaces
                           # 5th line disappeared (not my intention).

So I tried modifying the code as follows to get the fifth line.

import pandas as pd
df = pd.read_csv('example.txt', header=None, sep=':::', engine='python')
df

I successfully got the data on line 5, but the leading whitespace on lines 1 and 2 has disappeared:

35 61  7 1 0               # without leading white spaces(not my intention)
0 1 1 1 1 1                # without leading white spaces(not my intention)
33 221 22 0 1              # without leading white spaces
233   2                    # without leading white spaces
1(01-02),2(02-03),3(03-04) # I successfully got this line as intended.

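I have seen some posts about preserving leading whitespace in strings, but I could not find one about preserving leading whitespace in front of numbers. Thank you for your help.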

- Sang-il Ahn

1

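Please show the code you are using. I cannot reproduce this on v0.22. - cs95

1

Also show your df.dtypes - perhaps you are converting the column to integers, which of course have no concept of whitespace. - John Zwinck

Use dtype=object, and better yet, show your code. - Bharath M Shetty

You have 3 comments asking you to clarify. Can you respond? - cs95

Thank you for your attention. I have revised my question in detail. - Sang-il Ahn

1

@Sang-ilAhn, now you can upvote as well ;-) - MaxU - stand with Ukraine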

1 Answer

- cs95 · Accepted Answer

The key is the separator. If you specify sep as the regular expression ^, the start-of-line metacharacter, this will work.

s = pd.read_csv('example.txt', header=None, sep='^', squeeze=True)

s

0                  35 61  7 1 0
1                   0 1 1 1 1 1
2                 33 221 22 0 1
3                       233   2
4    1(01-02),2(02-03),3(03-04)
Name: 0, dtype: object

s[1]
'  0 1 1 1 1 1'
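
If the regex separator behaves differently in your pandas version (the squeeze keyword, for instance, was removed from read_csv in pandas 2.0), one parser-independent fallback is to read the lines yourself, so that nothing is split or stripped except the line terminator. The following is only a minimal sketch and not part of the answer above; the helper name read_lines_keep_leading and the temporary-file demo are assumptions made for illustration, and the sample omits the trailing comments shown in the question.

```python
import os
import tempfile


def read_lines_keep_leading(path, encoding="utf-8"):
    """Return the lines of a text file with leading whitespace intact.

    Hypothetical helper: only the line terminator is removed, and no
    delimiter splitting takes place, so commas never cause a field error.
    """
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\r\n") for line in handle]


# Sample data modelled on the question's file (trailing comments left out).
sample = (
    " 35 61  7 1 0\n"
    "  0 1 1 1 1 1\n"
    "33 221 22 0 1\n"
    "233   2\n"
    "1(01-02),2(02-03),3(03-04)\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    lines = read_lines_keep_leading(path)

# The second line keeps its two leading spaces.
print(repr(lines[1]))
```

If a Series is needed afterwards, passing the list to pd.Series(lines) wraps it without touching the strings.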