Read a CSV file into a DataFrame with Python Pandas without using a delimiter.

Question

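I am new to the Pandas library.
I have some code that is based on DataFrames.

Is there a way to read a gzip file line by line without any delimiter (using the whole line, which may contain commas and other characters) as a single row, and use it in a DataFrame? It seems a delimiter has to be provided. When I pass "\n" it is able to read the file, but error_bad_lines complains with something like "Skipping line xxx: expected 22 fields, saw 23", because every line is different.

I want each line to be treated as a single row in the DataFrame. How can I achieve this? Any tips would be appreciated.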

- user2418898

Please provide a minimal reproducible example that includes your data. - peer

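You say each line should be treated as a single row, fine, but what about the columns? - Chris Doyle

@ChrisDoyle I only need to load this type of file as 1 column. - user2418898

You could set the delimiter to some combination of characters that will not occur, such as ||| or ^^. - R. Arctor

@R.Arctor I considered that, but technically I have no control over the contents of the data, and it may contain any character. - user2418898

If it is not in CSV format, why load it with read_csv? - Chris Doyle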

1 Answer

- Chris Doyle · Accepted Answer

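If all you want is one row per line and a single column, then do not use read_csv. Just read the file line by line and build the DataFrame from it.

You can do this manually by creating an empty DataFrame with a single column header, then iterating over each line in the file and appending it to the DataFrame.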

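# Explicitly iterate over each line in the file, appending it to the df.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
import pandas as pd

with open("query4.txt") as myfile:
    df = pd.DataFrame([], columns=['line'])
    for line in myfile:
        # Wrap the line in a one-row frame and concatenate it onto df.
        df = pd.concat([df, pd.DataFrame({'line': [line]})], ignore_index=True)
    print(df)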

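This will work for large files, because we only ever process one line at a time while building the DataFrame, so we do not use more memory than needed. It is probably not the most efficient approach, since the DataFrame is reallocated on every iteration, but it will certainly work.

However, we can do this more cleanly, because a pandas DataFrame can take an iterable as its input data.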

#create a list to feed the data to the dataframe.
import pandas as pd
with open("query4.txt") as myfile:
    mydata = [line for line in myfile]
    df = pd.DataFrame(mydata, columns=['line'])
    print(df)

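Here we read all the lines of the file into a list and then pass that list to pandas to create the data. The downside is that if the file is very large, we will hold two copies in memory: the list and the DataFrame.

Since we know pandas accepts an iterable as data, we can use a generator expression to get a generator that feeds each line of the file to the DataFrame. The DataFrame is now built by reading the file one line at a time.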

#create a generator to feed the data to the dataframe.
import pandas as pd
with open("query4.txt") as myfile:
    mydata = (line for line in myfile)
    df = pd.DataFrame(mydata, columns=['line'])
    print(df)
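
The question mentions a gzip file, while the snippets above open a plain text file. A minimal sketch of the same generator approach for gzip input follows, assuming the file is UTF-8 text. The helper name read_gzip_lines and the sample file sample.txt.gz are hypothetical and exist only for illustration.

```python
import gzip
import os
import tempfile

import pandas as pd


def read_gzip_lines(path, column="line", encoding="utf-8"):
    """Load each line of a gzip-compressed text file as one row of a
    single-column DataFrame. (Hypothetical helper, not from the answer.)"""
    # Mode "rt" opens the archive in text mode, so iterating yields str lines.
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return pd.DataFrame((line for line in handle), columns=[column])


# Build a small throwaway gzip file so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as out:
    out.write("a,b,c\nno commas here\nx|y,z\n")

gz_df = read_gzip_lines(sample_path)
print(gz_df)
```

Commas and other would-be delimiters stay inside the single column, because no parsing is applied to the line at all.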

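In all three cases there is no need to use read_csv, because the data being loaded is not in CSV format. Each solution produces the same DataFrame output.

Source data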

this is some data
this is other data
data is fun
data is weird
this is the 5th line

DataFrame

                   line
0   this is some data\n
1  this is other data\n
2         data is fun\n
3       data is weird\n
4  this is the 5th line
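
As the output shows, every row except the last keeps its trailing "\n". If that is not wanted, one option (a sketch that goes beyond the original answer) is to strip line endings after loading. Here io.StringIO stands in for the opened file, and the name clean_df is illustrative.

```python
import io

import pandas as pd

# In-memory stand-in for the source file shown above.
source = io.StringIO(
    "this is some data\n"
    "this is other data\n"
    "data is fun\n"
    "data is weird\n"
    "this is the 5th line"
)

clean_df = pd.DataFrame((line for line in source), columns=["line"])
# Remove trailing newline / carriage-return characters from every row.
clean_df["line"] = clean_df["line"].str.rstrip("\r\n")
print(clean_df)
```

Stripping only "\r" and "\n" leaves any other trailing whitespace in the data untouched.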