Combining CSV files into one file with Pandas in Python

Question


5

I have n files in a directory that I need to combine into one. They all have the same number of columns. For example, test1.csv contains:

test1,test1,test1  
test1,test1,test1  
test1,test1,test1

Similarly, test2.csv contains:

test2,test2,test2  
test2,test2,test2  
test2,test2,test2

I want final.csv to look like this:

test1,test1,test1  
test1,test1,test1  
test1,test1,test1  
test2,test2,test2  
test2,test2,test2  
test2,test2,test2

But the output actually looks like this:

test file 1,test file 1.1,test file 1.2,test file 2,test file 2.1,test file 2.2  
,,,test file 2,test file 2,test file 2  
,,,test file 2,test file 2,test file 2  
test file 1,test file 1,test file 1,,,  
test file 1,test file 1,test file 1,,,

Can anyone help me figure out what is going on here? I have pasted my code below:

import csv
import glob
import pandas as pd
import numpy as np 

all_data = pd.DataFrame() #initializes DF which will hold aggregated csv files

for f in glob.glob("*.csv"): #for all csv files in pwd
    df = pd.read_csv(f) #create dataframe for reading current csv
    all_data = all_data.append(df) #appends current csv to final DF

all_data.to_csv("final.csv", index=None)

- Jack Bauer

Why are you using pandas just to create a single CSV file? - Padraic Cunningham

I'm a beginner and I thought this was the best way. :/ - Jack Bauer

3 Answers

2

You can use concat. Assuming you have a first dataframe df1 and a second dataframe df2, you can do the following:

df = pd.concat([df1,df2],ignore_index=True)

ignore_index is optional; set it to True if you do not care about the original indexes of the individual dataframes.
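
As a minimal sketch of the concat call above, assuming two small frames built by hand (the data is made up for illustration and stands in for the contents of test1.csv and test2.csv):

```python
import pandas as pd

# Hypothetical stand-ins for the frames read from test1.csv and test2.csv
df1 = pd.DataFrame([["test1"] * 3] * 3)
df2 = pd.DataFrame([["test2"] * 3] * 3)

# Stack the rows of df2 under df1; ignore_index=True renumbers the rows 0..5
df = pd.concat([df1, df2], ignore_index=True)

# Without ignore_index, each frame keeps its own row labels 0..2
df_kept = pd.concat([df1, df2])
```

Here df has row labels 0 through 5, while df_kept repeats the labels 0, 1, 2 twice.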

- Fabio Lamanna

1

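This will work if you pass "axis=0" as an argument. - hahdawg

@hahdawg Thanks for pointing that out. Actually, 0 is the default value of axis in concat. - Fabio Lamanna

@JackBauer You're welcome. Please consider accepting one of the answers to help other users. - Fabio Lamanna

I have limited experience with this, so it will take me some time to go through it carefully, but I will definitely do so. - Jack Bauer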

1

pandas is not the tool to use when all you want is to create a single csv file. You can simply write each csv to a new file as you go:

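import glob

# List the inputs first and skip out.csv, so the output is never read as an input
files = [fle for fle in glob.glob("*.csv") if fle != "out.csv"]

with open("out.csv", "w") as out:
    for fle in files:
        with open(fle) as f:
            out.writelines(f)  # copy the file line by line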

Or, if you prefer to use the csv library:

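import glob
import csv

# List the inputs first and skip out.csv, so the output is never read as an input
files = [fle for fle in glob.glob("*.csv") if fle != "out.csv"]

# newline="" lets the csv module handle line endings itself
with open("out.csv", "w", newline="") as out:
    wr = csv.writer(out)
    for fle in files:
        with open(fle, newline="") as f:
            wr.writerows(csv.reader(f))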

Creating one large dataframe just to write it to disk makes no real sense, and if you have a lot of large files it may not even be possible.
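
The streaming idea can also be wrapped in a small helper. This is only a sketch: the name concat_files and its parameters are hypothetical, and it assumes every input file ends with a newline (otherwise the last row of one file would run into the first row of the next).

```python
import glob
import os
import shutil


def concat_files(pattern, out_path):
    """Append every file matching pattern to out_path, skipping out_path itself."""
    out_abs = os.path.abspath(out_path)
    # sorted() gives a stable order; glob itself does not promise one
    sources = [p for p in sorted(glob.glob(pattern))
               if os.path.abspath(p) != out_abs]
    with open(out_path, "wb") as out:
        for path in sources:
            with open(path, "rb") as src:
                # Copy in fixed-size chunks, so memory use stays small
                shutil.copyfileobj(src, out)
    return sources
```

A call such as concat_files("*.csv", "out.csv") never loads a whole file into memory, which is the point made above about large inputs.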

- Padraic Cunningham

No worries. pandas is a great tool if you want to do some calculations on the data, but it is not the tool for concatenating a few files into one. - Padraic Cunningham


- jezrael · Accepted Answer

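I think there are more problems:

I removed import csv and import numpy as np, because they are not used in this demo (but maybe they are used in some missing lines, in which case they can be imported).
I created a list of all dataframes, dfs, to which each dataframe is added with dfs.append(df). Then I used the function concat to join this list into the final dataframe.
In the function read_csv I added the parameter header=None, because the main problem was that read_csv reads the first row as the header.
In the function to_csv I added the parameter header=None to omit the header.
I added the folder test to the final destination file, because with glob.glob("*.csv") the output file would otherwise be read as an input file.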
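
To see why header=None matters, here is a small sketch that reads the same text both ways. The in-memory string is an assumption standing in for test1.csv from the question.

```python
import io
import pandas as pd

# Hypothetical in-memory copy of test1.csv from the question
csv_text = "test1,test1,test1\ntest1,test1,test1\ntest1,test1,test1\n"

# Default: the first row becomes the column names, leaving two data rows.
# Duplicate names are made unique by adding .1, .2 suffixes.
with_header = pd.read_csv(io.StringIO(csv_text))
print(list(with_header.columns))  # ['test1', 'test1.1', 'test1.2']
print(len(with_header))           # 2

# header=None: columns are numbered 0, 1, 2 and all three rows are data
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(no_header.columns))    # [0, 1, 2]
print(len(no_header))             # 3
```

With the default, each file ends up with its own column names, so combining the frames aligns on those names and fills the gaps with empty values. That matches the six-column output shown in the question.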

Solution:

import glob
import pandas as pd

# list that collects one dataframe per csv file
dfs = []
for f in glob.glob("*.csv"): # for all csv files in the working directory
    # header=None: treat the first row as data, not as column names
    df = pd.read_csv(f, header=None)
    dfs.append(df) # add the current dataframe to the list
all_data = pd.concat(dfs, ignore_index=True) # stack all rows into one dataframe
print(all_data)
#       0      1      2
#0  test1  test1  test1
#1  test1  test1  test1
#2  test1  test1  test1
#3  test2  test2  test2
#4  test2  test2  test2
#5  test2  test2  test2
# the folder test must already exist
all_data.to_csv("test/final.csv", index=None, header=None)

The next solution is similar. I added the parameter header=None to read_csv and to_csv, and the parameter ignore_index=True to append.

import glob
import pandas as pd

all_data = pd.DataFrame() # initializes DF which will hold aggregated csv files

for f in glob.glob("*.csv"): # for all csv files in the working directory
    df = pd.read_csv(f, header=None) # read current csv; the first row is data
    # DataFrame.append was deprecated in pandas 1.4 and removed in 2.0,
    # so this variant needs pandas < 2.0; newer versions need pd.concat instead
    all_data = all_data.append(df, ignore_index=True)
print(all_data)
#       0      1      2
#0  test1  test1  test1
#1  test1  test1  test1
#2  test1  test1  test1
#3  test2  test2  test2
#4  test2  test2  test2
#5  test2  test2  test2

# the folder test must already exist
all_data.to_csv("test/final.csv", index=None, header=None)
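
The points from this answer can be collected into one function. This is a sketch under stated assumptions, not a definitive implementation: merge_csvs and its parameters are hypothetical names, the inputs are assumed to have no header row, and sorting the paths is an added choice so that test1.csv comes before test2.csv.

```python
import glob
import os
import pandas as pd


def merge_csvs(src_dir, out_path):
    """Concatenate the header-less CSV files in src_dir into out_path."""
    out_abs = os.path.abspath(out_path)
    # Sort for a predictable row order and never read the output as an input
    paths = [p for p in sorted(glob.glob(os.path.join(src_dir, "*.csv")))
             if os.path.abspath(p) != out_abs]
    # pd.concat raises ValueError if no input files were found
    frames = [pd.read_csv(p, header=None) for p in paths]
    merged = pd.concat(frames, ignore_index=True)
    # Write the rows only: no index column and no header line
    merged.to_csv(out_path, index=False, header=False)
    return merged
```

Because the output path is filtered out explicitly, final.csv can live in the same folder as the inputs and the function can be run repeatedly.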