Import multiple CSV files into pandas and concatenate into one DataFrame

Question

757

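I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out, though. Here is what I have so far: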

import glob
import pandas as pd

# Get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

I guess I need some help within the for loop?

- jonas

20 Answers

- mjspier · Answer 1

Another one-liner using a list comprehension, which lets you pass arguments to read_csv.

import os
import pandas as pd

df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])
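
To make the point about read_csv arguments concrete, here is a minimal sketch of the same one-liner with keyword arguments passed inside the comprehension. The temporary directory, the semicolon-separated sample data, and the particular `sep` and `dtype` values are assumptions chosen for illustration; they are not part of the answer above.

```python
import os
import tempfile

import pandas as pd

# Build a throwaway directory with two small semicolon-separated files (made-up data).
tmp_dir = tempfile.mkdtemp()
for name, rows in [("a.csv", "id;value\n1;10\n2;20\n"), ("b.csv", "id;value\n3;30\n")]:
    with open(os.path.join(tmp_dir, name), "w") as fh:
        fh.write(rows)

# Same list-comprehension shape, with read_csv keyword arguments applied to every file.
df = pd.concat(
    [
        pd.read_csv(os.path.join(tmp_dir, f), sep=";", dtype={"id": "int64"})
        for f in sorted(os.listdir(tmp_dir))  # sorted() makes the row order deterministic
        if f.endswith(".csv")
    ],
    ignore_index=True,
)
print(df)
```

Sorting the listing is optional, but os.listdir returns names in arbitrary order, so it helps when the row order matters.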

- Nim J · Answer 2

If the CSV files are compressed into a single zip archive, you can use zipfile to read them all and concatenate them as follows:

import zipfile
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')

# Read every CSV member of the archive (skips directories and other file types)
train = [pd.read_csv(ziptrain.open(f)) for f in ziptrain.namelist() if f.endswith('.csv')]

df = pd.concat(train)

- Henrik · Answer 3

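An alternative using the pathlib library (often preferred over os.path).

This method avoids calling pandas concat()/append() repeatedly.

From the pandas documentation:
It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.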

import pandas as pd
from pathlib import Path

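data_dir = Path("../relevant_directory")  # avoids shadowing the built-in dir()

# Read each file lazily, then concatenate once
frames = (pd.read_csv(f) for f in data_dir.glob("*.csv"))
df = pd.concat(frames)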
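
The difference between concatenating once and concatenating inside a loop can be sketched as follows. This is only an illustration under assumptions: the temporary directory, the `part_*.csv` file names, and the column `x` are made up. Both patterns give the same result, but the loop copies the accumulated data on every pass.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create three tiny sample files in a temporary directory (hypothetical data).
data_dir = Path(tempfile.mkdtemp())
for i in range(3):
    pd.DataFrame({"x": [i, i + 10]}).to_csv(data_dir / f"part_{i}.csv", index=False)

files = sorted(data_dir.glob("*.csv"))

# Pattern the documentation warns about: every concat copies all rows gathered so far.
looped = pd.read_csv(files[0])
for f in files[1:]:
    looped = pd.concat([looped, pd.read_csv(f)], ignore_index=True)

# Recommended pattern: read every file, then concatenate a single time.
single = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

print(single.equals(looped))
```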

- Paul Rougieux · Answer 4

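Based on Sid's good answer.

Identifying issues of missing or unaligned columns

Before concatenating, you can load the CSV files into an intermediate dictionary that gives access to each dataset by file name (in the form dict_of_df['filename.csv']). Such a dictionary can help you identify issues with heterogeneous data formats, for example when column names are not aligned.

Import the modules and locate the file paths: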

import os
import glob
import pandas
from collections import OrderedDict
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

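Note: OrderedDict is not necessary, but it keeps the order of the files, which might be useful for analysis. (Plain dicts also preserve insertion order since Python 3.7.)

Load the CSV files into a dictionary, then concatenate: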

dict_of_df = OrderedDict((f, pandas.read_csv(f)) for f in filenames)
pandas.concat(dict_of_df, sort=True)

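The keys are the file names f and the values are the DataFrame contents of the CSV files.

Instead of using f as the dictionary key, you can also use os.path.basename(f) or other os.path methods to reduce the key to only the smaller part that is relevant.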
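
One way such a dictionary can be used to spot unaligned columns is sketched below. It assumes in-memory stand-ins for three files (the names `jan.csv`, `feb.csv`, `mar.csv` and their contents are hypothetical) and treats the most common column layout as the reference. That choice is an assumption of this sketch, not something the answer prescribes.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-ins for three files; the second has a misspelled column name.
raw = {
    "jan.csv": "date,amount\n2020-01-01,1\n",
    "feb.csv": "date,amont\n2020-02-01,2\n",
    "mar.csv": "date,amount\n2020-03-01,3\n",
}
dict_of_df = {name: pd.read_csv(io.StringIO(text)) for name, text in raw.items()}

# Use the most common column layout as the reference layout.
layouts = Counter(tuple(df.columns) for df in dict_of_df.values())
expected = set(layouts.most_common(1)[0][0])

# Report, per file, which reference columns are missing and which columns are unexpected.
mismatched = {
    name: {
        "missing": sorted(expected - set(df.columns)),
        "extra": sorted(set(df.columns) - expected),
    }
    for name, df in dict_of_df.items()
    if set(df.columns) != expected
}
print(mismatched)

# Concatenating the dict keeps the file name as the outer index level.
combined = pd.concat(dict_of_df, sort=True)
```

After fixing or renaming the offending columns, the same pandas.concat call produces an aligned frame.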

- Gonçalo Peres · Answer 5

import os

os.system("awk '(NR == 1) || (FNR > 1)' file*.csv > merged.csv")

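Here NR and FNR are awk's line counters: NR is the number of the line being processed, counted across all input files, and FNR is the line number within the current file.

NR == 1 matches the first line of the first file (the header), while FNR > 1 skips the first line of each subsequent file.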
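
For readers less familiar with awk, the same filter can be sketched in plain Python. The variables `nr` and `fnr` mirror awk's NR and FNR; the temporary directory and the two sample files are assumptions made for illustration.

```python
import os
import tempfile

# Write two small sample files that share the same header (made-up data).
tmp_dir = tempfile.mkdtemp()
inputs = []
for name, text in [("file1.csv", "a,b\n1,2\n"), ("file2.csv", "a,b\n3,4\n")]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as fh:
        fh.write(text)
    inputs.append(path)

merged_path = os.path.join(tmp_dir, "merged.csv")
nr = 0  # like awk's NR: line count across all files
with open(merged_path, "w") as out:
    for path in inputs:
        with open(path) as fh:
            # fnr is like awk's FNR: it restarts at 1 for every file
            for fnr, line in enumerate(fh, start=1):
                nr += 1
                # Keep the very first line overall (the header) and every non-header line.
                if nr == 1 or fnr > 1:
                    out.write(line)

with open(merged_path) as fh:
    merged_text = fh.read()
print(merged_text)
```

Like the awk command, this assumes every file has exactly one header line and the same column order.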

- Chasing Unicorn - Anshu · Answer 6

If you run into an unnamed-column issue, use the following code to merge multiple CSV files along axis 0 (stacking rows).

import glob
import os
import pandas as pd

merged_df = pd.concat([pd.read_csv(csv_file, index_col=0, header=0) for csv_file in glob.glob(
        os.path.join("data/", "*.csv"))], axis=0, ignore_index=True)

merged_df.to_csv("merged.csv")

- westandskif · Answer 7

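Consider using the convtools library, which provides many data-processing primitives and generates simple ad hoc code under the hood. It is not supposed to be faster than pandas/polars, but sometimes it can be.

For example, you could concatenate the CSV files into one file for further reuse. Here is the code: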

import glob

from convtools import conversion as c
from convtools.contrib.tables import Table
import pandas as pd


def test_pandas():
    df = pd.concat(
        (
            pd.read_csv(filename, index_col=None, header=0)
            for filename in glob.glob("tmp/*.csv")
        ),
        axis=0,
        ignore_index=True,
    )
    df.to_csv("out.csv", index=False)
# took 20.9 s


def test_convtools():
    table = None
    for filename in glob.glob("tmp/*.csv"):
        table_ = Table.from_csv(filename, header=False)
        if table is None:
            table = table_
        else:
            table = table.chain(table_)

    table.into_csv("out_convtools.csv", include_header=False)
# took 15.8 s

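Of course, if you only want to obtain a DataFrame without writing a concatenated file, it takes 4.63 s and 10.9 s respectively (pandas is faster here because it does not need to zip the columns to write them back).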

- neha · Answer 8

You can also do it this way:

import pandas as pd
import os

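csv_folder_path = "path/to/csv_folder"  # placeholder: set this to your directory

frames = []
for root, dirs, files in os.walk(csv_folder_path):
    for file in files:
        if file.endswith(".csv"):
            # Join with the directory currently being walked so nested files resolve
            frames.append(pd.read_csv(os.path.join(root, file)))

# DataFrame.append was removed in pandas 2.0; concatenate once instead
new_df = pd.concat(frames, ignore_index=True)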


new_df.shape

- Shaina Raza · Answer 9

Here is how you can do it using Colaboratory and Google Drive:

import pandas as pd
import glob

path = r'/content/drive/My Drive/data/actual/comments_only' # Use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True,sort=True)
frame.to_csv('/content/drive/onefile.csv')

- YASH GUPTA · Answer 10

import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
file_path_list = glob.glob(path + "/*.csv")

file_iter = iter(file_path_list)

list_df_csv = []
list_df_csv.append(pd.read_csv(next(file_iter)))

for file in file_iter:
    list_df_csv.append(pd.read_csv(file, header=0))
df = pd.concat(list_df_csv, ignore_index=True)