Merging multiple CSV files into one large file and writing it out (Pandas)

Question

5

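I currently have a directory, call it /mydir, that contains 36 CSV files. Each file is 2.1 GB and they all have the same dimensions. I want to read them into pandas, concatenate them side by side (so the number of rows stays the same), and write the resulting dataframe out as one large CSV. The code I use works for combining a few of the files, but after a certain point I get a memory error. I would like to know whether there is a more efficient way to do this than what I have.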

import os
import pandas as pd

df = pd.DataFrame()
for file in os.listdir('/mydir'):
    df = pd.concat([df, pd.read_csv('/mydir/' + file, dtype='float')], axis=1)
df.to_csv('mydir/file.csv')

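It has been suggested that I break this into smaller pieces, in groups of 6 files, and then combine those groups in turn, but I don't know whether that is a valid solution that would avoid the memory error.

Edit: view of the directory: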

-rw-rw---- 1 m2762 2.1G Jul 11 10:35 2010.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:32 2001.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:28 1983.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 2009.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 1991.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:07 2000.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:06 1982.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 1990.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 2008.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:55 1999.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:54 1981.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 2007.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1998.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1989.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1980.csv

- JSolomonCulp

2

Do all the files have the same number of rows? - MaxU - stand with Ukraine

1

How about using the Linux paste tool: paste -d',' *.csv > result.csv? - MaxU - stand with Ukraine

Could you post the output of ls /mydir/*.csv? - MaxU - stand with Ukraine

1

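What do you plan to do with this 72 GB CSV file once you have it? It might be better to convert it to a database file (or an h5 file) and use the Blaze library. Alternatively, do you need all the columns from every file? By selecting only a few columns you can reduce the memory footprint. - Corley Brigman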
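
As a rough illustration of the column-selection idea in the comment above, here is a minimal sketch using the usecols and dtype parameters of pd.read_csv. The column names and the in-memory sample data are made up for the example; nothing is known about the columns in the asker's files.

```python
import io

import pandas as pd

# Hypothetical stand-in for one of the large files; the column names are invented.
csv_text = "id,a,b,c\n0,1.0,2.0,3.0\n1,4.0,5.0,6.0\n"

# usecols restricts parsing to the listed columns, so "b" and "c" are never
# loaded into the DataFrame. float32 uses half the memory of float64 per value.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "a"],
    index_col="id",
    dtype={"a": "float32"},
)
```

Whether float32 is acceptable depends on the precision the data needs, which the thread does not say.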

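I would definitely pay attention to @CorleyBrigman's comment. I would convert CSV files this large to HDF5: it is more reliable, faster, preserves dtypes, supports compression, and allows reading data conditionally... - MaxU - stand with Ukraine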

2 Answers

0

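Assuming the answer to MaxU's question is that all files have the same number of rows, and further assuming that minor CSV differences such as quoting are handled the same way in every file, you do not need Pandas for this. Plain file readline() calls will give you strings that you can join and write out. Further assume that you can supply the number of rows. Something like this: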

    numrows = 999  # whatever; probably pass as an argument to a function or on the command line
    outfile = open('myout.csv', 'w')
    infile_names = ['file01.csv',
                    'file02.csv',
                    # ..
                    'file36.csv']

    # open all the input files
    infiles = []
    for fname in infile_names:
        infiles.append(open(fname))

    for i in range(numrows):
        # read one line from each input file and join the pieces with commas
        pieces = []
        for infile2read in infiles:
            pieces.append(infile2read.readline().strip())
        out_csv = ','.join(pieces) + '\n'

        # write this row's data out to the output file
        outfile.write(out_csv)

    # close the files
    for f in infiles:
        f.close()
    outfile.close()
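
A possible variation on the loop above, offered as a sketch rather than as part of the original answer: it reads until the files run out instead of requiring the row count up front, and it raises an error if the files turn out to have different lengths. The function name paste_csv and the demo file names are hypothetical. Like the code above, it assumes no quoted field contains an embedded newline.

```python
import os
import tempfile
from contextlib import ExitStack


def paste_csv(in_paths, out_path):
    """Join the files in in_paths side by side, line by line, into out_path.

    Returns the number of rows written. Raises ValueError if the inputs
    do not all have the same number of lines.
    """
    rows = 0
    with ExitStack() as stack:
        # ExitStack closes every file that was opened, even if an error occurs.
        infiles = [stack.enter_context(open(p, newline="")) for p in in_paths]
        out = stack.enter_context(open(out_path, "w", newline=""))
        while True:
            lines = [f.readline() for f in infiles]
            if not any(lines):  # readline() returns "" only at end of file
                break
            if not all(lines):  # some files ended before the others
                raise ValueError("input files have different line counts")
            out.write(",".join(line.rstrip("\r\n") for line in lines) + "\n")
            rows += 1
    return rows


# Small demonstration with made-up data in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_inputs = []
for name, text in (("left.csv", "a,b\n1,2\n"), ("right.csv", "c\n3\n")):
    path = os.path.join(demo_dir, name)
    with open(path, "w", newline="") as fh:
        fh.write(text)
    demo_inputs.append(path)

demo_out = os.path.join(demo_dir, "merged.csv")
demo_rows = paste_csv(demo_inputs, demo_out)
```

Only one line per input file is held in memory at a time, so memory use does not grow with file size.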

- verisimilidude

- piRSquared · Accepted Answer

Chunk them!

from glob import glob
import os

import pandas as pd

# grab files
files = glob('./[0-9][0-9][0-9][0-9].csv')

# simplify the file reading
# notice this will create a generator
# that goes through chunks of the file
# at a time
def read_csv(f, n=100):
    return pd.read_csv(f, index_col=0, chunksize=n)

# simplify the concatenation
def concat(lot):
    return pd.concat(lot, axis=1)

# simplify the writing
# make sure mode is append and header is off
# if file already exists
def to_csv(f, df):
    if os.path.exists(f):
        mode = 'a'
        header = False
    else:
        mode = 'w'
        header = True
    df.to_csv(f, mode=mode, header=header)

# Fun stuff! zip will take the next element of the generator
# for each generator created for each file
# concat one chunk at a time and write
for lot in zip(*[read_csv(f, n=10) for f in files]):
    to_csv('out.csv', concat(lot))
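
To check the chunked approach on something small before running it on the real files, a self-contained sketch along these lines may help. The function name merge_columns_chunked, the temporary files, and their contents are made up for the demonstration. Unlike the to_csv helper above, this version overwrites the output on the first chunk instead of testing whether the file exists, so a leftover out.csv from an earlier run is not appended to.

```python
import os
import tempfile

import pandas as pd


def merge_columns_chunked(in_paths, out_path, chunksize=2):
    """Concatenate CSV files side by side, reading chunksize rows at a time."""
    # Each reader is an iterator that yields DataFrames of up to chunksize rows.
    readers = [pd.read_csv(p, index_col=0, chunksize=chunksize) for p in in_paths]
    first = True
    # zip pulls the next chunk from every reader; it stops at the shortest file.
    for lot in zip(*readers):
        chunk = pd.concat(lot, axis=1)
        # Write the header once, then append the remaining chunks without it.
        chunk.to_csv(out_path, mode="w" if first else "a", header=first)
        first = False


# Tiny demonstration with made-up data in a temporary directory.
tmp = tempfile.mkdtemp()
paths = []
for name in ("1980", "1981"):
    path = os.path.join(tmp, name + ".csv")
    pd.DataFrame({name: range(5)}).to_csv(path)
    paths.append(path)

out_path = os.path.join(tmp, "out.csv")
merge_columns_chunked(paths, out_path)
result = pd.read_csv(out_path, index_col=0)
```

Because zip stops at the shortest input, files with differing row counts would be silently truncated; this sketch assumes, as the question states, that all files have the same dimensions.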