Splitting one CSV file into multiple files

31

I have a CSV file with roughly 5000 rows and I want to split it into five files in Python.

I wrote some code, but it does not work.

import codecs
import csv
NO_OF_LINES_PER_FILE = 1000
def again(count_file_header,count):
    f3 = open('write_'+count_file_header+'.csv', 'at')
    with open('import_1458922827.csv', 'rb') as csvfile:
        candidate_info_reader = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_ALL)
        co = 0      
        for row in candidate_info_reader:
            co = co + 1
            count  = count + 1
            if count <= count:
                pass
            elif count >= NO_OF_LINES_PER_FILE:
                count_file_header = count + NO_OF_LINES_PER_FILE
                again(count_file_header,count)
            else:
                writer = csv.writer(f3,delimiter = ',', lineterminator='\n',quoting=csv.QUOTE_ALL)
                writer.writerow(row)

def read_write():
    f3 = open('write_'+NO_OF_LINES_PER_FILE+'.csv', 'at')
    with open('import_1458922827.csv', 'rb') as csvfile:


        candidate_info_reader = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_ALL)

        count = 0       
        for row in candidate_info_reader:
            count  = count + 1
            if count >= NO_OF_LINES_PER_FILE:
                count_file_header = count + NO_OF_LINES_PER_FILE
                again(count_file_header,count)
            else:
                writer = csv.writer(f3,delimiter = ',', lineterminator='\n',quoting=csv.QUOTE_ALL)
                writer.writerow(row)

read_write()

The above code creates a lot of files with empty content.

How do I split one file into five CSV files?

13 Answers

50

In Python

Use readlines() and writelines() to do it. Here is an example:

>>> csvfile = open('import_1458922827.csv', 'r').readlines()
>>> filename = 1
>>> for i in range(len(csvfile)):
...     if i % 1000 == 0:
...         open(str(filename) + '.csv', 'w+').writelines(csvfile[i:i+1000])
...         filename += 1

The output files will be numbered 1.csv, 2.csv, and so on.

From the terminal

You can do this with the split command line tool:

$ split -l 1000 import_1458922827.csv
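
If you also want numbered output files that keep the .csv extension, GNU split can do that directly; a sketch assuming GNU coreutils (the part_ prefix is arbitrary):

$ split -l 1000 -d --additional-suffix=.csv import_1458922827.csv part_

This produces part_00.csv, part_01.csv, and so on. Note that neither variant repeats the header row in each piece.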

5
What if the file is 5003 lines long? Won't you miss the last three lines? - Rudziankoŭ
2
What do you do if the CSV file is too large to fit in memory? - mjwrazor
2
Nice, clean solution! Thanks! And no need to worry, this solution does not miss any lines at the end of the file. - fachexot
2
The only thing missing is repeating the CSV header in each file, otherwise it won't work properly later on. - Guillaume
Note that this code cannot handle fields that contain \n characters inside the string delimiters. - Nikit

38

I suggest you don't reinvent the wheel. There is an existing solution. Source code here:

import os


def split(filehandler, delimiter=',', row_limit=1000,
          output_name_template='output_%s.csv', output_path='.', keep_headers=True):
    import csv
    reader = csv.reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(
        output_path,
        output_name_template % current_piece
    )
    current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=delimiter)
    current_limit = row_limit
    if keep_headers:
        headers = reader.next()
        current_out_writer.writerow(headers)
    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = row_limit * current_piece
            current_out_path = os.path.join(
                output_path,
                output_name_template % current_piece
            )
            current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=delimiter)
            if keep_headers:
                current_out_writer.writerow(headers)
        current_out_writer.writerow(row)

Usage:

split(open('/your/pat/input.csv', 'r'));

1
If blank lines between rows are a problem, just replace "w" with "wb" in the file write object. - Qaisar Rajput
This code does not wrap column values in double quotes when the delimiter is a comma ",". I added the option quoting=csv.QUOTE_ALL to csv.writer, but it did not solve my problem. - Rfreak
8
@alexf I ran into the same error and fixed it by changing headers = reader.next() to headers = next(reader). - Jim
To add to @Jim's comment, this is due to a difference between Python 2 and Python 3. In Python 3, use the built-in function next, i.e. write next(reader) instead of reader.next(). You should also open the file in text mode. - drewipson
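
Pulling these comments together, a Python 3 adaptation of the split() function above might look like this (split_py3 is just an illustrative name; the input path is the one from the question):

import csv
import os

def split_py3(filehandler, delimiter=',', row_limit=1000,
              output_name_template='output_%s.csv', output_path='.', keep_headers=True):
    # Same logic as split() above, adjusted per the comments: next(reader) instead of
    # reader.next(), and output files opened in text mode with newline='' so no blank
    # lines appear between rows.
    reader = csv.reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(output_path, output_name_template % current_piece)
    current_out_writer = csv.writer(open(current_out_path, 'w', newline=''), delimiter=delimiter)
    current_limit = row_limit
    if keep_headers:
        headers = next(reader)
        current_out_writer.writerow(headers)
    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = row_limit * current_piece
            current_out_path = os.path.join(output_path, output_name_template % current_piece)
            current_out_writer = csv.writer(open(current_out_path, 'w', newline=''), delimiter=delimiter)
            if keep_headers:
                current_out_writer.writerow(headers)
        current_out_writer.writerow(row)

# call it with a text-mode file handle
with open('import_1458922827.csv', 'r', newline='') as f:
    split_py3(f, row_limit=1000)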

7

A solution that works with Python 3:

import csv
import os


def split_csv(source_filepath, dest_folder, split_file_prefix,
              records_per_file):
    """
    Split a source csv into multiple csvs of equal numbers of records,
    except the last file.

    Includes the initial header row in each split file.

    Split files follow a zero-index sequential naming convention like so:

        `{split_file_prefix}_0.csv`
    """
    if records_per_file <= 0:
        raise Exception('records_per_file must be > 0')

    with open(source_filepath, 'r') as source:
        reader = csv.reader(source)
        headers = next(reader)

        file_idx = 0
        records_exist = True

        while records_exist:

            i = 0
            target_filename = f'{split_file_prefix}_{file_idx}.csv'
            target_filepath = os.path.join(dest_folder, target_filename)

            with open(target_filepath, 'w') as target:
                writer = csv.writer(target)

                while i < records_per_file:
                    if i == 0:
                        writer.writerow(headers)

                    try:
                        writer.writerow(next(reader))
                        i += 1
                    except StopIteration:
                        records_exist = False
                        break

            if i == 0:
                # we only wrote the header, so delete that file
                os.remove(target_filepath)

            file_idx += 1
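
The answer does not show a call; usage might look like this (the destination folder, prefix, and row count are just examples):

split_csv('import_1458922827.csv', '.', 'output', 1000)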

2
I turned this into a runnable script. In case anyone else is trying to do the same thing, here it is. - travisw

6
A simple Python 3 solution using Pandas that does not truncate the last batch.
def to_csv_batch(src_csv, dst_dir, size=30000, index=False):

    import pandas as pd
    import math
    
    # Read source csv
    df = pd.read_csv(src_csv)
    
    # Initial values
    low = 0
    high = size

    # Loop through batches
    for i in range(math.ceil(len(df) / size)):

        fname = dst_dir+'/Batch_' + str(i+1) + '.csv'
        df[low:high].to_csv(fname, index=index)
        
        # Update selection
        low = high
        if (high + size < len(df)):
            high = high + size
        else:
            high = len(df)

Usage example:

to_csv_batch('Batch_All.csv', 'Batches')
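
To get exactly five files, as asked in the question, you could compute the batch size first; a sketch assuming the to_csv_batch above and the same example file names:

import math
import pandas as pd

n_rows = len(pd.read_csv('Batch_All.csv'))  # number of data rows (header excluded)
to_csv_batch('Batch_All.csv', 'Batches', size=math.ceil(n_rows / 5))  # five roughly equal parts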

5

I modified the accepted answer slightly to make it simpler and easier to understand.

Edit: Added the import statements and fixed the statement that prints the exception message. @Alex F's code snippet was written for Python 2; for Python 3 you also need to use header_row = rows.__next__() instead of header_row = rows.next(). Thanks for pointing it out.

import os
import csv
def split_csv_into_chunks(file_location, out_dir, file_size=2):
    count = 0
    current_piece = 1

    # file_to_split_name.csv
    file_name = file_location.split("/")[-1].split(".")[0]
    split_file_name_template = file_name + "__%s.csv"
    splited_files_path = []

    if not os.path.exists(out_dir):
        os.makedirs(out_dir)
    try:
        with open(file_location, "rb") as csv_file:
            rows = csv.reader(csv_file, delimiter=",")
            headers_row = rows.next()
            for row in rows:
                if count % file_size == 0:
                    current_out_path = os.path.join(out_dir,
                                                    split_file_name_template%str(current_piece))
                    current_out_writer = None

                    current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=",")
                    current_out_writer.writerow(headers_row)
                    splited_files_path.append(current_out_path)
                    current_piece += 1

                current_out_writer.writerow(row)
                count += 1
        return True, splited_files_path
    except Exception as e:
        print("Exception occurred as {}".format(e))
        return False, splited_files_path
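
A call might look like this under Python 2, as the snippet is written (the output directory and chunk size are just examples):

split_csv_into_chunks('import_1458922827.csv', 'out_dir', file_size=1000)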

print "Exception occurred as {}".format(e) ^ SyntaxError: invalid syntax - Alex F

4

Another pandas solution (processing 1000 rows at a time), similar to Aziz Alto's solution:

suffix = 1
for i in range(len(df)):
    if i % 1000 == 0:
        df[i:i+1000].to_csv(f"processed/{filename}_{suffix}.csv", sep ='|', index=False, index_label=False)
        suffix += 1

where df is the csv file loaded as a pandas.DataFrame, filename is the original file name, | is the separator, and index and index_label are set to False to skip the auto-incremented index column.
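
For completeness, a self-contained version might look like this, assuming a comma-separated source file as in the question, | as the output separator as above, and an existing processed/ folder (the file name is just an example):

import pandas as pd

filename = 'import_1458922827'
df = pd.read_csv(filename + '.csv')  # load the source csv as a DataFrame

suffix = 1
for i in range(len(df)):
    if i % 1000 == 0:
        df[i:i+1000].to_csv(f"processed/{filename}_{suffix}.csv",
                            sep='|', index=False, index_label=False)
        suffix += 1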


1
Thanks, this is the best and simplest solution I have found for this challenge. - ra67052
1
This is exactly what I was looking for. Thank you. - onxx

3

@Ryan, the Python 3 code worked for me. As shown below, I used newline='' to avoid the blank line issue:

with open(target_filepath, 'w', newline='') as target:

1
Building on the top answer, here is a Python solution that also includes the header in each file.
file = open('file.csv', 'r')
header = file.readline()
csvfile = file.readlines()
filename = 1
batch_size = 1000
for i in range(len(csvfile)):
    if i % batch_size == 0:
        open(str(filename) + '.csv', 'w+').writelines(header)
        open(str(filename) + '.csv', 'a+').writelines(csvfile[i:i+batch_size])
        filename += 1

This outputs the same file names as above: 1.csv, 2.csv, and so on.



1
A simpler script does the job for me.
import pandas as pd

path = "path to file"   # path to the input file
df = pd.read_csv(path)  # read the whole file into a DataFrame

low = 0      # initial lower limit
high = 1000  # initial upper limit
part = 1     # counter used to number the output files
while low < len(df):
    df_new = df[low:high]  # subset the DataFrame based on index
    df_new.to_csv("output_part_" + str(part) + ".csv")  # write this chunk (file name is just an example)
    low = high          # move the lower limit up
    high = high + 1000  # move the upper limit up by 1000
    part += 1
