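Here's a solution that works whether or not the IDs are sorted. The only overhead in the unsorted case is that the destination (group-ID) CSV files get opened multiple times: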
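import csv

prev_id = None
out_file = None
writer = None

with open("test.csv", newline="") as in_file:
    for row in csv.reader(in_file):
        this_id = row[0]
        # The ID changed: close the current output file and switch to this ID's file
        if this_id != prev_id:
            if out_file is not None:
                out_file.close()
            fname = f"file_{this_id}.csv"
            print(f"opening {fname} for appending...")
            # Append mode, so rows written earlier for this ID are kept
            out_file = open(fname, "a", newline="")
            writer = csv.writer(out_file)
            prev_id = this_id
        writer.writerow(row)

# Close the last output file so its rows are flushed to disk
if out_file is not None:
    out_file.close()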
Here's the test input, but now with IDs 1 and 2 interleaved:
1, a1, 0.1
2, b1, 0.1
1, a1, 0.2
2, b1, 0.2
1, a1, 0.4
2, b1, 0.4
1, a1, 0.3
2, b1, 0.3
1, a1, 0.0
2, b1, 0.0
1, a1, 0.9
2, b1, 0.9
When I run it, I see:
./main.py
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
And my output files look like:
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
and
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
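I also created a fake big file, 289MB, with 100 ID groups (250,000 rows per ID), and my solution runs in about 12 seconds. For comparison, the accepted answer that uses groupby() runs in about 10 seconds on the big CSV, and the highly rated awk script takes about 1 minute.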
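If the repeated opening of output files ever becomes a concern, one possible variation is to keep each output file open for the whole run and look its writer up by ID. The following is only a sketch, not something I timed: the function name split_by_id and the out_pattern parameter are made up for illustration, and it assumes the number of distinct IDs stays below the operating system's limit on open files.

```python
import csv
from contextlib import ExitStack


def split_by_id(in_path, out_pattern="file_{}.csv"):
    """Split in_path into one CSV per first-column ID, opening each output once.

    Hypothetical helper: assumes the count of distinct IDs is small enough
    that all output files can stay open at the same time.
    """
    writers = {}  # maps ID -> csv.writer for that ID's output file
    with ExitStack() as stack, open(in_path, newline="") as in_file:
        for row in csv.reader(in_file):
            if not row:
                continue  # skip blank lines
            this_id = row[0]
            writer = writers.get(this_id)
            if writer is None:
                # First time this ID is seen: open its file and remember the writer.
                # ExitStack closes every registered file when the with block ends.
                out_file = stack.enter_context(
                    open(out_pattern.format(this_id), "a", newline="")
                )
                writer = csv.writer(out_file)
                writers[this_id] = writer
            writer.writerow(row)
    return sorted(writers)


if __name__ == "__main__":
    split_by_id("test.csv")
```

The trade-off is more simultaneously open file handles in exchange for a single open per ID; with 100 groups that should be fine on most systems, but with many thousands of IDs the reopen-on-change approach above is the safer choice.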
Is group_id sorted? - senderle