Split a CSV file into multiple files based on a pattern

Question

I have a CSV file with the following structure:

time,magnitude
0,13517
292.5669,370
620.8469,528
0,377
832.3269,50187
5633.9419,3088
20795.0950,2922
21395.6879,2498
21768.2139,647
21881.2049,194
0,3566
292.5669,370
504.1510,712
1639.4800,287
46709.1749,365
46803.4400,500

I want to split this CSV file into multiple CSV files, like this:

File 1:

time,magnitude
0,13517
292.5669,370
620.8469,528

File 2:

time,magnitude
0,377
832.3269,50187
5633.9419,3088
20795.0950,2922
21395.6879,2498

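And so on...

I have read several similar posts (e.g., this, this, or this one), but they all search for specific values in a column and save each group of values to a separate file. In my case, however, the values in the time column differ. I want to split on a condition: if time = 0, save that row and all following rows to a new file, until the next time = 0.

Could someone tell me how to do this?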

- mOna

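You are still looking for a specific value, namely time equal to zero, aren't you? - GIZ

(1) Please add a sample CSV. What you posted is not the raw data you want to process. (2) Do you insist on a Python solution? This looks like a classic job for awk. - dodrg

@GIZ: Yes, but I want to save not only the rows where time = 0, but also every row up to the next time = 0. - mOna

@dodrg: 1) I am not sure how to attach a CSV file here, but my CSV file has exactly the structure I showed in the question. 2) No, any solution is fine :) - mOna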

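You can copy/paste the CSV (or part of it) directly into the question, exactly as it appears in the file. Nothing you have posted here is in CSV format (there are no commas). Instead, you posted a different representation of the data. You should post the raw data as-is. - Code-Apprentice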

4 Answers

datasplit.awk

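#!/usr/bin/awk -f

# The opening brace must sit on the same line as BEGIN.
BEGIN {
    filename = "output_file_"
    fileext = ".csv"
    FS = ","

    c = 0
    file = filename c fileext
}

# The first line is the header: remember it for every output file.
NR == 1 {
    header = $0
    next
}

{
    if ($1 == 0) {
        # time == 0: close the previous file and start the next one
        close(file)
        c = c + 1
        file = filename c fileext
        print header > file
    }
    # ">" truncates only when the file is first opened, so later rows append
    print > file
}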

Make the file executable:

chmod +x datasplit.awk

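Run it from the directory where the output files should be written (the ./ prefix assumes the script is in that directory and not on your PATH):

./datasplit.awk datafile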
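
For readers who would rather stay in Python, the same streaming idea can be sketched with only the standard library. This is a minimal sketch, not part of the original answer: the function name split_csv_on_zero and its parameters are hypothetical, and unlike the awk script it skips any rows that appear before the first time = 0 (an assumption about what is wanted).

```python
import csv
from pathlib import Path


def split_csv_on_zero(src, out_dir, prefix="output_file_"):
    """Stream src and start a new CSV each time the first column equals 0.

    Hypothetical helper: returns the list of files written. Rows that
    appear before the first zero are skipped (an assumption).
    """
    out_dir = Path(out_dir)
    written = []
    out = None       # currently open output file, if any
    writer = None
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)              # first line is the header
        for row in reader:
            if not row:                    # ignore blank lines
                continue
            if float(row[0]) == 0:         # time == 0 starts a new file
                if out is not None:
                    out.close()
                path = out_dir / f"{prefix}{len(written) + 1}.csv"
                out = open(path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                written.append(path)
            if writer is not None:
                writer.writerow(row)
    if out is not None:
        out.close()
    return written
```

Calling split_csv_on_zero("datafile", ".") would write output_file_1.csv, output_file_2.csv, and so on into the current directory. Because it reads one row at a time, it should also cope with inputs too large to load into memory.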

- dodrg

I took the liberty of creating some data similar to yours to test my solution. Also, instead of reading an input CSV file, I used a DataFrame. Here is my solution:

import pandas as pd

# Build a sample DataFrame similar to the data in the question

data = {
   'time': [0, 292.5669, 620.8469, 0, 832.3269, 5633.9419, 20795.0950, 21395.6879, 0, 230.5678, 456.8468, 0, 784.3265, 5445.9452, 20345.0980, 21095.6898],
   'magnitude': [13517, 370, 528, 377, 50187, 3088, 2922, 2498, 13000, 369, 527, 376, 50100, 3087, 2921, 2497]
}

df = pd.DataFrame(data)

# Function to split a DataFrame based on a pattern

def split_dataframe_by_pattern(df, output_prefix):
    file_count = 1
    current_group = pd.DataFrame(columns=df.columns)  # Initialize the current group

    for index, row in df.iterrows():
        if row['time'] == 0 and not current_group.empty:  # If time = 0 and the current group is not empty, create a new file
            output_file = f'{output_prefix}_{file_count}.csv'

            # Save the current group to the new file

            current_group.to_csv(output_file, index=False)
            current_group = pd.DataFrame(columns=df.columns)  # Reset the current group
            file_count += 1

        # Use pandas.concat to append the row to the current group
        current_group = pd.concat([current_group, row.to_frame().T], ignore_index=True)

    # Save the last group to a file

    current_group.to_csv(f'{output_prefix}_{file_count}.csv', index=False)

# Example usage:
output_prefix = 'output_file'
split_dataframe_by_pattern(df, output_prefix)

My output is four CSV files:

output_file_1.csv

time,magnitude
0.0,13517.0
292.5669,370.0
620.8469,528.0

output_file_2.csv

time,magnitude
0.0,377.0
832.3269,50187.0
5633.9419,3088.0
20795.095,2922.0
21395.6879,2498.0

output_file_3.csv

time,magnitude
0.0,13000.0
230.5678,369.0
456.8468,527.0

output_file_4.csv

time,magnitude
0.0,376.0
784.3265,50100.0
5445.9452,3087.0
20345.098,2921.0
21095.6898,2497.0

- cconsta1

You can do this quite easily with pandas, like this:

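import pandas as pd

df = pd.read_csv("mydata.csv")
last_idx = 0  # first row of the block currently being collected
file_idx = 0
for i, time in enumerate(df.time):
    # A zero time (other than on the very first row) closes the previous block
    if time == 0 and i != 0:
        df.iloc[last_idx:i].to_csv(f"mydata_{file_idx}.csv", index=False)
        file_idx += 1
        last_idx = i
# Write the final block
df.iloc[last_idx:].to_csv(f"mydata_{file_idx}.csv", index=False)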
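
The slice boundaries that the loop above discovers one row at a time can also be computed up front. The following is a minimal sketch under stated assumptions: zero_run_slices is a hypothetical helper name, the sample data is abbreviated from the question, and rows before the first zero are dropped here, whereas the loop above keeps them in the first file.

```python
import numpy as np
import pandas as pd


def zero_run_slices(times):
    """Return (start, stop) row positions for each block that begins at time == 0."""
    is_zero = pd.Series(times).eq(0).to_numpy()
    starts = np.flatnonzero(is_zero).tolist()   # positions where time == 0
    stops = starts[1:] + [len(is_zero)]         # each block ends where the next begins
    return list(zip(starts, stops))


df = pd.DataFrame({
    "time": [0, 292.5669, 620.8469, 0, 832.3269, 5633.9419],
    "magnitude": [13517, 370, 528, 377, 50187, 3088],
})

for start, stop in zero_run_slices(df["time"]):
    print(start, stop)                  # row range of one block
    # df.iloc[start:stop].to_csv(...)   # write it out as in the loop above
```

For this sample the blocks are rows 0 to 3 and rows 3 to 6 (stop exclusive), which matches the slices the loop would pass to iloc.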

- TheEngineerProgrammer

- Timeless · Accepted Answer

With pandas, you can use groupby and boolean indexing:

#pip install pandas
import pandas as pd

df = pd.read_csv("input_file.csv", sep=",") # <- change the sep if needed

for n, g in df.groupby(df["time"].eq(0).cumsum()):
    g.to_csv(f"file_{n}.csv", index=False, sep=",")

Output:

    time  magnitude   # <- file_1.csv
  0.0000      13517
292.5669        370
620.8469        528

      time  magnitude # <- file_2.csv
    0.0000        377
  832.3269      50187
 5633.9419       3088
20795.0950       2922
21395.6879       2498
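
To see why the groupby key works, it can help to look at the intermediate values. The sketch below uses a small in-memory DataFrame (abbreviated from the question's data, so no input file is assumed) and the illustrative variable names is_start and group_id, which are not from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "time": [0, 292.5669, 620.8469, 0, 832.3269, 5633.9419],
    "magnitude": [13517, 370, 528, 377, 50187, 3088],
})

is_start = df["time"].eq(0)    # True on each row that opens a new block
group_id = is_start.cumsum()   # running count of starts: 1, 1, 1, 2, 2, 2

# Every row sharing a group_id belongs to the same output file.
groups = {n: g for n, g in df.groupby(group_id)}
for n, g in groups.items():
    print(f"group {n}: {len(g)} rows")
```

Each zero increments the running count, so all rows up to the next zero share one label. If the data did not begin with time = 0, the leading rows would get label 0 and end up in file_0.csv.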