Pandas: filtering a dataframe based on time intervals from another dataframe

If I have a dataframe (df_data) in the following format:
ID        Time                X        Y        Z        H
05  2020-06-26 14:13:16    0.055    0.047    0.039    0.062
05  2020-06-26 14:13:21    0.063    0.063    0.055    0.079
05  2020-06-26 14:13:26    0.063    0.063    0.063    0.079
05  2020-06-26 14:13:31    0.095    0.102    0.079    0.127
...    ..    ...     ...     ...      ...      ...      ...
01  2020-07-01 08:59:43    0.063    0.063    0.047    0.079
01  2020-07-01 08:59:48    0.055    0.055    0.055    0.079
01  2020-07-01 08:59:53    0.071    0.063    0.055    0.082
01  2020-07-01 08:59:58    0.063    0.063    0.047    0.082
01  2020-07-01 08:59:59    0.047    0.047    0.047    0.071

[17308709 rows x 8 columns]

I would like to use the contents of another dataframe of intervals (df_intervals) to filter the contents of that dataframe, for example:

int_id         start               end
1            2020-02-03 18:11:59   2020-02-03 18:42:00
2            2020-02-03 19:36:59   2020-02-03 20:06:59
3            2020-02-03 21:00:59   2020-02-03 21:31:00
4            2020-02-03 22:38:00   2020-02-03 23:08:00
5            2020-02-04 05:55:00   2020-02-04 06:24:59
...                         ...                   ...
1804         2021-01-10 13:50:00   2021-01-10 14:20:00
1805         2021-01-10 18:10:00   2021-01-10 18:40:00
1806         2021-01-10 19:40:00   2021-01-10 20:10:00
1807         2021-01-10 21:25:00   2021-01-10 21:55:00
1808         2021-01-10 22:53:00   2021-01-10 23:23:00

[1808 rows x 2 columns]

What is the most efficient way to do this? I have a huge dataset, and if I try to iterate over it like this:
for i in range(len(intervals)):
    df_filtered = df[df['Time'].between(intervals['start'][i], intervals['end'][i])]
    ...
    ...
    ...

it is far too slow! I know I shouldn't iterate over a very large dataframe, but I don't know how to filter it by each time interval from the second dataframe.

The steps I'm trying to follow are:

1- Get all the time intervals (start/end columns) from df_intervals;

2- Use these intervals to build a new dataframe (df_stats) containing statistics of the columns within each of these time ranges, for example:

      start                  end             ID    X_max    X_min    X_mean    Y_max    Y_min    Y_mean    ....
2020-02-03 18:11:59   2020-02-03 18:42:00    01    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 18:11:59   2020-02-03 18:42:00    02    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 18:11:59   2020-02-03 18:42:00    03    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 18:11:59   2020-02-03 18:42:00    04    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 18:11:59   2020-02-03 18:42:00    05    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 19:36:59   2020-02-03 20:06:59    01    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 19:36:59   2020-02-03 20:06:59    02    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 19:36:59   2020-02-03 20:06:59    03    ...    ...    ...     ...   ...    ...    ...     ...
2 Answers

If Joe's answer doesn't give you the speed you need, I think it can be improved by taking the statistics computation out of the for loop. (I'm reusing his dataframe creation, since he did the heroic work of writing it in his answer.) Ideally you could get rid of the for loop as well, but I think merging the two dataframes could be tricky because the timestamp index repeats (across ID numbers).
Here is my attempt, still iterating over the start/end times. First I apply int_id to the parent df. Adding it to the parent dataframe lets me groupby later instead of building a "temp" dataframe and computing statistics on it.
for index, row in df2.iterrows():
    
    df1.loc[df1.Time.between(row.start,row.end), 'int_id'] = row.int_id

    ID                Time      X      Y      Z      H  int_id
0   01 2020-02-03 18:13:16  0.011  0.012  0.013  0.014     1.0
1   01 2020-02-03 18:13:21  0.015  0.016  0.017  0.018     1.0
2   01 2020-02-03 18:13:26  0.013  0.013  0.013  0.013     1.0
3   01 2020-02-03 18:13:31  0.015  0.015  0.015  0.015     1.0
4   02 2020-02-03 18:13:16  0.021  0.022  0.023  0.024     1.0
5   02 2020-02-03 18:13:21  0.025  0.026  0.027  0.028     1.0
6   02 2020-02-03 18:13:26  0.023  0.023  0.023  0.023     1.0

Then I define the aggregations so that everything is computed in one shot once the loop has finished.
aggs = {'X':['sum', 'max', 'mean', 'median'], 
        'Y':['sum', 'max', 'mean', 'median'], 
        'Z':['sum', 'max', 'mean', 'median'], 
        'H':['sum', 'max', 'mean', 'median']}

df_final = df1.groupby('int_id').agg(aggs)

            X                            Y                             Z                            H                        
          sum    max    mean median    sum    max     mean median    sum    max    mean median    sum    max     mean  median
int_id                                                                                                                       
1.0     0.732  0.095  0.0366  0.034  0.739  0.102  0.03695  0.034  0.708  0.079  0.0354  0.034  0.827  0.127  0.04135  0.0345
2.0     0.732  0.095  0.0366  0.034  0.739  0.102  0.03695  0.034  0.708  0.079  0.0354  0.034  0.827  0.127  0.04135  0.0345

Note: the columns here form a MultiIndex. You can join them into flat names like this:
df_final.columns = ['_'.join(col).strip() for col in df_final.columns.values]

        X_sum  X_max  X_mean  X_median  Y_sum  Y_max   Y_mean  Y_median  Z_sum  Z_max  Z_mean  Z_median  H_sum  H_max   H_mean  H_median
int_id                                                                                                                                  
1.0     0.732  0.095  0.0366     0.034  0.739  0.102  0.03695     0.034  0.708  0.079  0.0354     0.034  0.827  0.127  0.04135    0.0345
2.0     0.732  0.095  0.0366     0.034  0.739  0.102  0.03695     0.034  0.708  0.079  0.0354     0.034  0.827  0.127  0.04135    0.0345
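
If the iterrows() labelling pass itself turns out to be the bottleneck on the full 17-million-row frame, a non-iterative way to assign int_id is an interval lookup. This is only a sketch: it assumes the intervals in df2 never overlap (pd.IntervalIndex.get_indexer requires non-overlapping intervals), and that df1, df2 and aggs are the objects defined in the snippets above. It also groups by ['int_id', 'ID'] so the statistics stay broken down per ID, matching the df_stats layout in the question.

# Build an IntervalIndex from the start/end columns (closed on both ends,
# matching the behaviour of Series.between).
intervals = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='both')

# For each timestamp, find the position of the interval that contains it
# (-1 means the timestamp falls outside every interval).
pos = intervals.get_indexer(df1['Time'])
inside = pos >= 0

df1.loc[inside, 'int_id'] = df2['int_id'].to_numpy()[pos[inside]]

# Aggregate per interval and per ID, then flatten the MultiIndex columns.
df_stats = df1[inside].groupby(['int_id', 'ID']).agg(aggs)
df_stats.columns = ['_'.join(col) for col in df_stats.columns]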

Thanks for the post. Upvoted. I see the improvement and agree with it. - Joe Ferndz
Thank you both for taking the time to help me! I ran some tests on my case: my df1 has 17,174,122 rows while my df2 has only 1,786 rows... iterating over them with the suggested approach (iterrows) took 7 minutes and 20 seconds! That's slightly better than my current method, but I wonder whether there is an even more efficient way, or whether this is as good as it gets given how large my data is. Thanks again for your help. - VictorSalles
Before trying to speed up the code, it would be useful to figure out which part is the slowest. I would split the code into four functions: the first loads the large dataframe, the second adds the groups to the large dataframe using iterrows(), the next takes the dataframe with the groups added, and the last saves the output. If you run those functions under cProfile you can see which one is the slowest. https://dev59.com/xnRB5IYBdhLWcg3wl4AP - Correy Koshnick
@CorreyKoshnick Sorry, I wasn't clear! I had only timed the iterrows() part that adds the groups to the big df (that loop alone took over 7 minutes)... I've now timed the second part, the groupby aggregation, and it took only 2 seconds! So the slowest part is iterating over the big df. - VictorSalles
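
To illustrate the cProfile suggestion above, here is a minimal sketch. The two stage functions simply wrap the labelling loop and the aggregation from the answer above, and it assumes df1, df2 and aggs are already defined as in those snippets:

import cProfile
import pstats

def label_intervals(data, spans):
    # the iterrows() labelling pass
    for _, row in spans.iterrows():
        data.loc[data.Time.between(row.start, row.end), 'int_id'] = row.int_id
    return data

def aggregate(data):
    # the groupby/agg step
    return data.groupby('int_id').agg(aggs)

profiler = cProfile.Profile()
profiler.enable()
aggregate(label_intervals(df1, df2))
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)  # ten costliest calls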


Here is the full code to do this. I created some sample data to check whether it works. Please run it on your full dataset and see whether it gives the results you need.

  1. Step 1: Create a temporary list to store the intermediate dataframes.

    temp_list = []

  2. Step 2: Iterate over dataframe 2. For each row selected, do the following:

    • Filter the rows of dataframe 1 that fall between the start and end dates

      temp = df1[df1.Time.between(row.start,row.end)]

    • Group by ID and get the statistics for X, Y, Z and H, one groupby per column

      x = temp.groupby('ID')['X'].agg(['min', 'max', 'mean', 'median']).add_prefix('X_').reset_index()

    • Merge all the X, Y, Z, H pieces into a single dataframe.

    • Add the start and end dates to the merged dataframe

    • Append the dataframe to temp_list

  3. Step 3: Build the final dataframe from temp_list

  4. Step 4: Reorder the columns the way you need them: start and end dates first, then ID, then the X values, Y values, Z values, and finally the H values.

  5. Step 5: Print the dataframe

The full code to accomplish this:

c1 = ['ID','Time','X','Y','Z','H']
d1 = [
['01','2020-02-03 18:13:16',0.011,0.012,0.013,0.014],
['01','2020-02-03 18:13:21',0.015,0.016,0.017,0.018],
['01','2020-02-03 18:13:26',0.013,0.013,0.013,0.013],
['01','2020-02-03 18:13:31',0.015,0.015,0.015,0.015],
     
['02','2020-02-03 18:13:16',0.021,0.022,0.023,0.024],
['02','2020-02-03 18:13:21',0.025,0.026,0.027,0.028],
['02','2020-02-03 18:13:26',0.023,0.023,0.023,0.023],
['02','2020-02-03 18:13:31',0.025,0.025,0.025,0.025],
     
['03','2020-02-03 18:13:16',0.031,0.032,0.033,0.034],
['03','2020-02-03 18:13:21',0.035,0.036,0.037,0.038],
['03','2020-02-03 18:13:26',0.033,0.033,0.033,0.033],
['03','2020-02-03 18:13:31',0.035,0.035,0.035,0.035],

['04','2020-02-03 18:13:16',0.041,0.042,0.043,0.044],
['04','2020-02-03 18:13:21',0.045,0.046,0.047,0.048],
['04','2020-02-03 18:13:26',0.043,0.043,0.043,0.043],
['04','2020-02-03 18:13:31',0.045,0.045,0.045,0.045],
     
['05','2020-02-03 18:13:16',0.055,0.047,0.039,0.062],
['05','2020-02-03 18:13:21',0.063,0.063,0.055,0.079],
['05','2020-02-03 18:13:26',0.063,0.063,0.063,0.079],
['05','2020-02-03 18:13:31',0.095,0.102,0.079,0.127],
     
['01','2020-02-03 20:03:16',0.011,0.012,0.013,0.014],
['01','2020-02-03 20:03:21',0.015,0.016,0.017,0.018],
['01','2020-02-03 20:03:26',0.013,0.013,0.013,0.013],
['01','2020-02-03 20:03:31',0.015,0.015,0.015,0.015],
     
['02','2020-02-03 20:03:16',0.021,0.022,0.023,0.024],
['02','2020-02-03 20:03:21',0.025,0.026,0.027,0.028],
['02','2020-02-03 20:03:26',0.023,0.023,0.023,0.023],
['02','2020-02-03 20:03:31',0.025,0.025,0.025,0.025],
     
['03','2020-02-03 20:03:16',0.031,0.032,0.033,0.034],
['03','2020-02-03 20:03:21',0.035,0.036,0.037,0.038],
['03','2020-02-03 20:03:26',0.033,0.033,0.033,0.033],
['03','2020-02-03 20:03:31',0.035,0.035,0.035,0.035],

['04','2020-02-03 20:03:16',0.041,0.042,0.043,0.044],
['04','2020-02-03 20:03:21',0.045,0.046,0.047,0.048],
['04','2020-02-03 20:03:26',0.043,0.043,0.043,0.043],
['04','2020-02-03 20:03:31',0.045,0.045,0.045,0.045],
     
['05','2020-02-03 20:03:16',0.055,0.047,0.039,0.062],
['05','2020-02-03 20:03:21',0.063,0.063,0.055,0.079],
['05','2020-02-03 20:03:26',0.063,0.063,0.063,0.079],
['05','2020-02-03 20:03:31',0.095,0.102,0.079,0.127],
     
['01','2020-07-01 08:59:43',0.063,0.063,0.047,0.079],
['01','2020-07-01 08:59:48',0.055,0.055,0.055,0.079],
['01','2020-07-01 08:59:53',0.071,0.063,0.055,0.082],
['01','2020-07-01 08:59:58',0.063,0.063,0.047,0.082],
['01','2020-07-01 08:59:59',0.047,0.047,0.047,0.071]]

import pandas as pd
df1 = pd.DataFrame(d1,columns=c1)
df1.Time = pd.to_datetime(df1.Time)

c2 = ['int_id','start','end']
d2 = [[1,'2020-02-03 18:11:59','2020-02-03 18:42:00'],
[2,'2020-02-03 19:36:59','2020-02-03 20:06:59'],
[3,'2020-02-03 21:00:59','2020-02-03 21:31:00'],
[4,'2020-02-03 22:38:00','2020-02-03 23:08:00'],
[5,'2020-02-04 05:55:00','2020-02-04 06:24:59'],
[1804,'2021-01-10 13:50:00','2021-01-10 14:20:00'],
[1805,'2021-01-10 18:10:00','2021-01-10 18:40:00'],
[1806,'2021-01-10 19:40:00','2021-01-10 20:10:00'],
[1807,'2021-01-10 21:25:00','2021-01-10 21:55:00'],
[1808,'2021-01-10 22:53:00','2021-01-10 23:23:00']]

from functools import reduce

df2 = pd.DataFrame(d2,columns=c2)

df2.start = pd.to_datetime(df2.start)
df2.end = pd.to_datetime(df2.end)

temp_list = []

for i, row in df2.iterrows():

    temp = df1[df1.Time.between(row.start,row.end)]

    x = temp.groupby('ID')['X'].agg(['min','max','mean','median']).add_prefix('X_').reset_index()
    y = temp.groupby('ID')['Y'].agg(['min','max','mean','median']).add_prefix('Y_').reset_index()
    z = temp.groupby('ID')['Z'].agg(['min','max','mean','median']).add_prefix('Z_').reset_index()
    h = temp.groupby('ID')['H'].agg(['min','max','mean','median']).add_prefix('H_').reset_index()

    data_frames = [x,y,z,h]

    df_merged = reduce(lambda left,right: pd.merge(left,right,on=['ID'],
                            how='outer'), data_frames).fillna('void')

    df_merged['start'] = row.start
    df_merged['end'] = row.end
    
    temp_list.append(df_merged)


df_final = pd.concat(temp_list, ignore_index=True)

column_names = ['start','end','ID',
                    'X_min','X_max','X_mean','X_median',
                    'Y_min','Y_max','Y_mean','Y_median',
                    'Z_min','Z_max','Z_mean','Z_median',
                    'H_min','H_max','H_mean','H_median']

df_final = df_final[column_names]

print (df_final)

The output this produces is:
                start                 end  ID  ...  H_max   H_mean  H_median
0 2020-02-03 18:11:59 2020-02-03 18:42:00  01  ...  0.018  0.01500    0.0145
1 2020-02-03 18:11:59 2020-02-03 18:42:00  02  ...  0.028  0.02500    0.0245
2 2020-02-03 18:11:59 2020-02-03 18:42:00  03  ...  0.038  0.03500    0.0345
3 2020-02-03 18:11:59 2020-02-03 18:42:00  04  ...  0.048  0.04500    0.0445
4 2020-02-03 18:11:59 2020-02-03 18:42:00  05  ...  0.127  0.08675    0.0790
5 2020-02-03 19:36:59 2020-02-03 20:06:59  01  ...  0.018  0.01500    0.0145
6 2020-02-03 19:36:59 2020-02-03 20:06:59  02  ...  0.028  0.02500    0.0245
7 2020-02-03 19:36:59 2020-02-03 20:06:59  03  ...  0.038  0.03500    0.0345
8 2020-02-03 19:36:59 2020-02-03 20:06:59  04  ...  0.048  0.04500    0.0445
9 2020-02-03 19:36:59 2020-02-03 20:06:59  05  ...  0.127  0.08675    0.0790
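
Note: the printout above is truncated by pandas' display settings (the "..." hides most of the statistics columns). One standard way to see every column is to widen the display temporarily, for example:

with pd.option_context('display.max_columns', None):
    print(df_final)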
