使用pandas按条件删除行

3

我创建了一些数据,看起来像这样:

import pandas as pd
d = {'Time': ['01.10.2019, 09:56:52', '01.10.2019, 09:57:15', '02.10.2019 09:57:23', '02.10.2019 10:02:58', '02.10.2019 13:11:58', '02.10.2019 13:22:55', '03.10.2019, 09:56:52', '03.10.2019, 09:57:15', '04.10.2019 09:57:23', '04.10.2019 10:02:58', '04.10.2019 13:11:58', '04.10.2019 13:22:55']
     ,'Action': ['Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed']
     ,'Name': ['Bayer', 'Bayer', 'ITM', 'ITM', 'ITM' , 'ITM', 'ITM', 'ITM', 'Treso', 'Treso', 'Geco' , 'Geco']}
df = pd.DataFrame(data=d)

     Time                    Action    Name
0    01.10.2019, 09:56:52    Opened    Bayer
1    01.10.2019, 09:57:15    Closed    Bayer
2    02.10.2019, 09:57:23    Opened    ITM
3    02.10.2019, 10:03:58    Closed    ITM
4    02.10.2019, 13:11:58    Opened    ITM
5    02.10.2019, 13:22:55    Closed    ITM
6    03.10.2019, 09:56:52    Opened    ITM
7    03.10.2019, 09:57:15    Closed    ITM
8    04.10.2019, 09:57:23    Opened    Treso
9    04.10.2019, 10:03:58    Closed    Treso
10    04.10.2019, 13:11:58    Opened    Geco
11    04.10.2019, 13:22:55    Closed    Geco

现在我想根据以下条件删除数据:

  • 如果打开和关闭之间的时间不到5分钟,并且名称相同,则应将其删除
  • 如果有一个打开动作,名称相同并且在关闭后重复出现,并且在同一天 -> 应删除所有具有相同名称的内容,从第一个打开到最后一个打开。例如,线2到线5应该被删除,但不是直到线7,因为它是一天后。

例如,第二个条件:如果存在此输入:

     Time                    Action    Name
0    02.10.2019, 09:57:23    Opened    ITM
1    02.10.2019, 10:03:58    Closed    ITM
2    02.10.2019, 13:11:58    Opened    ITM
3    02.10.2019, 13:22:55    Closed    ITM
4    03.10.2019, 09:56:52    Opened    ITM
5    03.10.2019, 09:57:15    Closed    ITM

我的输出应该像这样:

0    02.10.2019, 13:11:58    Opened    ITM
1    02.10.2019, 13:22:55    Closed    ITM
2    03.10.2019, 09:56:52    Opened    ITM
3    03.10.2019, 09:57:15    Closed    ITM

因为已经到了第二天,所以从10月2日到3日的时间跨度存在,而其他时间跨度均不超过5分钟。

但如果我们遇到以下情况:

0    02.10.2019, 09:57:23    Opened    ITM
1    02.10.2019, 10:03:58    Closed    ITM
2    02.10.2019, 13:11:58    Opened    ITM
3    02.10.2019, 13:22:55    Closed    ITM
4    02.10.2019, 09:56:52    Opened    ITM
5    02.10.2019, 09:57:15    Closed    ITM

只保留第二行和第三行的数据,其余数据应删除:

2    02.10.2019, 13:11:58    Opened    ITM
3    02.10.2019, 13:22:55    Closed    ITM

My wished output should be like this:

     Time                    Action    Name
0    02.10.2019, 09:57:23    Opened    ITM
3    02.10.2019, 13:22:55    Closed    ITM
4    03.10.2019, 09:56:52    Opened    ITM
5    03.10.2019, 09:57:15    Closed    ITM
6    04.10.2019, 09:57:23    Opened    Treso
7    04.10.2019, 10:03:58    Closed    Treso
8    04.10.2019, 13:11:58    Opened    Geco
9    04.10.2019, 13:22:55    Closed    Geco

我的尝试:

df_new = df.assign(group=pd.to_datetime(df["Time"]).diff().dt.seconds.gt(300).cumsum()).groupby(["group", 
                                                                                                    "Time", 
                                                                                                    "Action",
                                                                                                    "Name"]).first()

有人能帮帮我吗?


1
开放和关闭状态是否总是连续的? - Yuca
是的,它应该每次打开和关闭,因此它们应该是连续的。 - user14576365
1
如果始终连续,似乎shift()可以处理小于5分钟的情况。 - noah
1
你能够设置一个类似于23:59开放,00:04关闭的时间吗? - Yuca
这个时间段不可能发生,因为最后的时间戳是17:30。 - user14576365
显示剩余8条评论
1个回答

0
假设您的逻辑需要:
  1. 消除所有相隔不到5分钟的内容。
  2. 从剩余值中删除那些在一天内有多个开放时间的名称:

基于您的数据:

import pandas as pd
d = {'Time': ['01.10.2019, 09:56:52', '01.10.2019, 09:57:15', '02.10.2019 09:57:23', '02.10.2019 10:02:58', '02.10.2019 13:11:58', '02.10.2019 13:22:55', '03.10.2019, 09:56:52', '03.10.2019, 09:57:15', '04.10.2019 09:57:23', '04.10.2019 10:02:58', '04.10.2019 13:11:58', '04.10.2019 13:22:55']
     ,'Action': ['Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed']
     ,'Name': ['Bayer', 'Bayer', 'ITM', 'ITM', 'ITM' , 'ITM', 'ITM', 'ITM', 'Treso', 'Treso', 'Geco' , 'Geco']}
df = pd.DataFrame(data=d)

首先进行一些转换和排序,以确保数据处于正确的顺序。

## convert time to datetime
df['Time']=pd.to_datetime(df['Time'], dayfirst=True)

# get date
df['Date'] = df['Time'].dt.date

## make sure it's sorted by Name, and time:
df = df.sort_values(['Name', 'Time'])

## get time till next action for same client
df['time_to_next_action'] = \
np.where(((df['Name']==df['Name'].shift(-1)) & (df['Action']=='Opened')), df['Time'].shift(-1) - df['Time'], 0)

## convert time difference to minutes
df['time_to_next_action'] = df['time_to_next_action'].dt.total_seconds()/60.0

# CONDITION 1:

#delete entries under 5 minutes:

df = df[np.where(
    (
    (
        (df['Action']=='Opened') & (df['time_to_next_action']<5)
    )
    |
    (
        (df['Action']=='Closed') & (df['time_to_next_action'].shift(1)<5)
    )
    ), False, True
    )]

### explanation:  
## two possibilities: 
# 1. if action is Open and time to next action is less than 5 minutes delete it 
# 2. if action is 'Close' and time delta from previous action is less than 5 minutes, delete it

### CONDITION 2 edited based on comments:

## keep only first 'Opened'
df_first_open = df[df['Action']=='Opened'].sort_values(['Name', 'Time']).drop_duplicates(subset=['Name', 'Date', 'Action'])

## keep only last 'Closed'
df_last_close = df[df['Action']=='Closed'].sort_values(['Name', 'Time'], ascending = False).drop_duplicates(subset=['Name', 'Date', 'Action'])

## combine and sort the two
df = pd.concat([df_first_open, df_last_close]).sort_values(['Name', 'Time'])

# OPTIONAL: you can drop the extra columns:
df = df.drop(columns=['Date', 'time_to_next_action'])

print(df)

新输出:

Time    Action  Name
10  2019-10-04 13:11:58 Opened  Geco
11  2019-10-04 13:22:55 Closed  Geco
2   2019-10-02 09:57:23 Opened  ITM
5   2019-10-02 13:22:55 Closed  ITM
8   2019-10-04 09:57:23 Opened  Treso
9   2019-10-04 10:02:58 Closed  Treso

作为记录,原始条件2是:

## get date/name combinations that had only 1 'Open' per day:
df_to_keep = df[df['Action']=='Opened'].groupby(['Name', 'Action', 'Date']).count().reset_index()
df_to_keep = df_to_keep[np.where(df_to_keep['Time']==1, True, False)]

# those are the ones you'll keep in final output:
df = pd.merge(df_to_keep[['Name', 'Date']], df, how='left', on=['Name', 'Date'])

谢谢,只有一个错误!如果只有两个值(打开和关闭),则应删除少于5分钟的时间。如果有多于两个值,则取第一个打开状态和最后一个关闭状态,并删除中间所有内容。 - user14576365
啊...那是另一回事。我编辑了答案。 - yulGM

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接