Pandas: conditional shift

Question

18

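Is there a way to shift a dataframe column based on a condition on two other columns? Something like this (shiftWhile is a made-up method):

df["cumulated_closed_value"] = df.groupby("user")["close_cumsum"].shiftWhile(df["close_time"] > df["open_time"])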

I have figured out a way to do this, but it is inefficient:

1) Load the data and create the column to shift

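import pandas as pd

# Parse the two time columns as datetimes so they sort and compare correctly
df=pd.read_csv('data.csv',parse_dates=['open_time','close_time'])
# Running total of value in close_time order, per user
df.sort_values(['user','close_time'],inplace=True)
df['close_cumsum']=df.groupby('user')['value'].cumsum()
# Back to open_time order for display
df.sort_values(['user','open_time'],inplace=True)
print(df)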

Output:

   user  open_time close_time  value  close_cumsum
0     1 2017-01-01 2017-03-01      5            18
1     1 2017-01-02 2017-02-01      6             6
2     1 2017-02-03 2017-02-05      7            13
3     1 2017-02-07 2017-04-01      3            21
4     1 2017-09-07 2017-09-11      1            22
5     2 2018-01-01 2018-02-01     15            15
6     2 2018-03-01 2018-04-01      3            18

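2) Use a self-join and a filter to shift the column

Self-join on user (this uses a lot of memory):

df2=pd.merge(df[['user','open_time']],df[['user','close_time','close_cumsum']], on='user')

Keep the rows where 'close_time' < 'open_time', then take the row with the largest close_time.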

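# Keep only the closes that happened before the open
df2=df2[df2['close_time']<df2['open_time']]
# For each (user, open_time), keep the row with the latest close_time
idx = df2.groupby(['user','open_time'])['close_time'].transform('max') == df2['close_time']
df2=df2[idx]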

3) Merge back into the original dataset:

df3=pd.merge(df[['user','open_time','close_time','value']],df2[['user','open_time','close_cumsum']],how='left')
print(df3)

Output:

   user  open_time close_time  value  close_cumsum
0     1 2017-01-01 2017-03-01      5           NaN
1     1 2017-01-02 2017-02-01      6           NaN
2     1 2017-02-03 2017-02-05      7           6.0
3     1 2017-02-07 2017-04-01      3          13.0
4     1 2017-09-07 2017-09-11      1          21.0
5     2 2018-01-01 2018-02-01     15           NaN
6     2 2018-03-01 2018-04-01      3          15.0

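Is there a more "pandas" way to get the same result?

Edit: I have added one row of data to make the case clearer. My goal is to get the sum of all the trades that closed before the open time of the new trade.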
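
To make the target concrete, here is a deliberately naive plain-Python sketch for user 1 from the sample data above. The helper name `closed_value_before` is made up for illustration, and it assumes "before" means strictly earlier than the open time.

```python
from datetime import date

# (open_time, close_time, value) for user 1, copied from the sample data
trades = [
    (date(2017, 1, 1), date(2017, 3, 1), 5),
    (date(2017, 1, 2), date(2017, 2, 1), 6),
    (date(2017, 2, 3), date(2017, 2, 5), 7),
    (date(2017, 2, 7), date(2017, 4, 1), 3),
    (date(2017, 9, 7), date(2017, 9, 11), 1),
]


def closed_value_before(open_time, trades):
    """Sum the values of all trades that closed strictly before open_time."""
    return sum(value for _, close_time, value in trades if close_time < open_time)


targets = [closed_value_before(open_time, trades) for open_time, _, _ in trades]
print(targets)  # [0, 0, 6, 13, 21]
```

This yields 0 where the pandas output above shows NaN; otherwise the numbers are the ones I am after.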

- riccardo nizzolo

1

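Is there something wrong with @Wen's answer? It looks like the bounty was added after wen's answer, but I don't see any problem with it. If you want something more or different, please elaborate. - JohnE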

1

OK, since you have changed the question, I will update my answer. - BENY

3 Answers

8

I made a change to your test case that I think you should include. This solution handles your edit.

import pandas as pd
import numpy as np
df = pd.read_csv("cond_shift.csv")
df

Input:

   user open_time   close_time  value
0   1   12/30/2016  12/31/2016  1
1   1   1/1/2017    3/1/2017    5
2   1   1/2/2017    2/1/2017    6
3   1   2/3/2017    2/5/2017    7
4   1   2/7/2017    4/1/2017    3
5   1   9/7/2017    9/11/2017   1
6   2   1/1/2018    2/1/2018    15
7   2   3/1/2018    4/1/2018    3

Create the column to shift:

df["open_time"] = pd.to_datetime(df["open_time"])
df["close_time"] = pd.to_datetime(df["close_time"])
df.sort_values(['user','close_time'],inplace=True)
df['close_cumsum']=df.groupby('user')['value'].cumsum()
df.sort_values(['user','open_time'],inplace=True)
df


   user open_time   close_time  value   close_cumsum
0   1   2016-12-30  2016-12-31  1       1
1   1   2017-01-01  2017-03-01  5       19
2   1   2017-01-02  2017-02-01  6       7
3   1   2017-02-03  2017-02-05  7       14
4   1   2017-02-07  2017-04-01  3       22
5   1   2017-09-07  2017-09-11  1       23
6   2   2018-01-01  2018-02-01  15      15
7   2   2018-03-01  2018-04-01  3       18

Shift the column (explanation below):

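# Previous row's cumulative closed value, within each user
df["cumulated_closed_value"] = df.groupby("user")["close_cumsum"].shift()
# True where the previous trade had not closed before this one opened (or there is no previous trade)
condition = ~(df.groupby("user")['close_time'].shift() < df["open_time"])
df.loc[condition, "cumulated_closed_value"] = None
# Carry the last valid value forward within each user; use 0 where there is none
df["cumulated_closed_value"] = df.groupby("user")["cumulated_closed_value"].ffill().fillna(0)
df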


   user open_time   close_time  value   close_cumsum    cumulated_closed_value
0   1   2016-12-30  2016-12-31  1       1               0.0
1   1   2017-01-01  2017-03-01  5       19              1.0
2   1   2017-01-02  2017-02-01  6       7               1.0
3   1   2017-02-03  2017-02-05  7       14              7.0
4   1   2017-02-07  2017-04-01  3       22              14.0
5   1   2017-09-07  2017-09-11  1       23              22.0
6   2   2018-01-01  2018-02-01  15      15              0.0
7   2   2018-03-01  2018-04-01  3       18              15.0

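All of this is written so that it works across all users at once. I think the logic is easier to follow if you focus on one user at a time.

1) Assume no events overlap. Then the result is simply the cumulative sum shifted down one row.
2) Blank out the values for events that overlap the previous event.
3) Fill the missing values with a forward fill.

Test this thoroughly before using it. Time intervals are tricky and have a lot of edge cases.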
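
One edge case that may be worth a test is a trade that opens early but closes late, after a later-opened trade has already closed. The sketch below is not part of the solution above. It is an independent cross-check that counts, per user, how many trades closed strictly before each open time using `np.searchsorted`. The helper name `closed_before_open` and the three-row data set are made up for illustration.

```python
import numpy as np
import pandas as pd


def closed_before_open(group):
    """Per row: sum of `value` over the group's trades with close_time < open_time."""
    by_close = group.sort_values("close_time")
    close_times = by_close["close_time"].to_numpy()
    # running[k] = total value of the k earliest-closing trades (running[0] = 0)
    running = np.concatenate(([0], by_close["value"].cumsum().to_numpy()))
    # side="left" counts the close times strictly smaller than each open time
    n_closed = np.searchsorted(close_times, group["open_time"].to_numpy(), side="left")
    return pd.Series(running[n_closed], index=group.index)


# Hypothetical data: trade 0 opens first but closes after trade 1
edge = pd.DataFrame({
    "user": [1, 1, 1],
    "open_time": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-11"]),
    "close_time": pd.to_datetime(["2017-01-10", "2017-01-03", "2017-01-20"]),
    "value": [1, 2, 4],
})

# Apply per user and align the pieces back on the original index
edge["expected"] = pd.concat([closed_before_open(g) for _, g in edge.groupby("user")])
print(edge)
```

On this data the third row should come out as 3, since trades 0 and 1 both closed before 2017-01-11. Tracing the shift-based code by hand, I get 2 for that row, because it only looks at the previous row in open_time order, so a case like this seems worth adding to your tests.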

- Gabriel A

6

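(Note: I think @wen's answer is fine, so I'm not sure whether the original question is looking for something more or different. In any case, here is an alternative using merge_asof that should also work well.)

First, reshape the dataframes as follows: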

lookup = ( df[['close_time','value','user']].set_index(['user','close_time'])
           .sort_index().groupby('user').cumsum().reset_index(0) )

df = df.set_index('open_time').sort_index()

"lookup" is conceptually simple: sort by "close_time" and then take a (grouped) cumulative sum:

            user  value
close_time             
2017-02-01     1      6
2017-02-05     1     13
2017-03-01     1     18
2017-04-01     1     21
2017-09-11     1     22
2018-02-01     2     15
2018-04-01     2     18

For "df", we just take a subset of the original dataset:

            user close_time  value
open_time                         
2017-01-01     1 2017-03-01      5
2017-01-02     1 2017-02-01      6
2017-02-03     1 2017-02-05      7
2017-02-07     1 2017-04-01      3
2017-09-07     1 2017-09-11      1
2018-01-01     2 2018-02-01     15
2018-03-01     2 2018-04-01      3

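From here, conceptually you just want to merge the two datasets on "user" and "open_time"/"close_time". The complicating factor is that we don't want an exact match on time, but rather a kind of "nearest" match.

For these sorts of merges you can use merge_asof, which is a great tool for a variety of non-exact matches (including "nearest", "backward" and "forward"). Because of the per-user grouping, the version below loops over the users, but it is still fairly simple code to read: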

# Merge each user separately, then stack the pieces
# (DataFrame.append was removed in pandas 2.0, so collect and concat instead)
pieces = []

for u in df['user'].unique():
    pieces.append( pd.merge_asof( df[df.user==u],  lookup[lookup.user==u],
                                  left_index=True, right_index=True,
                                  direction='backward' ) )

df_merged = pd.concat(pieces)
df_merged.drop('user_y',axis=1).rename({'value_y':'close_cumsum'},axis=1)

Result:

            user_x close_time  value_x  close_cumsum
open_time                                           
2017-01-01       1 2017-03-01        5           NaN
2017-01-02       1 2017-02-01        6           NaN
2017-02-03       1 2017-02-05        7           6.0
2017-02-07       1 2017-04-01        3          13.0
2017-09-07       1 2017-09-11        1          21.0
2018-01-01       2 2018-02-01       15           NaN
2018-03-01       2 2018-04-01        3          15.0
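
A variation that may be worth trying: merge_asof also takes a `by` argument, which matches within each user and so avoids the Python loop, and `allow_exact_matches=False`, which excludes a close that falls exactly on the open time. Here is a sketch under those assumptions. It rebuilds the question's sample data so that it runs on its own, and the names `trades`, `closed` and `last_close` are mine.

```python
import pandas as pd

trades = pd.DataFrame({
    "user": [1, 1, 1, 1, 1, 2, 2],
    "open_time": pd.to_datetime([
        "2017-01-01", "2017-01-02", "2017-02-03", "2017-02-07",
        "2017-09-07", "2018-01-01", "2018-03-01",
    ]),
    "close_time": pd.to_datetime([
        "2017-03-01", "2017-02-01", "2017-02-05", "2017-04-01",
        "2017-09-11", "2018-02-01", "2018-04-01",
    ]),
    "value": [5, 6, 7, 3, 1, 15, 3],
})

# Right side: per-user running total of value in close_time order
closed = trades.sort_values("close_time")[["user", "close_time", "value"]]
closed = closed.assign(close_cumsum=closed.groupby("user")["value"].cumsum())
closed = closed.drop(columns="value").rename(columns={"close_time": "last_close"})

# Both sides must be sorted by their time key; `by` restricts matches to the same user
merged = pd.merge_asof(
    trades.sort_values("open_time"),
    closed,
    left_on="open_time",
    right_on="last_close",
    by="user",
    direction="backward",
    allow_exact_matches=False,  # only closes strictly before the open
)
merged = merged.sort_values(["user", "open_time"]).reset_index(drop=True)
print(merged[["user", "open_time", "close_time", "value", "close_cumsum"]])
```

Rows with no earlier close keep NaN in close_cumsum, as in the result above.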

- JohnE

- BENY · Accepted Answer

Here I use a new helper column to record the condition df2['close_time']<df2['open_time']

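# True when the next row opens at least a day after this row closes
df['New']=((df.open_time-df.close_time.shift()).dt.days>0).shift(-1)
# Per user: running sum of the flagged values, shifted down one row
s=df.groupby('user').apply(lambda x : (x['value']*x['New']).cumsum().shift()).reset_index(level=0,drop=True)
# Keep the sum only where the previous row was flagged
s.loc[~(df.New.shift()==True)]=np.nan

df['Cumsum']=s
df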

Out[1043]: 
   user  open_time close_time  value    New Cumsum
0     1 2017-01-01 2017-03-01      5  False    NaN
1     1 2017-01-02 2017-02-01      6   True    NaN
2     1 2017-02-03 2017-02-05      7   True      6
3     1 2017-02-07 2017-04-01      3  False     13
4     2 2017-01-01 2017-02-01     15   True    NaN
5     2 2017-03-01 2017-04-01      3    NaN     15

Update: since the question was changed (using the data from Gabriel A)

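# For every row, attach all close times and all values of the same user
df['New']=df.user.map(df.groupby('user').close_time.apply(lambda x: np.array(x)))
df['New1']=df.user.map(df.groupby('user').value.apply(lambda x: np.array(x)))
# New2[j] is True when this row's open_time is after the user's j-th close_time
df['New2']=[[x>m for m in y] for x,y in zip(df['open_time'],df['New'])  ]
# Sum the values of the trades that had already closed
df['Yourtarget']=list(map(sum,df['New2']*df['New1'].values))
df.drop(['New','New1','New2'],axis=1)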


Out[1376]: 
   user  open_time close_time  value  Yourtarget
0     1 2016-12-30 2016-12-31      1           0
1     1 2017-01-01 2017-03-01      5           1
2     1 2017-01-02 2017-02-01      6           1
3     1 2017-02-03 2017-02-05      7           7
4     1 2017-02-07 2017-04-01      3          14
5     1 2017-09-07 2017-09-11      1          22
6     2 2018-01-01 2018-02-01     15           0
7     2 2018-03-01 2018-04-01      3          15
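
The same pairwise comparison can also be sketched with NumPy broadcasting, which avoids storing Python lists in helper columns. This is a variation on the idea above rather than the code from this answer. It still builds an n-by-n matrix per user, so memory grows quadratically with the number of trades per user. The helper name `closed_value_before_open` is made up, and the data is rebuilt from Gabriel A's table so that the block runs on its own.

```python
import numpy as np
import pandas as pd

trades = pd.DataFrame({
    "user": [1, 1, 1, 1, 1, 1, 2, 2],
    "open_time": pd.to_datetime([
        "2016-12-30", "2017-01-01", "2017-01-02", "2017-02-03",
        "2017-02-07", "2017-09-07", "2018-01-01", "2018-03-01",
    ]),
    "close_time": pd.to_datetime([
        "2016-12-31", "2017-03-01", "2017-02-01", "2017-02-05",
        "2017-04-01", "2017-09-11", "2018-02-01", "2018-04-01",
    ]),
    "value": [1, 5, 6, 7, 3, 1, 15, 3],
})


def closed_value_before_open(group):
    """Per row: sum of `value` over the group's trades with close_time < open_time."""
    opens = group["open_time"].to_numpy()
    closes = group["close_time"].to_numpy()
    values = group["value"].to_numpy()
    # closed[i, j] is True when trade j closed strictly before trade i opened
    closed = closes[None, :] < opens[:, None]
    # Multiply each column j by values[j], then sum across each row
    return pd.Series((closed * values).sum(axis=1), index=group.index)


# Apply per user and align the pieces back on the original index
trades["Yourtarget"] = pd.concat(
    [closed_value_before_open(g) for _, g in trades.groupby("user")]
)
print(trades)
```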