Pandas - count since last transaction

Question

I have a dataframe (call it txn_df) containing monetary transaction records. These are the columns that matter for this question:

txn_year    txn_month   custid  withdraw    deposit
2011        4           123     0.0         100.0
2011        5           123     0.0         0.0
2011        6           123     0.0         0.0
2011        7           123     50.1        0.0
2011        8           123     0.0         0.0

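Assume there are multiple customers. A row where both `withdraw` and `deposit` are 0.0 means no transaction took place that month. What I want is a new column indicating how many months have passed since the last transaction, something like this: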

txn_year    txn_month   custid  withdraw    deposit     num_months_since_last_txn
2011        4           123     0.0         100.0       0
2011        5           123     0.0         0.0         1
2011        6           123     0.0         0.0         2           
2011        7           123     50.1        0.0         3
2011        8           123     0.0         0.0         1

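The only approach I can think of so far is to create a new column has_txn (1/0 or True/False) that is set whenever either withdraw or deposit is > 0.0, but I can't work out how to continue from there.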

- oikonomiyaki

1 Answer

- Mohamed Thasin ah · Accepted Answer

One way to solve this is:

df['series'] =  df[['withdraw','deposit']].ne(0).sum(axis=1)
m = df['series']>=1

As @Chris A pointed out in the comments,

m = df[['withdraw','deposit']].gt(0).any(axis=1)  # replacement for the snippet above

df['num_months_since_last_txn'] = df.groupby(m.cumsum()).cumcount()
df.loc[df['num_months_since_last_txn']==0,'num_months_since_last_txn']=(df['num_months_since_last_txn']+1).shift(1).fillna(0)
print(df)

Output:

   txn_year  txn_month  custid  withdraw  deposit
0      2011          4     123       0.0    100.0
1      2011          5     123       0.0      0.0
2      2011          6     123       0.0      0.0
3      2011          7     123      50.1      0.0
4      2011          8     123       0.0      0.0
   txn_year  txn_month  custid  withdraw  deposit  num_months_since_last_txn
0      2011          4     123       0.0    100.0                        0.0
1      2011          5     123       0.0      0.0                        1.0
2      2011          6     123       0.0      0.0                        2.0
3      2011          7     123      50.1      0.0                        3.0
4      2011          8     123       0.0      0.0                        1.0

Explanation:

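Use ne to detect whether a transaction occurred in each column, and sum across the row to get a per-row count; comparing it with >= 1 gives a boolean flag.
Each row flagged as a transaction starts a new group; groupby, cumsum and cumcount then produce the series 0, 1, 2, ..., n within each group.
Use .loc to overwrite the rows whose value is 0 with the previous row's count plus one.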
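
The three steps above can be traced on the sample data from the question. This is only an illustrative sketch: the names block, count, carried and result are introduced here for readability and are not part of the original answer, and Series.where is used in place of the .loc assignment (it should be equivalent on this data).

```python
import pandas as pd

df = pd.DataFrame({
    "txn_year": [2011] * 5,
    "txn_month": [4, 5, 6, 7, 8],
    "custid": [123] * 5,
    "withdraw": [0.0, 0.0, 0.0, 50.1, 0.0],
    "deposit": [100.0, 0.0, 0.0, 0.0, 0.0],
})

# Step 1: True on rows where a transaction happened.
m = df[["withdraw", "deposit"]].gt(0).any(axis=1)

# Step 2: every transaction starts a new block; cumcount numbers the rows
# inside each block starting from 0.
block = m.cumsum()                    # 1, 1, 1, 2, 2
count = df.groupby(block).cumcount()  # 0, 1, 2, 0, 1

# Step 3: a row with count 0 takes the previous row's count plus one;
# the very first row has no predecessor, so it falls back to 0.
carried = (count + 1).shift(1).fillna(0)
result = count.where(count != 0, carried).astype(int)

print(result.tolist())  # [0, 1, 2, 3, 1]
```

The final astype(int) is optional; without it the column is float, as in the output shown above, because shift introduces a NaN.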

Note: I may have made this more complex than the problem requires, but it should give you an idea of how to approach it.

Solution that takes the customer ID into account:

# Sort so that each customer's rows are contiguous and in month order.
df = df.sort_values(by=['custid', 'txn_month'])
# True on the first row of each customer.
mask = ~df.duplicated(subset=['custid'], keep='first')
m = df[['withdraw', 'deposit']].gt(0).any(axis=1)
df['num_months_since_last_txn'] = df.groupby(m.cumsum()).cumcount()
df.loc[df['num_months_since_last_txn'] == 0, 'num_months_since_last_txn'] = (df['num_months_since_last_txn'] + 1).shift(1)
# Reset each customer's first row to 0.
df.loc[mask, 'num_months_since_last_txn'] = 0

Sample input:

   txn_year  txn_month  custid  withdraw  deposit
0      2011          4     123       0.0    100.0
1      2011          5     123       0.0      0.0
2      2011          4    1245       0.0    100.0
3      2011          5    1245       0.0      0.0
4      2011          6     123       0.0      0.0
5      2011          7    1245      50.1      0.0
6      2011          7     123      50.1      0.0
7      2011          8     123       0.0      0.0
8      2011          6    1245       0.0      0.0
9      2011          8    1245       0.0      0.0

Sample output:

   txn_year  txn_month  custid  withdraw  deposit  num_months_since_last_txn
0      2011          4     123       0.0    100.0                        0.0
1      2011          5     123       0.0      0.0                        1.0
4      2011          6     123       0.0      0.0                        2.0
6      2011          7     123      50.1      0.0                        3.0
7      2011          8     123       0.0      0.0                        1.0
2      2011          4    1245       0.0    100.0                        0.0
3      2011          5    1245       0.0      0.0                        1.0
8      2011          6    1245       0.0      0.0                        2.0
5      2011          7    1245      50.1      0.0                        3.0
9      2011          8    1245       0.0      0.0                        1.0

Explanation of the customer ID solution:

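The code above works on the intervals between consecutive 1s, that is, between transactions. To keep each customer's rows together and in order, sort df by custid and txn_month; txn_year should be included in the sort as well once the data spans more than one year.
fillna(0) does not help here, because shift does not produce a NaN at the start of the next customer. To reset those rows to 0, find the duplicated customer IDs and set the first row of each customer to 0.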
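
As a possible variation (not the code from the answer), the counting can be done inside each customer group, so that a count never carries over from one customer to the next even when a customer's first row has no transaction. The sketch below assumes one row per customer per month with no missing months, and adds txn_year to the sort keys; the function name months_since_last_txn is hypothetical.

```python
import pandas as pd


def months_since_last_txn(df):
    """Hypothetical helper: rows since the last transaction, per customer."""
    out = df.sort_values(["custid", "txn_year", "txn_month"]).copy()
    has_txn = out[["withdraw", "deposit"]].gt(0).any(axis=1)
    custid = out["custid"]

    # Number the transaction blocks separately for each customer, so a count
    # never carries over from one customer to the next.
    block = has_txn.astype(int).groupby(custid).cumsum()
    count = out.groupby([custid, block]).cumcount()

    # On a transaction row, report the previous row's count plus one; shifting
    # within each customer leaves NaN on the first row, which becomes 0.
    carried = (count + 1).groupby(custid).shift(1).fillna(0)
    out["num_months_since_last_txn"] = count.where(~has_txn, carried).astype(int)
    return out


# The sample input from the answer above.
sample = pd.DataFrame({
    "txn_year": [2011] * 10,
    "txn_month": [4, 5, 4, 5, 6, 7, 7, 8, 6, 8],
    "custid": [123, 123, 1245, 1245, 123, 1245, 123, 123, 1245, 1245],
    "withdraw": [0.0, 0.0, 0.0, 0.0, 0.0, 50.1, 50.1, 0.0, 0.0, 0.0],
    "deposit": [100.0, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
})

result = months_since_last_txn(sample)
print(result["num_months_since_last_txn"].tolist())
# [0, 1, 2, 3, 1, 0, 1, 2, 3, 1]
```

For a customer whose history starts with months that have no transaction, this sketch counts from the customer's first row, since there is no earlier transaction to measure from.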