Pandas: group by date on one column and compute a running count of a specific value in another column

4

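I want to group a Pandas DataFrame by a datetime column and, based on that grouping, count the occurrences of a specific value in another column. Suppose I have the following DataFrame: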

df = pd.DataFrame({
    "customer": [
         "A", "A", "A", "A", "A", "B", "C", "C"        
    ],
    "datetime": pd.to_datetime([
        "2020-01-01 00:00:00", "2020-01-02 00:00:00", "2020-01-02 01:00:00", "2020-01-03 00:00:00", "2020-01-04 00:00:00", "2020-01-03 00:00:00", "2020-01-03 00:00:00", "2020-01-04 00:00:00"         
    ]),
    "enabled": [
      True, True, False, True, True, True, False, True            
    ]    
})

The DataFrame looks like this:
customer    datetime                enabled
A           2020-01-01 00:00:00     True
A           2020-01-02 00:00:00     True
A           2020-01-02 01:00:00     False
A           2020-01-03 00:00:00     True
A           2020-01-04 00:00:00     True
B           2020-01-03 00:00:00     True
C           2020-01-03 00:00:00     False
C           2020-01-04 00:00:00     True

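At the end of each day, I want to count the number of enabled customers. Once a customer is enabled, it stays enabled unless a later row has enabled==False. The expected output is: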
day           count_enabled_customers
2020-01-01    1      # A
2020-01-02    0      # A has been disabled
2020-01-03    2      # A, B
2020-01-04    3      # A, B, C

Does anyone have an idea how to approach this? Thanks in advance!


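Shouldn't the count for 2020-01-04 be 2 (A, C)? - sushanth
@Sushanth Customer B was enabled on "2020-01-03" and was not disabled afterwards, so it stays enabled on the following days, if that makes sense. - david fdez
1 Answer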

6

Starting from your DataFrame:

import pandas as pd

df = pd.DataFrame({
    "customer": [
         "A", "A", "A", "A", "A", "B", "C", "C"        
    ],
    "datetime": pd.to_datetime([
        "2020-01-01 00:00:00", "2020-01-02 00:00:00", "2020-01-02 01:00:00", "2020-01-03 00:00:00", "2020-01-04 00:00:00", "2020-01-03 00:00:00", "2020-01-03 00:00:00", "2020-01-04 00:00:00"         
    ]),
    "enabled": [
      True, True, False, True, True, True, False, True            
    ]    
})

print(df)

Out:
  customer            datetime  enabled
0        A 2020-01-01 00:00:00     True
1        A 2020-01-02 00:00:00     True
2        A 2020-01-02 01:00:00    False
3        A 2020-01-03 00:00:00     True
4        A 2020-01-04 00:00:00     True
5        B 2020-01-03 00:00:00     True
6        C 2020-01-03 00:00:00    False
7        C 2020-01-04 00:00:00     True

Use a pivot to turn the customers into columns, with the datetimes as the index.
a = df.pivot(index='datetime', columns='customer', values='enabled')
print(a)

Out:
customer                 A     B      C
datetime                               
2020-01-01 00:00:00   True   NaN    NaN
2020-01-02 00:00:00   True   NaN    NaN
2020-01-02 01:00:00  False   NaN    NaN
2020-01-03 00:00:00   True  True  False
2020-01-04 00:00:00   True   NaN   True
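
One caveat with this step: `df.pivot` raises a `ValueError` when the same (datetime, customer) pair appears more than once. The sample data has no such duplicates, so the following is only a sketch of one possible workaround. The `events` frame is hypothetical data made up for illustration, and keeping the last duplicate row is an assumption about what would be wanted.

```python
import pandas as pd

# Hypothetical data: customer A has two rows with the same timestamp.
events = pd.DataFrame({
    "customer": ["A", "A", "B"],
    "datetime": pd.to_datetime([
        "2020-01-01 00:00:00", "2020-01-01 00:00:00", "2020-01-01 00:00:00"
    ]),
    "enabled": [True, False, True],
})

# pivot cannot reshape when an index/column pair is duplicated.
try:
    events.pivot(index="datetime", columns="customer", values="enabled")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# Assumption: the row that appears last is the one that should win.
deduped = events.drop_duplicates(subset=["datetime", "customer"], keep="last")
wide = deduped.pivot(index="datetime", columns="customer", values="enabled")
print(wide)
```

Whether "last row wins" is the right rule depends on how the data was produced, so treat that choice as a placeholder.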

Create an index of the timestamps you are interested in (the last second of each day).
dates = pd.date_range(df.datetime.min().date(), df.datetime.max().date() + pd.offsets.Day(1), freq='D') - pd.offsets.Second(1)
print(dates)

Out:
DatetimeIndex(['2019-12-31 23:59:59', '2020-01-01 23:59:59',
               '2020-01-02 23:59:59', '2020-01-03 23:59:59',
               '2020-01-04 23:59:59'],
              dtype='datetime64[ns]', freq='D')

Add those timestamps to the index and sort it, so that the next step can fill them in with ffill.
a = a.reindex(a.index.union(dates)).sort_index()
print(a)

Out:
customer                 A     B      C
2019-12-31 23:59:59    NaN   NaN    NaN
2020-01-01 00:00:00   True   NaN    NaN
2020-01-01 23:59:59    NaN   NaN    NaN
2020-01-02 00:00:00   True   NaN    NaN
2020-01-02 01:00:00  False   NaN    NaN
2020-01-02 23:59:59    NaN   NaN    NaN
2020-01-03 00:00:00   True  True  False
2020-01-03 23:59:59    NaN   NaN    NaN
2020-01-04 00:00:00   True   NaN   True
2020-01-04 23:59:59    NaN   NaN    NaN


Forward-fill the last known enabled state into the later timestamps.
a = a.ffill()
print(a)

Out: 
customer                 A     B      C
2019-12-31 23:59:59    NaN   NaN    NaN
2020-01-01 00:00:00   True   NaN    NaN
2020-01-01 23:59:59   True   NaN    NaN
2020-01-02 00:00:00   True   NaN    NaN
2020-01-02 01:00:00  False   NaN    NaN
2020-01-02 23:59:59  False   NaN    NaN
2020-01-03 00:00:00   True  True  False
2020-01-03 23:59:59   True  True  False
2020-01-04 00:00:00   True  True   True
2020-01-04 23:59:59   True  True   True

For the timestamps that mark the end of each day, sum across the columns.

result = a.loc[dates].sum(axis=1)
print(result)

Out:
2019-12-31 23:59:59    0.0
2020-01-01 23:59:59    1.0
2020-01-02 23:59:59    0.0
2020-01-03 23:59:59    2.0
2020-01-04 23:59:59    3.0
Freq: D, dtype: float64
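
For convenience, the steps above could be collected into a single function. This is a minimal sketch rather than a definitive implementation: the function name `count_enabled_per_day`, the cast to float before filling, and the relabelling of the index to plain dates are my own additions. It also starts at the end of the first day, so it has no 2019-12-31 row.

```python
import pandas as pd


def count_enabled_per_day(df):
    """Count customers whose latest known state is enabled at each day's end."""
    # One column per customer; cast to float so True/False/NaN become 1.0/0.0/NaN.
    state = df.pivot(index="datetime", columns="customer", values="enabled")
    state = state.astype(float)

    # Last second (23:59:59) of every calendar day covered by the data.
    first_day = df["datetime"].min().normalize()
    last_day = df["datetime"].max().normalize()
    day_ends = (
        pd.date_range(first_day, last_day, freq="D")
        + pd.Timedelta(days=1)
        - pd.Timedelta(seconds=1)
    )

    # Add the end-of-day rows, then carry each customer's last state forward.
    state = state.reindex(state.index.union(day_ends)).ffill()

    # NaN (customer not seen yet) is skipped by sum, so it counts as not enabled.
    counts = state.loc[day_ends].sum(axis=1).astype(int)
    counts.index = counts.index.normalize().rename("day")
    return counts.rename("count_enabled_customers")


df = pd.DataFrame({
    "customer": ["A", "A", "A", "A", "A", "B", "C", "C"],
    "datetime": pd.to_datetime([
        "2020-01-01 00:00:00", "2020-01-02 00:00:00", "2020-01-02 01:00:00",
        "2020-01-03 00:00:00", "2020-01-04 00:00:00", "2020-01-03 00:00:00",
        "2020-01-03 00:00:00", "2020-01-04 00:00:00",
    ]),
    "enabled": [True, True, False, True, True, True, False, True],
})

result = count_enabled_per_day(df)
print(result)
```

On the sample data this gives the counts 1, 0, 2, 3 for 2020-01-01 through 2020-01-04, matching the expected output in the question.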

3
@davidfdez @jamesschofield df.set_index(['datetime', 'customer'])['enabled'].unstack().ffill().resample('D').last().sum(axis=1) does it in one line. - Scott Boston
2
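@ScottBoston, I think that is an overall better answer. I would still keep the pivot, though, since in my opinion it makes the code more explicit than df.set_index(['datetime', 'customer'])['enabled'].unstack() - James Schofield
Actually, the pivot has no effect when resampling, so I think this is the best answer. Upvoted. - James Schofield
Combining both suggestions, we can also do: df.pivot_table(index='datetime', columns='customer', values="enabled").ffill().resample('D').last().sum(axis=1). Both ideas are great, thanks everyone! - david fdez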
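
The resample-based one-liners from the comments can be written out as a runnable sketch. Two details here are my own assumptions and not part of the comments: the explicit `.astype(float)`, which keeps the sum numeric regardless of how the unstacked booleans are stored, and moving `ffill()` after `resample('D').last()`. With that ordering, a calendar day with no rows at all inherits the previous day's state instead of being left empty.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["A", "A", "A", "A", "A", "B", "C", "C"],
    "datetime": pd.to_datetime([
        "2020-01-01 00:00:00", "2020-01-02 00:00:00", "2020-01-02 01:00:00",
        "2020-01-03 00:00:00", "2020-01-04 00:00:00", "2020-01-03 00:00:00",
        "2020-01-03 00:00:00", "2020-01-04 00:00:00",
    ]),
    "enabled": [True, True, False, True, True, True, False, True],
})

# One column per customer; 1.0 = enabled, 0.0 = disabled, NaN = no event.
wide = df.set_index(["datetime", "customer"])["enabled"].unstack().astype(float)

# Last recorded state per customer per day, carried forward across days.
daily = wide.resample("D").last().ffill()

# NaN means the customer has not appeared yet and is skipped by sum.
counts = daily.sum(axis=1).astype(int)
print(counts)
```

On the sample data the counts are 1, 0, 2, 3 for the four days, the same as the step-by-step answer without its leading 2019-12-31 row.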
