Excluding certain columns when resampling specific columns with pandas resample

Question

3

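The following is a simplified version of the problem.

I have a data frame with three columns: the date a state begins, the state itself, and an indicator field. It looks something like this: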

import pandas as pd

df = pd.DataFrame(
    {'begin': pd.to_datetime(['2018-01-05', '2018-07-11', '2018-11-14', '2019-02-19']),
     'state': [1, 2, 3, 4],
     'started': [1, 0, 0, 0]
     }
)

df

       begin  state  started
0 2018-01-05      1        1
1 2018-07-11      2        0
2 2018-11-14      3        0
3 2019-02-19      4        0

I want to resample the dates to a monthly frequency, which I do as follows:

df = df.set_index('begin', drop=False).resample('M').ffill()  # month-end; the alias is 'ME' on pandas 2.2+

df
                begin  state  started
begin                                
2018-01-31 2018-01-05      1        1
2018-02-28 2018-01-05      1        1
2018-03-31 2018-01-05      1        1
2018-04-30 2018-01-05      1        1
2018-05-31 2018-01-05      1        1
2018-06-30 2018-01-05      1        1
2018-07-31 2018-07-11      2        0
2018-08-31 2018-07-11      2        0
2018-09-30 2018-07-11      2        0
2018-10-31 2018-07-11      2        0
2018-11-30 2018-11-14      3        0
2018-12-31 2018-11-14      3        0
2019-01-31 2018-11-14      3        0
2019-02-28 2019-02-19      4        0

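Everything looks fine except for the indicator column (started). I want it to be 1 only at the first occurrence, exactly as in the original data frame.

The desired output is: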

                begin  state  started
begin                                
2018-01-31 2018-01-05      1        1
2018-02-28 2018-01-05      1        0
2018-03-31 2018-01-05      1        0
2018-04-30 2018-01-05      1        0
2018-05-31 2018-01-05      1        0
2018-06-30 2018-01-05      1        0
2018-07-31 2018-07-11      2        0
2018-08-31 2018-07-11      2        0
2018-09-30 2018-07-11      2        0
2018-10-31 2018-07-11      2        0
2018-11-30 2018-11-14      3        0
2018-12-31 2018-11-14      3        0
2019-01-31 2018-11-14      3        0
2019-02-28 2019-02-19      4        0

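So, for a given combination of begin and state, if started is 1, it should be 1 only at the first occurrence of that combination.

Is there an efficient way to achieve this?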

- Gerges

2 Answers

1

If the started column contains only 1 and 0, use DataFrame.duplicated on the resampled frame and pass both columns in a list:

mask = df.duplicated(['begin','started'])
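To make it clearer what that mask contains, here is a minimal sketch on a made-up frame (`toy` and its values are purely illustrative, not the asker's data). DataFrame.duplicated flags every row whose (begin, started) pair has already appeared further up:

```python
import pandas as pd

# Toy frame shaped like the resampled data: 'begin' values repeat after ffill.
toy = pd.DataFrame({
    'begin': pd.to_datetime(['2018-01-05', '2018-01-05', '2018-07-11', '2018-07-11']),
    'started': [1, 1, 0, 0],
})

# True for every row whose (begin, started) pair was already seen above it.
mask = toy.duplicated(['begin', 'started'])
print(mask.tolist())  # [False, True, False, True]
```

Setting started to 0 wherever the mask is True leaves only the first row of each pair untouched, which is harmless for the rows that are already 0.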

A more general solution is also possible: chain another mask so that only the values equal to 1 are rewritten:

mask = df.duplicated(['begin','started']) & df['started'].eq(1)

df.loc[mask, 'started'] = 0

Or, with NumPy (import numpy as np):

df['started'] = np.where(mask, 0, df['started'])

print(df)
                begin  state  started
begin                                
2018-01-31 2018-01-05      1        1
2018-02-28 2018-01-05      1        0
2018-03-31 2018-01-05      1        0
2018-04-30 2018-01-05      1        0
2018-05-31 2018-01-05      1        0
2018-06-30 2018-01-05      1        0
2018-07-31 2018-07-11      2        0
2018-08-31 2018-07-11      2        0
2018-09-30 2018-07-11      2        0
2018-10-31 2018-07-11      2        0
2018-11-30 2018-11-14      3        0
2018-12-31 2018-11-14      3        0
2019-01-31 2018-11-14      3        0
2019-02-28 2019-02-19      4        0
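
For anyone who wants to run the whole thing end to end, a self-contained sketch of the steps above might look like this. The name `monthly` is my own, and I pass a `pd.offsets.MonthEnd()` object to resample instead of a string alias only because the alias spelling changed between pandas versions; that substitution is an assumption of this sketch, not part of the original snippet.

```python
import pandas as pd

df = pd.DataFrame({
    'begin': pd.to_datetime(['2018-01-05', '2018-07-11', '2018-11-14', '2019-02-19']),
    'state': [1, 2, 3, 4],
    'started': [1, 0, 0, 0],
})

# Upsample to month-end rows and forward-fill every column, as in the question.
monthly = df.set_index('begin', drop=False).resample(pd.offsets.MonthEnd()).ffill()

# Repeated (begin, started) pairs that carry a 1 are the forward-filled copies.
mask = monthly.duplicated(['begin', 'started']) & monthly['started'].eq(1)
monthly.loc[mask, 'started'] = 0

print(monthly)
```

Only the first month-end row keeps started equal to 1; the state and begin columns stay forward-filled.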

- jezrael

- U13-Forward · Accepted Answer

Could you try:

df = df.set_index('begin', drop=False).resample('M').ffill()  # month-end; the alias is 'ME' on pandas 2.2+
df.loc[df['started'].duplicated(keep='first'), 'started'] = 0
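
One caveat, sketched on a made-up frame (`toy`, `single` and `pair` are illustrative names only): Series.duplicated on started alone keeps just the first 1 in the whole column, so it matches the two-column mask only when a single state carries started == 1. If several states could be flagged, the two approaches diverge:

```python
import pandas as pd

# Hypothetical resampled-style data in which two different states start with 1.
toy = pd.DataFrame({
    'begin': pd.to_datetime(['2018-01-05', '2018-01-05', '2018-07-11',
                             '2018-07-11', '2018-11-14', '2018-11-14']),
    'started': [1, 1, 0, 0, 1, 1],
})

# Single-column approach: every repeat of a value in 'started' is zeroed.
single = toy.copy()
single.loc[single['started'].duplicated(keep='first'), 'started'] = 0

# Two-column approach: only repeats within the same 'begin' are zeroed.
pair = toy.copy()
mask = pair.duplicated(['begin', 'started']) & pair['started'].eq(1)
pair.loc[mask, 'started'] = 0

print(single['started'].tolist())  # [1, 0, 0, 0, 0, 0]
print(pair['started'].tolist())    # [1, 0, 0, 0, 1, 0]
```

For the data in the question both give the same result, since only the first state has started set to 1.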