Pandas: reindexing dates within groups

Question

22

I have a DataFrame with a sporadic date index and the columns 'id' and 'num'. I would like to pd.groupby the 'id' column and apply a reindex to each group in the DataFrame.

My sample data set looks like this:

            id  num
2015-08-01  1   3
2015-08-05  1   5
2015-08-06  1   4
2015-07-31  2   1
2015-08-03  2   2
2015-08-06  2   3
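
For anyone wanting to reproduce this, the sample above could be built along these lines. This is only a sketch: it assumes the index is a real DatetimeIndex (created here with pd.to_datetime), which the question does not state explicitly.

```python
import pandas as pd

# Assumption: the dates are parsed into a DatetimeIndex rather than kept as strings.
df = pd.DataFrame(
    {"id": [1, 1, 1, 2, 2, 2], "num": [3, 5, 4, 1, 2, 3]},
    index=pd.to_datetime(
        ["2015-08-01", "2015-08-05", "2015-08-06",
         "2015-07-31", "2015-08-03", "2015-08-06"]
    ),
)
print(df)
```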

My expected output, once pd.reindex with ffill has been applied, is:

            id  num
2015-08-01  1   3
2015-08-02  1   3
2015-08-03  1   3
2015-08-04  1   3
2015-08-05  1   5
2015-08-06  1   4
2015-07-31  2   1
2015-08-01  2   1
2015-08-02  2   1
2015-08-03  2   2
2015-08-04  2   2
2015-08-05  2   2
2015-08-06  2   3

I have tried this, among other things, to no avail:

newdf=df.groupby('id').reindex(method='ffill')

This returns the error: AttributeError: Cannot access callable attribute 'reindex' of 'DataFrameGroupBy' objects, try using the 'apply' method

Any help would be much appreciated.

- clg4

2 Answers

-1

import pandas as pd


df = pd.DataFrame(columns=['id','num'])
df['id'] = [1,1,1,2,2,2]
df['num'] = [3,5,4,1,2,3]
df['date'] = pd.date_range('1990-07-31', periods=6, freq='D')
print(df)
"""
   id  num       date
0   1    3 1990-07-31
1   1    5 1990-08-01
2   1    4 1990-08-02
3   2    1 1990-08-03
4   2    2 1990-08-04
5   2    3 1990-08-05

"""


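# Use the dates as the index.
df = df.set_index('date')

# Repeat each date label 'num' times, so each row appears 'num' times.
df = df.reindex(df.index.repeat(df['num']), method='ffill')

# Number the copies of each date 0, 1, 2, ...
df['num_count'] = df.groupby(level=0).cumcount()

df = df.reset_index()

print(df)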
"""
         date  id  num  num_count
0  1990-07-31   1    3          0
1  1990-07-31   1    3          1
2  1990-07-31   1    3          2
3  1990-08-01   1    5          0
4  1990-08-01   1    5          1
5  1990-08-01   1    5          2
6  1990-08-01   1    5          3
7  1990-08-01   1    5          4
8  1990-08-02   1    4          0
9  1990-08-02   1    4          1
10 1990-08-02   1    4          2
11 1990-08-02   1    4          3
12 1990-08-03   2    1          0
13 1990-08-04   2    2          0
14 1990-08-04   2    2          1
15 1990-08-05   2    3          0
16 1990-08-05   2    3          1
17 1990-08-05   2    3          2
"""

- Soudipta Dutta

- JoeCondron · Accepted Answer

36

There is probably a slicker way to do this, but this works:

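def reindex_by_date(df):
    # Build a daily range spanning this group's first and last dates.
    dates = pd.date_range(df.index.min(), df.index.max())
    # Add the missing dates as NaN rows, then forward-fill them.
    return df.reindex(dates).ffill()

# Apply per 'id' group, then drop the group key level that apply adds to the index.
df.groupby('id').apply(reindex_by_date).reset_index(0, drop=True)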
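
As a sketch that is not part of the original answer, the same idea can be run end to end on the question's sample data. It swaps groupby.apply for an explicit loop over the groups plus pd.concat, because the handling of the grouping column inside groupby.apply has changed in recent pandas releases; it also assumes each group's dates are already sorted, which reindex with method='ffill' requires.

```python
import pandas as pd

# Sample data from the question, with an assumed DatetimeIndex.
df = pd.DataFrame(
    {"id": [1, 1, 1, 2, 2, 2], "num": [3, 5, 4, 1, 2, 3]},
    index=pd.to_datetime(
        ["2015-08-01", "2015-08-05", "2015-08-06",
         "2015-07-31", "2015-08-03", "2015-08-06"]
    ),
)

def reindex_by_date(group):
    # Daily range from the group's first date to its last date.
    dates = pd.date_range(group.index.min(), group.index.max())
    # Forward-fill directly during the reindex (needs a sorted index).
    return group.reindex(dates, method="ffill")

# Iterating over the groupby yields (key, sub-frame) pairs; each sub-frame
# still contains the 'id' column, so it is forward-filled along with 'num'.
result = pd.concat([reindex_by_date(g) for _, g in df.groupby("id")])
print(result)
```

Filling inside reindex, rather than calling .ffill() afterwards, avoids introducing NaN rows, so the integer columns keep their dtype.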

- JoeCondron

This works. Awesome. It takes a bit of time, but I can't think of a faster, more Pythonic way to do this. Thanks very much. - clg4

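A slightly different way would be to write the return part of the function as return df.resample('D').fillna(method='ffill'). The added benefit is that if you only want business days, you can change the 'D' in the resample part to 'B' (depending on your needs, of course). - Pilik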

Are you sure Pilik's method works? I initially tried resample, but it didn't add the missing dates. You can also get different periods from pd.date_range, e.g. pd.date_range(.., freq='B') for business days. - JoeCondron

@JoeCondron, you're right, I didn't know pd.date_range had a frequency/offset option. I copied the example using pd.read_clipboard(), and my solution using resample also produced the desired result. - Pilik
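
The two variants discussed in these comments can be compared on a single group. This is a rough sketch, assuming a DatetimeIndex; the group frame is a hypothetical stand-in for the id == 2 rows of the question. It uses Resampler.ffill() for the resample route and passes the business-day alias to pd.date_range through its freq argument.

```python
import pandas as pd

# Hypothetical single group (the id == 2 rows from the question's sample).
group = pd.DataFrame(
    {"id": [2, 2, 2], "num": [1, 2, 3]},
    index=pd.to_datetime(["2015-07-31", "2015-08-03", "2015-08-06"]),
)

# Variant 1: upsample to daily frequency and forward-fill the new rows.
daily = group.resample("D").ffill()

# Variant 2: business days only, via a date range built with freq="B".
bdays = pd.date_range(group.index.min(), group.index.max(), freq="B")
business = group.reindex(bdays, method="ffill")

print(daily)
print(business)
```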

1

This solution doesn't work for me: all rows become NaN. If I change the function to use df.reindex(dates, method='ffill'), it gives me a TypeError: Cannot compare type 'Timestamp' with type 'str'. - Giacomo

1

@giac_man, it sounds like you have date-like strings in your index. A DatetimeIndex looks the same as an index containing strings of the form 'YYYY-MM-DD'. You can convert it using pd.to_datetime. - JoeCondron
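
A minimal sketch of that conversion, using a hypothetical two-row frame whose index holds strings rather than Timestamps:

```python
import pandas as pd

# Hypothetical frame whose index holds date-like strings, not Timestamps.
df = pd.DataFrame({"num": [3, 5]}, index=["2015-08-01", "2015-08-03"])
print(type(df.index))

# Convert the string labels to a DatetimeIndex.
df.index = pd.to_datetime(df.index)
print(type(df.index))

# Reindexing against a date range now matches the existing rows.
dates = pd.date_range(df.index.min(), df.index.max())
filled = df.reindex(dates, method="ffill")
print(filled)
```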