Pythonic way to fill rows with a date range

8
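I am working on a problem that requires me to fill in rows for the missing dates (i.e. the dates between the two dates held in two columns of a pandas DataFrame). Please see the example below. I am using pandas for my current approach (described below). Input data example (there are about 25000 rows):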
A  | B  | C  | Date1    | Date2
a1 | b1 | c1 | 1Jan1990 | 15Aug1990 <- this row should be repeated for all dates between the two dates
.......................
a3 | b3 | c3 | 11May1986 | 11May1986 <- this row should NOT be repeated. Just 1 entry since both dates are same.
.......................
a5 | b5 | c5 | 1Dec1984 | 31Dec2017 <- this row should be repeated for all dates between the two dates
..........................
..........................

Expected output:
A  | B  | C  | Month    | Year
a1 | b1 | c1 | 1        | 1990  <- Since date 1 column for this row was Jan 1990
a1 | b1 | c1 | 2        | 1990    
.......................
.......................
a1 | b1 | c1 | 7        | 1990  
a1 | b1 | c1 | 8        | 1990  <- Since date 2 column for this row was Aug 1990
..........................
a3 | b3 | c3 | 5        | 1986  <- only 1 row since two dates in input dataframe were same for this row.
...........................
a5 | b5 | c5 | 12       | 1984 <- since date 1 column for this row was Dec 1984
a5 | b5 | c5 | 1        | 1985 
..........................
..........................
a5 | b5 | c5 | 11       | 2017 
a5 | b5 | c5 | 12       | 2017 <- Since date 2 column for this row was Dec 2017

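I am aware of a more traditional way to achieve this (my current approach):
  • Iterate over each row.
  • Get the difference in days between the two date columns.
  • If the dates in both columns are the same, include just one row for that month and year in the output DataFrame.
  • If the dates are different (difference > 0), get all (month, year) combinations covered by that row's date range and append them to the new DataFrame.
Since the input data has about 25000 rows, I believe the output data will be extremely large, so I am looking for a more Pythonic way to achieve this (if possible, and faster than the iterative approach).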
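For reference, the row-by-row approach described in the bullets above can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the helper name `expand_rows_naive` is hypothetical, and the `%d%b%Y` date format is an assumption based on the sample data.

```python
import pandas as pd

def expand_rows_naive(df):
    """Hypothetical baseline: one output row per (month, year) from Date1 to Date2."""
    rows = []
    for _, row in df.iterrows():
        # Assumed date format, e.g. '1Jan1990'
        start = pd.to_datetime(row['Date1'], format='%d%b%Y')
        end = pd.to_datetime(row['Date2'], format='%d%b%Y')
        year, month = start.year, start.month
        # Walk month by month until the month of Date2 has been emitted
        while (year, month) <= (end.year, end.month):
            rows.append((row['A'], row['B'], row['C'], month, year))
            month += 1
            if month > 12:
                month, year = 1, year + 1
    return pd.DataFrame(rows, columns=['A', 'B', 'C', 'Month', 'Year'])

sample = pd.DataFrame([('a1', 'b1', 'c1', '1Jan1990', '15Aug1990'),
                       ('a3', 'b3', 'c3', '11May1986', '11May1986')],
                      columns=['A', 'B', 'C', 'Date1', 'Date2'])
out = expand_rows_naive(sample)
print(out)
```

Comparing (year, month) pairs rather than day differences means a row whose two dates fall in the same month produces exactly one output row.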

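So, based on your expected output, you want one row added for each month between the dates, right? - Nick
Yes, but including the months of the two date columns themselves. I have also updated the input and output to cover the case where both dates are the same. - Saurabh Gokhale
I'm voting to close this question because code that works but could be improved belongs on codereview.stackexchange.com, not stackoverflow.com. - Eric Renouf
3 Answers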

8

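I think the best tool to use here is a PeriodIndex (to generate the months and years between the dates).

However, a PeriodIndex can only be built for one row of data at a time, so if we are going to use it, each row has to be processed individually. Unfortunately, that means looping over the rows of the DataFrame: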

import pandas as pd
df = pd.DataFrame([('a1','b1','c1','1Jan1990','15Aug1990'),
                   ('a3','b3','c3','11May1986','11May1986'),
                   ('a5','b5','c5','1Dec1984','31Dec2017')],
                  columns=['A','B','C','Date1','Date2'])

result = []
for tup in df.itertuples():
    # period_range builds the monthly PeriodIndex for this row
    # (PeriodIndex itself no longer accepts start/end arguments in current pandas)
    index = pd.period_range(start=tup.Date1, end=tup.Date2, freq='M')
    # The scalar values are broadcast to every period in the index
    new_df = pd.DataFrame({'A': tup.A, 'B': tup.B, 'C': tup.C}, index=index)
    new_df['Month'] = new_df.index.month
    new_df['Year'] = new_df.index.year
    result.append(new_df)
result = pd.concat(result, axis=0)
print(result)

yields
          A   B   C  Month  Year
1990-01  a1  b1  c1      1  1990    <--- Beginning of row 1
1990-02  a1  b1  c1      2  1990
1990-03  a1  b1  c1      3  1990
1990-04  a1  b1  c1      4  1990
1990-05  a1  b1  c1      5  1990
1990-06  a1  b1  c1      6  1990
1990-07  a1  b1  c1      7  1990
1990-08  a1  b1  c1      8  1990    <--- End of row 1
1986-05  a3  b3  c3      5  1986    <--- Beginning and End of row 2
1984-12  a5  b5  c5     12  1984    <--- Beginning row 3
1985-01  a5  b5  c5      1  1985
1985-02  a5  b5  c5      2  1985
1985-03  a5  b5  c5      3  1985
1985-04  a5  b5  c5      4  1985
...      ..  ..  ..    ...   ...
2017-09  a5  b5  c5      9  2017
2017-10  a5  b5  c5     10  2017
2017-11  a5  b5  c5     11  2017
2017-12  a5  b5  c5     12  2017    <--- End of row 3

[406 rows x 5 columns]

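Note that you may not really need to define the Month and Year columns
new_df['Month'] = new_df.index.month
new_df['Year'] = new_df.index.year

since you already have a PeriodIndex, which makes computing the months and years very easy.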
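As a small sketch of that point (the variable names `idx`, `demo` and `summer` are made up for this example), the month and year can be read straight off a PeriodIndex whenever they are needed, so storing them as columns is optional:

```python
import pandas as pd

# A monthly PeriodIndex like the one built for the first sample row
idx = pd.period_range(start='1990-01', end='1990-08', freq='M')
demo = pd.DataFrame({'A': 'a1', 'B': 'b1', 'C': 'c1'}, index=idx)

# Month and year are available on the index itself
print(demo.index.month.tolist())
print(demo.index.year.unique().tolist())

# Filtering by month works without a dedicated Month column
summer = demo[demo.index.month.isin([6, 7, 8])]
print(summer)
```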


Although your example does have to iterate over every row, it works very well (and is clearly faster than my current approach). Thanks! - Saurabh Gokhale

4
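Here is another approach, using a couple of helper comprehensions and numpy.repeat.
import numpy as np
import pandas as pd

# Parse the date strings once up front
d1 = pd.to_datetime(df['Date1'])
d2 = pd.to_datetime(df['Date2'])

# Number of calendar months each row spans, counting both end months.
# (Dividing a timedelta by np.timedelta64(1, 'M') is not supported in recent pandas.)
repeats = (d2.dt.year - d1.dt.year) * 12 + (d2.dt.month - d1.dt.month) + 1

# One monthly period per output row
periods = [p for d, r in zip(d1, repeats)
           for p in pd.period_range(start=d, periods=r, freq='M')]

new_df = (pd.DataFrame(
            np.repeat(df.values, repeats, axis=0),
            columns=df.columns,
            index=pd.PeriodIndex(periods, freq='M'))
          .assign(month=[x.month for x in periods],
                  year=[x.year for x in periods])
          .drop(['Date1', 'Date2'], axis=1))

print(new_df)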

[out]
          A   B   C  month  year
1990-01  a1  b1  c1      1  1990
1990-02  a1  b1  c1      2  1990
1990-03  a1  b1  c1      3  1990
1990-04  a1  b1  c1      4  1990
1990-05  a1  b1  c1      5  1990
1990-06  a1  b1  c1      6  1990
1990-07  a1  b1  c1      7  1990
1990-08  a1  b1  c1      8  1990
1986-05  a3  b3  c3      5  1986
1984-12  a5  b5  c5     12  1984
1985-01  a5  b5  c5      1  1985
1985-02  a5  b5  c5      2  1985
1985-03  a5  b5  c5      3  1985
1985-04  a5  b5  c5      4  1985
1985-05  a5  b5  c5      5  1985
1985-06  a5  b5  c5      6  1985
1985-07  a5  b5  c5      7  1985
1985-08  a5  b5  c5      8  1985
1985-09  a5  b5  c5      9  1985
1985-10  a5  b5  c5     10  1985
1985-11  a5  b5  c5     11  1985
1985-12  a5  b5  c5     12  1985
...
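
The `repeats` value is the number of calendar months each row spans, counting both end months. As a sketch (my own variation, with the illustrative names `month_span`, `d1` and `d2`, and assuming the `%d%b%Y` format of the sample data), it can be computed from the year and month components, which does not depend on how many days a "month" is taken to have:

```python
import pandas as pd

def month_span(start, end):
    """Inclusive count of calendar months between two datetime Series."""
    return (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month) + 1

# The three sample rows plus one edge case: two dates one day apart
# that fall in different months
d1 = pd.to_datetime(pd.Series(['1Jan1990', '11May1986', '1Dec1984', '31Jan2000']),
                    format='%d%b%Y')
d2 = pd.to_datetime(pd.Series(['15Aug1990', '11May1986', '31Dec2017', '1Feb2000']),
                    format='%d%b%Y')

spans = month_span(d1, d2)
print(spans.tolist())  # → [8, 1, 397, 2]
```

The last pair is the case where a day-based count would give 1, while two months (January and February) are actually touched.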

3

Given the sample data

import pandas as pd
df = pd.DataFrame({'Date1': ["1Jan1990", "11May1986", "1Dec1984"],
                   'Date2': ["5Jul1990", "11May1986", "7Apr1985"],
                   'A': ['a1', 'a3', 'a5'],
                   'B': ['b1', 'b3', 'b5'],
                   'C': ['c1', 'c3', 'c5'],})  

here is a solution without explicit iteration

# Convert to pandas datetime
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])

# Split and stack by dates
df = pd.concat([df.drop(columns='Date2').rename(columns={'Date1': 'Date'}),
                df.drop(columns='Date1').rename(columns={'Date2': 'Date'})])
df = df.drop_duplicates().set_index('Date')

# Break down by dates
df = (df.groupby(['A', 'B', 'C'], as_index=False)
      .resample('M') # with end of month interval (spelled 'ME' in pandas >= 2.2)
      .ffill() # propagating everything else forward
      .reset_index(level=0, drop=True)) # getting rid of auxiliary index

# Get the year and a month
df['Year'] = df.index.year
df['Month'] = df.index.month

resulting in
             A   B   C  Year  Month
Date                               
1990-01-31  a1  b1  c1  1990      1
1990-02-28  a1  b1  c1  1990      2
1990-03-31  a1  b1  c1  1990      3
1990-04-30  a1  b1  c1  1990      4
1990-05-31  a1  b1  c1  1990      5
1990-06-30  a1  b1  c1  1990      6
1990-07-31  a1  b1  c1  1990      7
1986-05-31  a3  b3  c3  1986      5
1984-12-31  a5  b5  c5  1984     12
1985-01-31  a5  b5  c5  1985      1
1985-02-28  a5  b5  c5  1985      2
1985-03-31  a5  b5  c5  1985      3
1985-04-30  a5  b5  c5  1985      4
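
To see why stacking the two dates and forward-filling works, here is a minimal sketch for a single group. The names `g` and `filled` are made up for this example, and `pd.offsets.MonthEnd()` is passed as the resample rule instead of a string alias, since the month-end alias string differs between pandas versions:

```python
import pandas as pd

# Two rows for one group: one at the start date, one at the end date
g = pd.DataFrame({'A': ['a5', 'a5']},
                 index=pd.to_datetime(['1984-12-01', '1985-04-07']))

# Resample to month-end bins; ffill fills the months that have no row
# of their own from the most recent earlier row
filled = g.resample(pd.offsets.MonthEnd()).ffill()
print(filled)
```

Each month between the two dates gets one month-end timestamp, which is where the Year and Month values in the result come from.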
