pandas - How do I extend a DataFrame's index and set all columns of the new rows to NaN?

41

I have time-indexed data:

from datetime import date
import pandas as pd

df2 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b' : pd.Series([0.22, 0.3]) })
df2 = df2.set_index('day')
df2
               b
 day             
2012-01-01  0.22
2012-01-03  0.30

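What is the best way to extend this DataFrame so that it has one row for every day in January 2012 (say), with all columns (here only b) set to NaN where there is no data?

So the desired result would be: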

               b
 day             
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30
2012-01-04   NaN
...
2012-01-31   NaN

Many thanks!

6 Answers

37

Use this (current as of pandas 1.1.3):

ix = pd.date_range(start=date(2012, 1, 1), end=date(2012, 1, 31), freq='D')
df2.reindex(ix)

This gives:

               b
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30
2012-01-04   NaN
2012-01-05   NaN
[...]
2012-01-29   NaN
2012-01-30   NaN
2012-01-31   NaN

For older versions of pandas, replace pd.date_range with pd.DatetimeIndex.
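For readers who want to run the reindex approach end to end, here is a minimal sketch using the question's data. The pd.to_datetime conversion is an assumption on my part, not part of the answer: it turns the datetime.date labels into a DatetimeIndex so they line up with the labels that pd.date_range produces.

```python
from datetime import date

import pandas as pd

# The question's frame: two dated rows, one column.
df2 = pd.DataFrame({"day": [date(2012, 1, 1), date(2012, 1, 3)], "b": [0.22, 0.3]})
df2 = df2.set_index("day")

# Assumption: convert the date objects to a DatetimeIndex so the labels
# match the Timestamps generated by pd.date_range.
df2.index = pd.to_datetime(df2.index)

# One label per day in January 2012; labels missing from df2 become NaN rows.
ix = pd.date_range(start="2012-01-01", end="2012-01-31", freq="D")
full = df2.reindex(ix)

print(full.head())
```

The result has one row per day of the month, and only the two original dates carry a value in b.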


7
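You can convert the data to daily frequency with asfreq. As long as you don't specify a fill method (the method parameter in current pandas), days without data are filled with NaN, as you want: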
df3 = df2.asfreq('D')
df3

Out[16]:
               b
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30

To answer the second part of your question, I can't think of a more elegant way at the moment:
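df3 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 4), date(2012, 1, 31)])})
df3 = df3.set_index('day')
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
merged = pd.concat([df2, df3])
# make sure the index is a DatetimeIndex before calling asfreq
merged.index = pd.to_datetime(merged.index)
merged = merged.asfreq('D')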
merged


Out[46]:
               b
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30
2012-01-04   NaN
2012-01-05   NaN
2012-01-06   NaN
2012-01-07   NaN
2012-01-08   NaN
2012-01-09   NaN
2012-01-10   NaN
2012-01-11   NaN
2012-01-12   NaN
2012-01-13   NaN
2012-01-14   NaN
2012-01-15   NaN
2012-01-16   NaN
2012-01-17   NaN
2012-01-18   NaN
2012-01-19   NaN
2012-01-20   NaN
2012-01-21   NaN
2012-01-22   NaN
2012-01-23   NaN
2012-01-24   NaN
2012-01-25   NaN
2012-01-26   NaN
2012-01-27   NaN
2012-01-28   NaN
2012-01-29   NaN
2012-01-30   NaN
2012-01-31   NaN

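This builds a second time series; we then just append it and call asfreq('D') as before.

Thanks, this fills the gaps nicely, but how do I extend it to 2012-01-31 (say)? - paul
Hmm. But if my original time series has several holes/gaps, this approach no longer works. - paul
@paul Yes, my answer is limited in that respect, and I can't think of a more general method. If you can, it would be better to create the DataFrame with all the expected values in the first place. I'll play around and see if I can come up with something better. - EdChum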
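The limitation discussed in these comments can be seen in a small sketch. The data below is hypothetical (three rows with two separate gaps): asfreq only fills between the first and last existing labels, while reindex against an explicit date range both fills every gap and extends to the end of the month.

```python
import pandas as pd

# Hypothetical data with two separate gaps (Jan 2 and Jan 4-5).
s = pd.DataFrame(
    {"b": [0.22, 0.30, 0.41]},
    index=pd.to_datetime(["2012-01-01", "2012-01-03", "2012-01-06"]),
)

# asfreq fills gaps only between the first and last existing labels.
between = s.asfreq("D")

# reindex against an explicit range fills the gaps and extends to Jan 31.
whole_month = s.reindex(pd.date_range("2012-01-01", "2012-01-31", freq="D"))

print(len(between), len(whole_month))
```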

3
Here is another option: first add a NaN record at the last day you want, then resample. Resampling will then fill in the missing dates for you.
Starting frame:
import pandas as pd
import numpy as np
from datetime import date

df2 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b' : pd.Series([0.22, 0.3]) })
df2= df2.set_index('day')
df2

Out:
                  b
    day 
    2012-01-01  0.22
    2012-01-03  0.30

Filled Frame:

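# set_value (pandas) and np.float (NumPy) have since been removed; .loc and np.nan do the same job
df2.loc[date(2012, 1, 31), 'b'] = np.nan
# make sure the index is a DatetimeIndex before calling asfreq
df2.index = pd.to_datetime(df2.index)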
df2.asfreq('D')

Out:
                b
    day 
    2012-01-01  0.22
    2012-01-02  NaN
    2012-01-03  0.30
    2012-01-04  NaN
    2012-01-05  NaN
    2012-01-06  NaN
    2012-01-07  NaN
    2012-01-08  NaN
    2012-01-09  NaN
    2012-01-10  NaN
    2012-01-11  NaN
    2012-01-12  NaN
    2012-01-13  NaN
    2012-01-14  NaN
    2012-01-15  NaN
    2012-01-16  NaN
    2012-01-17  NaN
    2012-01-18  NaN
    2012-01-19  NaN
    2012-01-20  NaN
    2012-01-21  NaN
    2012-01-22  NaN
    2012-01-23  NaN
    2012-01-24  NaN
    2012-01-25  NaN
    2012-01-26  NaN
    2012-01-27  NaN
    2012-01-28  NaN
    2012-01-29  NaN
    2012-01-30  NaN
    2012-01-31  NaN

3

Mark's answer no longer seems to work on pandas 1.1.1.

However, using the same idea, the following works:

from datetime import datetime
import pandas as pd


# get start and desired end dates
first_date = df['date'].min()
today = datetime.today()

# set index
df.set_index('date', inplace=True)

# and here is where the magic happens
idx = pd.date_range(first_date, today, freq='D')
df = df.reindex(idx)

Edit: I just found that this exact use case is covered in the docs:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html#pandas.DataFrame.reindex


1
import datetime

def extendframe(df, ndays):
    """
    (df, ndays) -> df that is padded by ndays in beginning and end
    """
    ixd = df.index - datetime.timedelta(ndays)
    ixu = df.index + datetime.timedelta(ndays)
    ixx = df.index.union(ixd.union(ixu))
    df_ = df.reindex(ixx)
    return df_
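A short usage sketch may help show what this function actually adds. The sample frame is hypothetical and assumes a DatetimeIndex. Note that the function shifts each existing label by exactly ndays in both directions, so it adds those shifted labels only, not every day in between.

```python
import datetime

import pandas as pd


def extendframe(df, ndays):
    """Return df reindexed on its own labels plus each label shifted by +/- ndays."""
    ixd = df.index - datetime.timedelta(ndays)
    ixu = df.index + datetime.timedelta(ndays)
    ixx = df.index.union(ixd.union(ixu))
    return df.reindex(ixx)


# Hypothetical sample frame with a DatetimeIndex.
df = pd.DataFrame(
    {"b": [0.22, 0.30]},
    index=pd.to_datetime(["2012-01-01", "2012-01-03"]),
)

padded = extendframe(df, 2)
print(padded)
```

With ndays=2 the resulting labels are 2011-12-30, 2012-01-01, 2012-01-03 and 2012-01-05. If a row for every day is needed, reindexing against a pd.date_range built from the padded minimum and maximum would be one way to get it.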

0

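It is not an issue here, because you know the second index is all the days in January. But suppose you have another index, say from another DataFrame df1, that may be non-contiguous and have an irregular frequency. In that case you can do this: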

ix = pd.DatetimeIndex(list(df2.index) + list(df1.index)).unique().sort_values()
df2.reindex(ix)

Converting the indexes to lists gives you a natural way to build one longer list.
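As a possible alternative to going through lists, Index.union can combine the two indexes directly. The sketch below uses hypothetical data for df1 and assumes both frames have a DatetimeIndex.

```python
import pandas as pd

df2 = pd.DataFrame(
    {"b": [0.22, 0.30]},
    index=pd.to_datetime(["2012-01-01", "2012-01-03"]),
)

# df1 is a hypothetical second frame with irregular, unsorted dates.
df1 = pd.DataFrame(
    {"c": [1, 2, 3]},
    index=pd.to_datetime(["2012-01-03", "2012-01-02", "2012-01-10"]),
)

# Union of both indexes; labels present in both appear once,
# and the result is sorted by default.
ix = df2.index.union(df1.index)
out = df2.reindex(ix)

print(out)
```

Rows whose label comes only from df1 get NaN in b, the same as with the list-based version.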
