How to combine multiple date columns into one in pandas?

I have a DataFrame with several date columns, each followed by its own value column:

date         value_1      date        value_2    date         value_3
01-01-1990   1            01-01-1990  2          02-01-1990   4
02-01-1990   3            03-01-1990  20         
                          04-01-1990  30  

Desired output: merge all the date columns into a single superset date column and show each value against its date.
date         value_1        value_2    value_3
01-01-1990   1              2          
02-01-1990   3                         4
03-01-1990                  20
04-01-1990                  30 
2 Answers

First, the date columns that share the same name have to be de-duplicated:

import numpy as np
import pandas as pd

s = df.columns.to_series()
mask = df.columns.duplicated(keep=False)
c = np.where(mask, s + '_'  + (s.groupby(s).cumcount() + 1).astype(str) , s)
df.columns = c
print (df)
       date_1  value_1      date_2  value_2      date_3  value_3
0  01-01-1990      1.0  01-01-1990        2  02-01-1990      4.0
1  02-01-1990      3.0  03-01-1990       20         NaN      NaN
2         NaN      NaN  04-01-1990       30         NaN      NaN
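
The renaming step above can also be sketched without any pandas-specific tricks. The helper name `dedupe_names` and the one-row sample frame below are illustrative assumptions, not part of the original answer; the idea is only to number every label that occurs more than once.

```python
from collections import Counter

import pandas as pd


def dedupe_names(names):
    """Append _1, _2, ... to every name that occurs more than once."""
    totals = Counter(names)   # how often each label occurs overall
    seen = Counter()          # how many times we have met it so far
    result = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f"{name}_{seen[name]}")
        else:
            result.append(name)  # unique labels are left untouched
    return result


# Hypothetical frame with three columns that are all called "date".
df = pd.DataFrame(
    [["01-01-1990", 1.0, "01-01-1990", 2, "02-01-1990", 4.0]],
    columns=["date", "value_1", "date", "value_2", "date", "value_3"],
)
df.columns = dedupe_names(list(df.columns))
print(list(df.columns))
```

Labels that are already unique, such as `value_1`, pass through unchanged, so the result matches the `date_1`, `date_2`, `date_3` naming used in the rest of this answer.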

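Then loop over the column groups produced by groupby with a lambda function, set each date column as the index, drop missing values, and finally join everything with concat: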

dfs = [x.set_index(x.columns[0]).dropna() 
       for i, x in df.groupby(lambda x: x.split('_')[1], axis=1)]
#print (dfs)

df2 = pd.concat(dfs, axis=1)
print (df2)
            value_1  value_2  value_3
01-01-1990      1.0      2.0      NaN
02-01-1990      3.0      NaN      4.0
03-01-1990      NaN     20.0      NaN
04-01-1990      NaN     30.0      NaN
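
Recent pandas releases deprecate `groupby(..., axis=1)`, so the same steps may need to be written as an explicit loop over the date/value pairs. The following is a minimal sketch under that assumption; the sample frame is rebuilt by hand from the printed data above, and the dates are kept as plain strings.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the printed frame above (columns already de-duplicated).
df = pd.DataFrame(
    {
        "date_1": ["01-01-1990", "02-01-1990", np.nan],
        "value_1": [1.0, 3.0, np.nan],
        "date_2": ["01-01-1990", "03-01-1990", "04-01-1990"],
        "value_2": [2, 20, 30],
        "date_3": ["02-01-1990", np.nan, np.nan],
        "value_3": [4.0, np.nan, np.nan],
    }
)

# Collect the numeric suffixes ("1", "2", "3") shared by each date/value pair.
suffixes = sorted({col.split("_")[1] for col in df.columns})

# One small frame per pair, indexed by its own date column.
dfs = [
    df[[f"date_{k}", f"value_{k}"]]
    .dropna()
    .set_index(f"date_{k}")
    .rename_axis("date")
    for k in suffixes
]

# Align the pieces on the union of all dates.
df2 = pd.concat(dfs, axis=1).sort_index()
print(df2)
```

Because the dates are strings here, `sort_index()` orders them lexicographically; converting them with `pd.to_datetime` first would be needed if true chronological order matters. The approach also assumes each date column has no repeated dates, since `concat` aligns on the index.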

EDIT:

The answer was changed to handle a date column followed by two columns of data values:

print (df)
  date_security  GH_LAST_PRICE Val  GH_VOLUME_PRICE Val date_security  \
0    01-01-1990                1.0                  7.0    01-01-1990   
1    01-02-1990                3.0                  8.0    03-01-1990   
2           NaN                NaN                  NaN    04-01-1990   

   DG_LAST_PRICE Val  DG_VOLUME_PRICE Val  
0                  2                 10.0  
1                 20                  NaN  
2                 30                  1.0 

Create a MultiIndex:
df.columns = [(np.arange(len(df.columns)) // 3).astype(str), df.columns]
print (df)
#              0                                                   1  \
  date_security GH_LAST_PRICE Val GH_VOLUME_PRICE Val date_security   
0    01-01-1990               1.0                 7.0    01-01-1990   
1    01-02-1990               3.0                 8.0    03-01-1990   
2           NaN               NaN                 NaN    04-01-1990   


  DG_LAST_PRICE Val DG_VOLUME_PRICE Val  
0                 2                10.0  
1                20                 NaN  
2                30                 1.0  

dfs = [x.set_index(x.columns[0]).dropna()
       for i, x in df.groupby(level=0, axis=1)]


df2 = pd.concat(dfs, axis=1)
#flatten MultiIndex
df2.columns = df2.columns.map('_'.join)
print (df2)
            0_GH_LAST_PRICE Val  0_GH_VOLUME_PRICE Val  1_DG_LAST_PRICE Val  \
01-01-1990                  1.0                    7.0                  2.0   
01-02-1990                  3.0                    8.0                  NaN   
04-01-1990                  NaN                    NaN                 30.0   

            1_DG_VOLUME_PRICE Val  
01-01-1990                   10.0  
01-02-1990                    NaN  
04-01-1990                    1.0  
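
As a rough alternative for the three-column layout, the blocks can be cut out by position with `iloc`, which sidesteps both the duplicated `date_security` label and `groupby(..., axis=1)`. This is only a sketch: the block width of 3 and the sample frame are taken from the printout above, and, unlike the code above, it drops only rows whose date is missing, so a row with a partially missing value is kept.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the printed frame above; note the repeated column name.
df = pd.DataFrame(
    [
        ["01-01-1990", 1.0, 7.0, "01-01-1990", 2, 10.0],
        ["01-02-1990", 3.0, 8.0, "03-01-1990", 20, np.nan],
        [np.nan, np.nan, np.nan, "04-01-1990", 30, 1.0],
    ],
    columns=[
        "date_security", "GH_LAST_PRICE Val", "GH_VOLUME_PRICE Val",
        "date_security", "DG_LAST_PRICE Val", "DG_VOLUME_PRICE Val",
    ],
)

block = 3  # assumed layout: one date column followed by two value columns
dfs = []
for start in range(0, df.shape[1], block):
    part = df.iloc[:, start:start + block]       # slice by position, not by label
    date_col = part.columns[0]
    part = part.dropna(subset=[date_col])        # drop rows without a date only
    dfs.append(part.set_index(date_col))

df2 = pd.concat(dfs, axis=1).sort_index()
print(df2)
```

With 30 columns the same loop would produce ten blocks, provided every block really has the date column first and the value column names are unique across blocks.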

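DataError: No numeric types to aggregate - even though I have already converted to float. - user3222101
My actual column names look like date_security, GH_LAST_PRICE Val, GH_VOLUME_PRICE Val, date_security, DG_LAST_PRICE Val, DG_VOLUME_PRICE Val -> do you think this could be causing the problem? - user3222101
@user3222101 - Yes, that is the problem. So there are only 6 columns? - jezrael
Actually, there are 30 columns, 10 of which are date_security. - user3222101
@user3222101 - Is the data confidential? - jezrael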

One approach is to stack the date-value pairs vertically into a single DataFrame:

df.columns = ['date_1', 'value_1', 'date_2', 'value_2', 'date_3', 'value_3']

# Rename each date_i column to a common 'date' label, then stack the pairs.
pairs = [df[[f'date_{i}', f'value_{i}']].rename(columns={f'date_{i}': 'date'})
         for i in (1, 2, 3)]
# reset_index gives the 0..5 index shown in the output below.
new_df = pd.concat(pairs).dropna(how='all').reset_index(drop=True)
print(new_df)

         date  value_1  value_2  value_3
0  01-01-1990      1.0      NaN      NaN
1  02-01-1990      3.0      NaN      NaN
2  01-01-1990      NaN      2.0      NaN
3  03-01-1990      NaN     20.0      NaN
4  04-01-1990      NaN     30.0      NaN
5  02-01-1990      NaN      NaN      4.0

Then group by date and fill the gaps within each group:

new_df.groupby('date',as_index=False).apply(lambda x:x.ffill().bfill().drop_duplicates())

           date  value_1  value_2  value_3
0 0  01-01-1990      1.0      2.0      NaN
1 1  02-01-1990      3.0      NaN      4.0
2 3  03-01-1990      NaN     20.0      NaN
3 4  04-01-1990      NaN     30.0      NaN
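
A possibly simpler variant of this grouping step is `GroupBy.first()`, which returns the first non-null entry per column within each group. The sketch below assumes the stacked frame has the shape shown earlier in this answer (rebuilt here by hand) and that each date carries at most one value per value column.

```python
import numpy as np
import pandas as pd

# Stacked frame rebuilt from the printed output earlier in this answer.
new_df = pd.DataFrame(
    {
        "date": ["01-01-1990", "02-01-1990", "01-01-1990",
                 "03-01-1990", "04-01-1990", "02-01-1990"],
        "value_1": [1.0, 3.0, np.nan, np.nan, np.nan, np.nan],
        "value_2": [np.nan, np.nan, 2.0, 20.0, 30.0, np.nan],
        "value_3": [np.nan, np.nan, np.nan, np.nan, np.nan, 4.0],
    }
)

# first() skips NaN, so each date collapses to one row holding its known values.
merged = new_df.groupby("date", as_index=False).first()
print(merged)
```

This yields a flat index rather than the two-level index produced by `apply`, at the cost of silently keeping only the first value if a date has several non-null entries in the same column.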
