Pandas: drop rows with incomplete time series in panel data

3

I have a dataframe similar to the one below:

    Country1  Country2    Year
1    Italy       Greece   2000
2    Italy       Greece   2001  
3    Italy       Greece   2002
4    Germany      Italy   2000
5    Germany      Italy   2002
6    Mexico      Canada   2000
7    Mexico      Canada   2001
8    Mexico      Canada   2002
9    US          France   2000
10   US          France   2001
11   Greece      Italy    2000
12   Greece      Italy    2001

I want to keep only the rows for country pairs that have observations for the entire time series (2000-2002). So the final result should be:

    Country1  Country2    Year
1    Italy       Greece   2000
2    Italy       Greece   2001  
3    Italy       Greece   2002
4    Mexico      Canada   2000
5    Mexico      Canada   2001
6    Mexico      Canada   2002

2 Answers

3

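One idea is to reshape the data with crosstab, use DataFrame.ne and DataFrame.all to test whether each row has only non-zero values, convert the filtered index to a DataFrame with MultiIndex.to_frame, and finally get the matching rows with DataFrame.merge: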

df1 = pd.crosstab([df['Country1'], df['Country2']], df['Year'])

df = df.merge(df1.index[df1.ne(0).all(axis=1)].to_frame(index=False))
print (df)
  Country1 Country2  Year
0    Italy   Greece  2000
1    Italy   Greece  2001
2    Italy   Greece  2002
3   Mexico   Canada  2000
4   Mexico   Canada  2001
5   Mexico   Canada  2002
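
To make the intermediate steps easier to follow, here is a self-contained sketch of the same crosstab idea. The sample frame is rebuilt from the question, and the names counts, complete and complete_pairs are only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "Country1": ["Italy"] * 3 + ["Germany"] * 2 + ["Mexico"] * 3
                + ["US"] * 2 + ["Greece"] * 2,
    "Country2": ["Greece"] * 3 + ["Italy"] * 2 + ["Canada"] * 3
                + ["France"] * 2 + ["Italy"] * 2,
    "Year": [2000, 2001, 2002, 2000, 2002, 2000, 2001, 2002,
             2000, 2001, 2000, 2001],
})

# Count observations per (Country1, Country2) pair and year.
counts = pd.crosstab([df["Country1"], df["Country2"]], df["Year"])

# A pair is complete when every year column has a non-zero count.
complete = counts.ne(0).all(axis=1)

# Turn the surviving MultiIndex entries into a two-column DataFrame.
complete_pairs = counts.index[complete].to_frame(index=False)

# The default inner merge keeps only rows whose pair is in complete_pairs.
result = df.merge(complete_pairs)
print(result)
```

Note that crosstab builds its columns from the years that actually occur in the data, so "complete" here means "present in every year observed anywhere in the frame", not in a range you specify yourself.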

If you need to test for a specific range of years, compare sets inside GroupBy.transform:

r = set(range(2000, 2003))

df = df[df.groupby(['Country1', 'Country2'])['Year'].transform(lambda x: set(x) == r)]
print (df)
  Country1 Country2  Year
1    Italy   Greece  2000
2    Italy   Greece  2001
3    Italy   Greece  2002
6   Mexico   Canada  2000
7   Mexico   Canada  2001
8   Mexico   Canada  2002
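
A possible variation on the set comparison, not part of the answer above, is to use GroupBy.filter, which keeps or drops whole groups directly. This is only a sketch: the sample frame is rebuilt from the question with a default 0-based index, and required_years is an illustrative name.

```python
import pandas as pd

# Sample data reconstructed from the question (default 0-based index).
df = pd.DataFrame({
    "Country1": ["Italy"] * 3 + ["Germany"] * 2 + ["Mexico"] * 3
                + ["US"] * 2 + ["Greece"] * 2,
    "Country2": ["Greece"] * 3 + ["Italy"] * 2 + ["Canada"] * 3
                + ["France"] * 2 + ["Italy"] * 2,
    "Year": [2000, 2001, 2002, 2000, 2002, 2000, 2001, 2002,
             2000, 2001, 2000, 2001],
})

required_years = set(range(2000, 2003))  # 2000, 2001, 2002

# Keep every row of a group only if its set of years equals the required set.
result = df.groupby(["Country1", "Country2"]).filter(
    lambda g: set(g["Year"]) == required_years
)
print(result)
```

As with the transform version, the original row labels are preserved, so the result keeps the index values of the rows that survive.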

1

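One approach is to pivot the data, drop the rows with missing values, and reshape back; this only works if each Country*Year combination is unique (which it is in the sample data):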

(df.assign(dummy = 1)
   .pivot(index=['Country1', 'Country2'], columns='Year')
   .dropna()
   .stack()
   .drop(columns='dummy')
   .reset_index()
)
 
  Country1 Country2  Year
0    Italy   Greece  2000
1    Italy   Greece  2001
2    Italy   Greece  2002
3   Mexico   Canada  2000
4   Mexico   Canada  2001
5   Mexico   Canada  2002
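
Since the pivot approach depends on each Country*Year combination being unique, it may be worth checking that before reshaping. A minimal sketch, assuming the sample frame from the question; keys, has_duplicates and deduped are illustrative names:

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "Country1": ["Italy"] * 3 + ["Germany"] * 2 + ["Mexico"] * 3
                + ["US"] * 2 + ["Greece"] * 2,
    "Country2": ["Greece"] * 3 + ["Italy"] * 2 + ["Canada"] * 3
                + ["France"] * 2 + ["Italy"] * 2,
    "Year": [2000, 2001, 2002, 2000, 2002, 2000, 2001, 2002,
             2000, 2001, 2000, 2001],
})

keys = ["Country1", "Country2", "Year"]

# True if any (Country1, Country2, Year) combination occurs more than once.
has_duplicates = df.duplicated(subset=keys).any()
print(has_duplicates)

# One way to make the frame safe to pivot: keep the first row per combination.
deduped = df.drop_duplicates(subset=keys)
```

Whether dropping duplicates is appropriate depends on what the repeated rows mean in your data; the crosstab and groupby approaches do not need this step.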
