Create new columns based on conditions in other columns

3
import pandas as pd

df = pd.DataFrame({
    "A": ["2002-01-12", "2002-01-12", "2002-01-12",
          "2002-01-13", "2002-01-13", "2002-01-13",
          "2002-01-16", "2002-01-16", "2002-01-16"],
    "B": ["12:00:00", "13:00:00", "14:00:00",
          "11:00:00", "12:00:00", "13:00:00",
          "10:00:00", "11:00:00", "12:00:00"],
    "C": [3, 19, 15, 6, 1, 5, 3, 12, 8],
})

           A         B   C
0 2002-01-12  12:00:00   3
1 2002-01-12  13:00:00  19
2 2002-01-12  14:00:00  15
3 2002-01-13  11:00:00   6
4 2002-01-13  12:00:00   1
5 2002-01-13  13:00:00   5
6 2002-01-16  10:00:00   3
7 2002-01-16  11:00:00  12
8 2002-01-16  12:00:00   8

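I want to create two new columns, df['D'] and df['E'], computed per A group according to the following conditions:
  • df['D']: the value of C at B == 12:00:00 on the previous day (that is, in the previous A group).
  • df['E']: the mean of the C values of the previous day (that is, of the previous A group).
The output should be: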
           A         B   C    D     E
0 2002-01-12  12:00:00   3    0     0
1 2002-01-12  13:00:00  19    0     0
2 2002-01-12  14:00:00  15    0     0
3 2002-01-13  11:00:00   6    3  12.3
4 2002-01-13  12:00:00   1    3  12.3
5 2002-01-13  13:00:00   5    3  12.3
6 2002-01-16  10:00:00   3    1   4.0
7 2002-01-16  11:00:00  12    1   4.0
8 2002-01-16  12:00:00   8    1   4.0

Is there a 12:00:00 value for every day? - jezrael
Yes, there is. - Tie_24
2 Answers

3
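You can create a helper Series with one value per day, use shift to move each value to the following day, map the result onto the new columns, and finally replace all NaN values with fillna: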
a = df[df['B'].eq('12:00:00')].set_index('A')['C'].shift(1)
b = df.groupby('A')['C'].mean().shift(1)

df['D'] = df['A'].map(a)
df['E'] = df['A'].map(b)
df[['D','E']] = df[['D','E']].fillna(0)
print(df)
           A         B   C    D          E
0 2002-01-12  12:00:00   3  0.0   0.000000
1 2002-01-12  13:00:00  19  0.0   0.000000
2 2002-01-12  14:00:00  15  0.0   0.000000
3 2002-01-13  11:00:00   6  3.0  12.333333
4 2002-01-13  12:00:00   1  3.0  12.333333
5 2002-01-13  13:00:00   5  3.0  12.333333
6 2002-01-16  10:00:00   3  1.0   4.000000
7 2002-01-16  11:00:00  12  1.0   4.000000
8 2002-01-16  12:00:00   8  1.0   4.000000
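
To see why shift plus map gives the previous group's value, it can help to print the helper Series on their own. The following is only a minimal sketch built on the sample data from the question; the names noon, noon_prev and daily_mean_prev are made up for illustration, and it assumes the rows are already sorted by date.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["2002-01-12"] * 3 + ["2002-01-13"] * 3 + ["2002-01-16"] * 3,
    "B": ["12:00:00", "13:00:00", "14:00:00",
          "11:00:00", "12:00:00", "13:00:00",
          "10:00:00", "11:00:00", "12:00:00"],
    "C": [3, 19, 15, 6, 1, 5, 3, 12, 8],
})

# One value per date: C at 12:00:00, indexed by A (illustrative name).
noon = df[df["B"].eq("12:00:00")].set_index("A")["C"]

# shift(1) moves every value down one position, so each date now holds
# the value of the previous date that is present in the data.
# The first date has no predecessor and becomes NaN.
noon_prev = noon.shift(1)

# Same idea for the per-date mean of C.
daily_mean_prev = df.groupby("A")["C"].mean().shift(1)

print(noon_prev)
print(daily_mean_prev)
```

Because shift works by position, 2002-01-16 picks up the values from 2002-01-13 even though the two dates are three calendar days apart, which is what the expected output in the question asks for.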

Is there an alternative to df[df['B'].eq('12:00:00')].set_index('A')['C'].shift(1)? When I run it, I get the following error: InvalidIndexError: Reindexing only valid with uniquely valued Index objects - Tie_24
@Tie_24 - That means column A contains duplicate values among the rows where B is 12:00:00. You can check this with df1 = df[df['B'].eq('12:00:00')]; print(df1[df1.A.duplicated(keep=False)]). Possible solutions are to drop the duplicates and keep the first value, e.g. df[df['B'].eq('12:00:00')].drop_duplicates(subset=['A']).set_index('A')['C'].shift(1), or to aggregate the duplicates, e.g. df[df['B'].eq('12:00:00')].groupby('A')['C'].mean().shift(1) - jezrael
@Tie_24 - Glad I could help, it was an interesting question :) - jezrael
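
The duplicate case from the comments can be reproduced on a small frame. This is just a sketch: the data below is hypothetical (it is not the asker's real data), and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 2002-01-13 has two rows at 12:00:00.
dup = pd.DataFrame({
    "A": ["2002-01-12", "2002-01-13", "2002-01-13", "2002-01-16"],
    "B": ["12:00:00"] * 4,
    "C": [3, 1, 5, 8],
})

noon = dup[dup["B"].eq("12:00:00")]

# A non-unique A here means the helper Series cannot be used with map.
print(noon["A"].is_unique)  # False
print(noon[noon["A"].duplicated(keep=False)])

# Option 1: keep only the first 12:00:00 row of each date.
first = noon.drop_duplicates(subset=["A"]).set_index("A")["C"].shift(1)

# Option 2: average the duplicated rows of each date.
avg = noon.groupby("A")["C"].mean().shift(1)

print(first)
print(avg)
```

Which option is appropriate depends on what the duplicated rows mean in the real data; the two give different values for 2002-01-16 here (1 versus 3).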

0

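I came up with a different approach that also runs fine. Note that it looks up the previous calendar day (x minus one day) rather than the previous date present in A, so on the sample data the 2002-01-16 rows get 0 because there are no rows for 2002-01-15: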

df['A'] = pd.to_datetime(df['A'])

df['D'] = df['A'].apply(lambda x: df[(df['A']==(x + pd.DateOffset(-1)))&(df['B']=='12:00:00')]['C'].mean()).fillna(0)
df['E'] = df['A'].apply(lambda x: df[df['A']==(x + pd.DateOffset(-1))]['C'].mean()).fillna(0)
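
The apply version filters the whole frame once per row. A possible variant, sketched here under the assumption that the same calendar-day lookup is wanted, computes the per-date aggregates once and then maps the previous calendar day onto them. The names noon, daily_mean and prev_day are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["2002-01-12"] * 3 + ["2002-01-13"] * 3 + ["2002-01-16"] * 3,
    "B": ["12:00:00", "13:00:00", "14:00:00",
          "11:00:00", "12:00:00", "13:00:00",
          "10:00:00", "11:00:00", "12:00:00"],
    "C": [3, 19, 15, 6, 1, 5, 3, 12, 8],
})
df["A"] = pd.to_datetime(df["A"])

# Per-date aggregates, computed once (illustrative names).
noon = df[df["B"].eq("12:00:00")].groupby("A")["C"].mean()
daily_mean = df.groupby("A")["C"].mean()

# The calendar day before each row's date.
prev_day = df["A"] - pd.Timedelta(days=1)

# Dates with no data on the previous calendar day map to NaN, then to 0.
df["D"] = prev_day.map(noon).fillna(0)
df["E"] = prev_day.map(daily_mean).fillna(0)
print(df)
```

As with the apply version, the 2002-01-16 rows end up with 0 in both columns, since the sample has no data for 2002-01-15. If the previous date present in the data is what is needed, a positional shift on the per-date Series is the closer fit.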
