How do I compare all columns against a single column in pandas?

Question

How do I compare all columns against a single column in pandas?

6

Here is the content of df:

                A       B       ..... THRESHOLD             
DATE                                       
2011-01-01       NaN       NaN  .....      NaN   
2012-01-01 -0.041158 -0.161571  ..... 0.329038   
2013-01-01  0.238156  0.525878  ..... 0.110370   
2014-01-01  0.606738  0.854177  ..... -0.095147   
2015-01-01  0.200166  0.385453  ..... 0.166235

I need to compare N columns (A, B, C, ...) against THRESHOLD and output the result for each one, like this:

df['A_CALC'] = np.where(df['A'] > df['THRESHOLD'], 1, -1)
df['B_CALC'] = np.where(df['B'] > df['THRESHOLD'], 1, -1)

How can I apply this to all columns (A, B, C, ...) without writing one statement per column?

- Shakti

4 Answers

3

Maybe you can try this. Using subtract is faster than using apply.

(df.drop(['THRESHOLD'],axis=1).subtract(df.THRESHOLD,axis=0)>0)\
    .astype(int).replace({0:-1}).add_suffix('_CALC')
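
A possible variation on the same idea, offered as a sketch rather than as part of the original answer: `DataFrame.gt` with `axis=0` does the row-aligned comparison directly, and `np.where` maps the boolean result to 1/-1. The sample frame is an assumption built from the two columns visible in the question.

```python
import numpy as np
import pandas as pd

# Sample data modeled on the question (only columns A and B are shown there).
df = pd.DataFrame(
    {
        "A": [np.nan, -0.041158, 0.238156, 0.606738, 0.200166],
        "B": [np.nan, -0.161571, 0.525878, 0.854177, 0.385453],
        "THRESHOLD": [np.nan, 0.329038, 0.110370, -0.095147, 0.166235],
    },
    index=pd.to_datetime(
        ["2011-01-01", "2012-01-01", "2013-01-01", "2014-01-01", "2015-01-01"]
    ).rename("DATE"),
)

values = df.drop(columns="THRESHOLD")
# gt(..., axis=0) aligns THRESHOLD on the row index and compares every column to it.
mask = values.gt(df["THRESHOLD"], axis=0)
# np.where on a boolean DataFrame returns a 2-D array, so rebuild the frame around it.
calc = pd.DataFrame(
    np.where(mask, 1, -1), index=values.index, columns=values.columns
).add_suffix("_CALC")
print(calc)
```

Rows where either side is NaN compare as False, so they end up as -1, the same as with `np.where(df['A'] > df['THRESHOLD'], 1, -1)`.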

- BENY

0

Would the following be enough?

for col in df.columns.values:
    if col!= 'THRESHOLD':
        newname = col+'_CALC'
        df[newname] = np.where(df[col] > df['THRESHOLD'], 1, -1)

- durbachit

Using for loops is not recommended when working with pandas columns. - cs95

Oops! Why is that? I can imagine it being time-consuming, but I have never run into this problem. - durbachit

Precisely because it is so time-consuming :] - cs95

1

When you work with large-scale data, you will find that those for loops eat up the run time and make the possible impossible. - BENY
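
For anyone who wants to measure this on their own data, here is a rough timing sketch. The frame size, the column names `c0`, `c1`, ... and the helper names `with_loop` and `with_subtract` are all hypothetical. Note that the loop in the answer runs once per column, not once per row, so how much it costs depends on the shape of the frame.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sizes, chosen only for illustration.
n_rows, n_cols = 10_000, 50
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.standard_normal((n_rows, n_cols)),
    columns=[f"c{i}" for i in range(n_cols)],
)
data["THRESHOLD"] = rng.standard_normal(n_rows)


def with_loop(frame):
    # One np.where call per column, as in the loop answer.
    out = {}
    for col in frame.columns:
        if col != "THRESHOLD":
            out[col + "_CALC"] = np.where(frame[col] > frame["THRESHOLD"], 1, -1)
    return pd.DataFrame(out, index=frame.index)


def with_subtract(frame):
    # The subtract-based expression from the other answer.
    values = frame.drop(["THRESHOLD"], axis=1)
    return (
        (values.subtract(frame["THRESHOLD"], axis=0) > 0)
        .astype(int)
        .replace({0: -1})
        .add_suffix("_CALC")
    )


loop_result = with_loop(data)
subtract_result = with_subtract(data)
same = bool((loop_result.values == subtract_result.values).all())

loop_time = timeit.timeit(lambda: with_loop(data), number=5)
subtract_time = timeit.timeit(lambda: with_subtract(data), number=5)
print(f"same result: {same}, loop: {loop_time:.4f}s, subtract: {subtract_time:.4f}s")
```

The timings vary with machine, pandas version, and the number of rows and columns, so treat the printed numbers as a local measurement only.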

0

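I needed to compare some columns against one column (changing some columns and leaving others unchanged). I used cs95's answer and set the index.

Put the columns you want to keep in the index (here col1 and col2).
Any column that is not in the index gets 1 where it is greater than col2, and 0 otherwise.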

Data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': range(10, 15), 'col2': range(1, 6), 'col3': np.random.randn(5) + 3, 'col4': np.random.randn(5) + 3, 'col5': np.random.randn(5)})

    col1    col2    col3        col4        col5
0   10      1       2.741873    2.402274    -1.208714
1   11      2       3.328949    2.692367    -0.813730
2   12      3       5.074692    3.155199    -0.721969
3   13      4       2.725135    3.393867    -2.452344
4   14      5       3.626220    3.002514    -0.897204

Code:

import numpy as np

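# Keep a copy of col2 as a regular column so it is still available after set_index.
df['col2_copy'] = df['col2']
df = df.set_index(['col1', 'col2'])
# Compare every remaining column with col2_copy, then restore the index columns.
df = df.apply(lambda x: np.where(x > df['col2_copy'], 1, 0), axis=0).reset_index().drop(['col2_copy'], axis=1)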

Output:

    col1    col2    col3    col4    col5
0   10      1       1       1       0
1   11      2       1       1       0
2   12      3       1       1       0
3   13      4       0       0       0
4   14      5       0       0       0
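
The same result can probably be reached without moving columns into the index. The following is a sketch under that assumption, not the method used in this answer: it selects the columns to change by name and compares them to col2 with `DataFrame.gt`. The values are copied from the sample table so the output is reproducible, and `keep` and `compare` are illustrative names.

```python
import pandas as pd

# Values copied from the sample data shown above (the original used random numbers).
df = pd.DataFrame(
    {
        "col1": [10, 11, 12, 13, 14],
        "col2": [1, 2, 3, 4, 5],
        "col3": [2.741873, 3.328949, 5.074692, 2.725135, 3.626220],
        "col4": [2.402274, 2.692367, 3.155199, 3.393867, 3.002514],
        "col5": [-1.208714, -0.813730, -0.721969, -2.452344, -0.897204],
    }
)

keep = ["col1", "col2"]  # columns to leave unchanged
compare = [c for c in df.columns if c not in keep]  # preserves column order

result = df.copy()
# Compare each remaining column to col2 row by row, then cast True/False to 1/0.
result[compare] = df[compare].gt(df["col2"], axis=0).astype(int)
print(result)
```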

- Grant Shannon


- cs95 · Accepted Answer

You can use df.apply:

In [670]: df.iloc[:, :-1]\
            .apply(lambda x: np.where(x > df.THRESHOLD, 1, -1), axis=0)\
            .add_suffix('_CALC')
Out[670]: 
            A_CALC  B_CALC
Date                      
2011-01-01      -1      -1
2012-01-01      -1      -1
2013-01-01       1       1
2014-01-01       1       1
2015-01-01       1       1

If THRESHOLD is not your last column, it is better to use

df[df.columns.difference(['THRESHOLD'])].apply(lambda x: np.where(x > df.THRESHOLD, 1, -1), axis=0).add_suffix('_CALC')
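
As a self-contained sketch of the expression above, the following builds a small frame from the values shown in the question, deliberately places THRESHOLD in the middle, and joins the `_CALC` columns back onto the original frame. The `join` step and the column order of the sample frame are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Sample values taken from the question; THRESHOLD is deliberately not the last column.
df = pd.DataFrame(
    {
        "B": [np.nan, -0.161571, 0.525878, 0.854177, 0.385453],
        "THRESHOLD": [np.nan, 0.329038, 0.110370, -0.095147, 0.166235],
        "A": [np.nan, -0.041158, 0.238156, 0.606738, 0.200166],
    },
    index=pd.to_datetime(
        ["2011-01-01", "2012-01-01", "2013-01-01", "2014-01-01", "2015-01-01"]
    ).rename("DATE"),
)

# Index.difference returns the remaining labels in sorted order (A, B here).
cols = df.columns.difference(["THRESHOLD"])
calc = (
    df[cols]
    .apply(lambda x: np.where(x > df["THRESHOLD"], 1, -1), axis=0)
    .add_suffix("_CALC")
)

# Attach the new columns to the original frame, aligned on the DATE index.
out = df.join(calc)
print(out)
```

Because `columns.difference` sorts the labels, the `_CALC` columns may come out in a different order than the source columns. If the original order matters, `df.drop(columns='THRESHOLD')` selects the same columns without reordering them.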