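I have a dataframe that looks like this.
The dataframe has 11 columns, and each column holds a grade. For each record, I need to count how many A, B, and C grades it contains. This is my expected output.
I tried to do this with the apply function. So far, my code is as follows: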
import pandas as pd
# sample data
df_dict = {'level_1': {0: 'C', 1: 'A', 2: 'C', 3: 'B', 4: 'A', 5: 'C', 6: 'A', 7: 'B', 8: 'B'},
'level_2': {0: 'B', 1: 'B', 2: 'C', 3: 'A', 4: 'A', 5: 'C', 6: 'B', 7: 'C', 8: 'A'},
'level_3': {0: 'B', 1: 'A', 2: 'B', 3: 'A', 4: 'B', 5: 'B', 6: 'C', 7: 'B', 8: 'C'},
'level_4': {0: 'A', 1: 'C', 2: 'B', 3: 'C', 4: 'B', 5: 'C', 6: 'A', 7: 'B', 8: 'C'},
'level_5': {0: 'B', 1: 'B', 2: 'B', 3: 'A', 4: 'A', 5: 'A', 6: 'B', 7: 'B', 8: 'A'},
'level_6': {0: 'C', 1: 'C', 2: 'C', 3: 'B', 4: 'B', 5: 'B', 6: 'C', 7: 'A', 8: 'C'},
'level_7': {0: 'C', 1: 'A', 2: 'C', 3: 'C', 4: 'C', 5: 'C', 6: 'C', 7: 'A', 8: 'A'},
'level_8': {0: 'B', 1: 'A', 2: 'B', 3: 'B', 4: 'B', 5: 'A', 6: 'A', 7: 'A', 8: 'C'},
'level_9': {0: 'A', 1: 'B', 2: 'A', 3: 'C', 4: 'C', 5: 'B', 6: 'A', 7: 'C', 8: 'B'},
'level_10': {0: 'B', 1: 'C', 2: 'A', 3: 'A', 4: 'A', 5: 'A', 6: 'A', 7: 'A', 8: 'C'},
'level_11': {0: 'C', 1: 'B', 2: 'C', 3: 'B', 4: 'C', 5: 'B', 6: 'B', 7: 'C', 8: 'B'}
}
sample_df = pd.DataFrame(df_dict)
# function to count the values of A, B, C
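def count_in_df(series):
    # Count each grade in the row; reindex so that a grade missing
    # from the row yields 0 instead of raising a KeyError
    counts = series.value_counts().reindex(['A', 'B', 'C'], fill_value=0)
    return counts.tolist()
count_df = pd.DataFrame(sample_df.apply(count_in_df, axis=1).tolist(),
                        columns=['counts_of_A', 'counts_of_B', 'counts_of_C'])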
# add count information back
sample_df = sample_df.join(count_df)
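This gives me the information I need, but the code is far too slow. My real data has about 700,000 records and 66 columns (rather than 11), and the operation took roughly 30 minutes to finish.
Is there a way to speed up this code? What optimizations could I try?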
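
One direction that may be worth trying is to replace the row-wise apply with whole-frame comparisons. The block below is a minimal sketch of that idea, not a benchmarked solution: `demo_df` is a small hypothetical frame standing in for `sample_df`, and it assumes every grade column holds plain strings. For each grade it builds a boolean mask with `DataFrame.eq` and sums across the row, so pandas does the counting in vectorized form rather than calling a Python function once per record.

```python
import pandas as pd

# Hypothetical stand-in for sample_df; the values are illustrative only
demo_df = pd.DataFrame({
    'level_1': ['C', 'A', 'B'],
    'level_2': ['B', 'A', 'B'],
    'level_3': ['A', 'A', 'C'],
})

# Capture the grade columns first so the new count columns are not compared too
level_cols = list(demo_df.columns)

for grade in ['A', 'B', 'C']:
    # eq() gives a boolean frame; summing along axis=1 counts matches per row
    demo_df[f'counts_of_{grade}'] = demo_df[level_cols].eq(grade).sum(axis=1)

print(demo_df)
```

Whether this is fast enough at 700,000 rows by 66 columns would need to be measured on the real data. Three comparison passes over the frame are likely to cost much less than 700,000 separate `value_counts` calls.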