基于随机选择的区间,获取数据框各列的均值和总和

3

我有一个如下的数据框。

数据

Index   ID  AA  BB  CC  BIN
0       Z1  10  11  12  1
1       Z1  0   12  13  1
2       Z1  20  13  14  2
3       Z1  34  14  15  3
4       Z1  54  52  16  3
5       Z1  67  53  17  3
6       Z7  45  54  18  1
7       Z7  34  55  19  2
8       Z7  45  56  57  2
9       Z7  45  56  58  3
10      Z7  67  67  59  3

我希望获得一个类似下面的数据框:
data2
ID   AA_SUM_12  AA_MEAN_12  BB_SUM_12  BB_MEAN_12  CC_SUM_12  CC_MEAN_12
Z1   30         10          36         12          39         13
Z7   124        41.33       165        55          94         31.33

SUM_12 求和的范围是 'BIN' = 1 and 2MEAN_12 的概念相同。

在真实数据集中,有超过3000个不同的ID,'BIN' 的取值范围为 15

我想随机选取 'BIN',例如选取 'BIN'135 时求均值,选取 'BIN'45 时求和,并将结果以数据框形式呈现。

如何实现?


你能提供可重现的数据框代码吗? - Mohit Sharma
看起来是在 BIN 列上进行分组过滤。不过我对于 bin 可能出现的随机值有些困惑。 - Celius Stingher
1
我想从数据框中随机选择'BIN',当'BIN'为1、3、5时取平均值,当'BIN'为4、5时取和。请问您可以提供更多的细节吗?因为'1,3,5'的长度是随机的,可能是3,而'4,5'的长度也是随机的,可能是2。 - jezrael
随机选择的组在哪里? - jezrael
3个回答

1
所以这是你的Pandas数据框:
>>> df = pd.DataFrame(
...     [
...             ['Z1', 10, 11, 12, 1],
...             ['Z1', 0,  12, 13, 1],
...             ['Z1', 20, 13, 14, 2],
...             ['Z1', 34, 14, 15, 3],
...             ['Z1', 54, 52, 16, 3],
...             ['Z1', 67, 53, 17, 3],
...             ['Z7', 45, 54, 18, 1],
...             ['Z7', 34, 55, 19, 2],
...             ['Z7', 45, 56, 57, 2],
...             ['Z7', 45, 56, 58, 3],
...             ['Z7', 67, 67, 59, 3],
... 
...     ], columns=['ID', 'AA', 'BB', 'CC', 'BIN']
... )
>>> 
>>> 
>>> df
    ID  AA  BB  CC  BIN
0   Z1  10  11  12    1
1   Z1   0  12  13    1
2   Z1  20  13  14    2
3   Z1  34  14  15    3
4   Z1  54  52  16    3
5   Z1  67  53  17    3
6   Z7  45  54  18    1
7   Z7  34  55  19    2
8   Z7  45  56  57    2
9   Z7  45  56  58    3
10  Z7  67  67  59    3

现在以下内容应该能解决问题:

# Randomly select 2 bins out of [1, 2, 3, 4, 5]
bins = random.sample(set([1, 3, 5]), 2)

final_df = df.loc[df['BIN'].isin(bins)]\
  .groupby('ID')\
  .agg(AA_SUM_=('AA', 'sum'), AA_MEAN_=('AA', 'mean'), BB_SUM_=('BB', 'sum'), BB_MEAN_=('BB', 'mean'), CC_SUM_=('CC', 'sum'), CC_MEAN_=('CC', 'mean'))

# Rename columns accordingly
suffix = ''.join([str(x) for x in bins])
final_df.columns = [c + suffix for c in final_df.columns]

而输出将是:

    AA_SUM_12  AA_MEAN_12  BB_SUM_12  BB_MEAN_12  CC_SUM_12  CC_MEAN_12
ID                                                                     
Z1         30   10.000000         36          12         39   13.000000
Z7        124   41.333333        165          55         94   31.333333

1
我理解这个问题需要随机生成长度为2或3的唯一的BIN
print (df)
    ID  AA  BB  CC  BIN
0   Z1  10  11  12    1
1   Z1   0  12  13    1
2   Z1  20  13  14    2
3   Z1  34  14  15    4
4   Z1  54  52  16    5
5   Z1  67  53  17    3
6   Z7  45  54  18    4
7   Z7  34  55  19    2
8   Z7  45  56  57    4
9   Z7  45  56  58    3
10  Z7  67  67  59    3

首先获取所有唯一值:

v = df['BIN'].unique()
print (v)
[1 2 4 5 3]

并使用生成的随机长度23传递给numpy.random.choice

r = np.random.choice(v, size=np.random.choice([2,3]))
print (r)
[3 5 1]

new = ''.join((str(x) for x in r))

然后按照Series.isin布尔索引过滤行,并使用mean聚合sum,最后将生成的BINS转换为string并添加到列名中:

df1 = df[df['BIN'].isin(r)].groupby('ID')[ 'AA', 'BB', 'CC'].agg(['mean','sum'])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}_{new}')
print (df1)
    AA_mean_351  AA_sum_351  BB_mean_351  BB_sum_351  CC_mean_351  CC_sum_351
ID                                                                           
Z1        32.75         131         32.0         128         14.5          58
Z7        56.00         112         61.5         123         58.5         117

1
import random as r


def mean(listin):
    if len(listin) == 0:
        return "X"
    a = 0
    for  _i in listin:
        a += _i
    return (a)/len(listin)

def dsum(listin):
    print(listin)
    a = 0
    for  _i in listin:
        a += _i
    return a

def random_d():
    d = {"ID":"Z"+str((r.randint(0,10))),
         "AA":r.randint(0,70),
         "BB":r.randint(0,70),
         "CC":r.randint(0,70),
         "BIN":r.randint(1,3),}

    return d



def data_bin_z(z,bin1,bin2):
    z = "Z"+str(z)
    AA_SUM_name = "AA_SUM_"+str(bin1)+str(bin2)
    AA_MEAN_name = "AA_MEAN_"+str(bin1)+str(bin2)
    BB_SUM_name = "BB_SUM_"+str(bin1)+str(bin2)
    BB_MEAN_name = "BB_MEAN_"+str(bin1)+str(bin2)
    CC_SUM_name = "CC_SUM_"+str(bin1)+str(bin2)
    CC_MEAN_name = "CC_MEAN_"+str(bin1)+str(bin2)
    AA_SUM = []
    AA_MEAN = []
    BB_SUM = []
    BB_MEAN = []
    CC_SUM = []
    CC_MEAN = []

    for _dict in data:
        if _dict["ID"] == z:
            _bin = _dict["BIN"]
            if _bin == bin1 or _bin == bin2:
                AA_SUM.append(_dict["AA"])
                AA_MEAN.append(_dict["AA"])
                BB_SUM.append(_dict["BB"])
                BB_MEAN.append(_dict["BB"])
                CC_SUM.append(_dict["CC"])
                CC_MEAN.append(_dict["CC"])


    AA_SUM = dsum(AA_SUM)
    AA_MEAN = mean(AA_MEAN)
    BB_SUM = dsum(BB_SUM)
    BB_MEAN = mean(BB_MEAN)
    CC_SUM = dsum(CC_SUM)
    CC_MEAN = mean(CC_MEAN)

    ret = {"ID":z,
           AA_SUM_name:AA_SUM,
           AA_MEAN_name:AA_MEAN,
           BB_SUM_name:BB_SUM,
           BB_MEAN_name:BB_MEAN,
           CC_SUM_name:CC_SUM,
           CC_MEAN_name:CC_MEAN}

    return ret

#Put dataframe here
data = []

for _i in range(100):
    data.append(random_d())

#------------------

out_data = []
bin_combs = [[1,2],[1,3],[1,4],[1,5],[2,3],[2,4],[2,5],[3,4],[3,5],[4,5]]
for _ID in range(10):
    for comb in bin_combs:
        out_data.append(data_bin_z(_ID,comb[0],comb[1]))


print(out_data)

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接