Create new columns in pandas based on the size of groupby elements

Question


Here is an example data frame:

    id  Section A   B
0   abc foo 0.1 0.6
1   abc foo 0.2 0.3
2   abc bar 0.5 0.1
3   def foo 0.1 0.1
4   def bar 0.1 0.3
5   def bar 0.6 0.1
6   ghj foo 0.3 0.1
7   ghj foo 0.1 0.7
8   ghj bar 0.1 0.2

The following lists will be used to create the new columns df['AA'] and df['BB'].

A_foo = [0.1,2]
A_bar = [1,0.3]

B_foo = [0.4,0.2]
B_bar = [1.2,0.5]

This is what I have tried so far:

g = df.groupby('id')[['A', 'B']]
for i, i_d in g:
    print(i_d)

Note: the length of `A_foo`, `A_bar`, `B_foo` and `B_bar` is always greater than or equal to the number of rows in `df[df.Section == 'foo']` and `df[df.Section == 'bar']` for any unique id.

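To create df['AA'], for the 'foo' and 'bar' rows of df['Section'] within each id, I want to take the corresponding values from A_foo and A_bar.

For example, in the first i_d (id = 'abc'), df.Section has two 'foo' rows and one 'bar' row, so the first three rows of df['AA'] are:

[0.1, 2, 1, ... (0.1 and 2 from A_foo, and 1 from A_bar)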

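In the second i_d (id = 'def'), df.Section has one 'foo' row and two 'bar' rows, so I need to append 0.1 from A_foo, followed by 1 and 0.3 from A_bar.

Now df['AA'] will look like [0.1, 2, 1, 0.1, 1, 0.3, ...

From the last i_d (id = 'ghj'), I will take 0.1 and 2 from A_foo and 1 from A_bar.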

df['AA'] = [0.1,2,1,0.1,1,0.3,0.1,2,1]

Similarly, df['BB'] is created from B_foo and B_bar.

df['BB'] = [0.4,0.2,1.2,0.4,1.2,0.5,0.4,0.2,1.2]

This is the final data frame (df):

    id  Section A   B   AA  BB
0   abc foo    0.1  0.6 0.1 0.4
1   abc foo    0.2  0.3 2.0 0.2
2   abc bar    0.5  0.1 1.0 1.2
3   def foo    0.1  0.1 0.1 0.4
4   def bar    0.1  0.3 1.0 1.2
5   def bar    0.6  0.1 0.3 0.5
6   ghj foo    0.3  0.1 0.1 0.4
7   ghj foo    0.1  0.7 2.0 0.2
8   ghj bar    0.1  0.2 1.0 1.2

- A.Z

What should happen if an 'id' has more 'foo' rows than the length of the foo list? - ALollz

1 Answer


- ALollz · Accepted Answer

Use groupby + cumcount to create a positional index within each (id, Section) group, then use np.select to assign values from the respective lists.

import numpy as np

# Position of each row within its (id, Section) group: 0, 1, 2, ...
df['idx'] = df.groupby(['id', 'Section']).cumcount()

# One condition per section, paired with values looked up by position
conds = [df.Section.eq('foo'), df.Section.eq('bar')]
AA_choice = [np.array(A_foo)[df.idx], np.array(A_bar)[df.idx]]
BB_choice = [np.array(B_foo)[df.idx], np.array(B_bar)[df.idx]]

# Rows matching neither section get NaN
df['AA'] = np.select(conds, AA_choice, default=np.nan)
df['BB'] = np.select(conds, BB_choice, default=np.nan)

Output:

    id Section    A    B  idx   AA   BB
0  abc     foo  0.1  0.6    0  0.1  0.4
1  abc     foo  0.2  0.3    1  2.0  0.2
2  abc     bar  0.5  0.1    0  1.0  1.2
3  def     foo  0.1  0.1    0  0.1  0.4
4  def     bar  0.1  0.3    0  1.0  1.2
5  def     bar  0.6  0.1    1  0.3  0.5
6  ghj     foo  0.3  0.1    0  0.1  0.4
7  ghj     foo  0.1  0.7    1  2.0  0.2
8  ghj     bar  0.1  0.2    0  1.0  1.2
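
For anyone who wants to run the approach end to end, here is a self-contained sketch. It rebuilds the example frame from the question and keeps the positional index in a local array (`idx`, a name chosen here for illustration) instead of adding an `idx` column; that is a small variation, not a change to the method.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "id": ["abc"] * 3 + ["def"] * 3 + ["ghj"] * 3,
    "Section": ["foo", "foo", "bar", "foo", "bar", "bar", "foo", "foo", "bar"],
    "A": [0.1, 0.2, 0.5, 0.1, 0.1, 0.6, 0.3, 0.1, 0.1],
    "B": [0.6, 0.3, 0.1, 0.1, 0.3, 0.1, 0.1, 0.7, 0.2],
})

A_foo, A_bar = [0.1, 2], [1, 0.3]
B_foo, B_bar = [0.4, 0.2], [1.2, 0.5]

# Position of each row within its (id, Section) group.
idx = df.groupby(["id", "Section"]).cumcount().to_numpy()

# Boolean masks, one per section.
conds = [df["Section"].eq("foo").to_numpy(), df["Section"].eq("bar").to_numpy()]

# For each row, pick the value at its position from the matching list.
df["AA"] = np.select(conds, [np.array(A_foo)[idx], np.array(A_bar)[idx]], default=np.nan)
df["BB"] = np.select(conds, [np.array(B_foo)[idx], np.array(B_bar)[idx]], default=np.nan)

print(df)
```

The resulting AA and BB columns should match the expected lists given in the question.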

If your lists are not long enough, you will get an IndexError. In that case, consider indexing with the modulus so that positions wrap around: np.array(A_foo)[df.idx % len(A_foo)].
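
A minimal sketch of the wrap-around behaviour, using hypothetical data (three 'foo' rows for one id, but only two values in A_foo) that is not part of the question. It shows that plain indexing fails while the modulus version repeats the list from the start.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one id with three "foo" rows, but only two list values.
df = pd.DataFrame({"id": ["abc"] * 3, "Section": ["foo"] * 3})
A_foo = [0.1, 2]

# Positions within the (id, Section) group: 0, 1, 2
idx = df.groupby(["id", "Section"]).cumcount().to_numpy()

# Plain indexing fails because position 2 is out of range for a 2-element list.
try:
    np.array(A_foo)[idx]
    raised = False
except IndexError:
    raised = True

# The modulus maps position 2 back to 0, so values repeat from the start.
wrapped = np.array(A_foo)[idx % len(A_foo)]

print(raised, wrapped.tolist())  # True [0.1, 2.0, 0.1]
```

Note that wrapping reuses values silently, so it is only appropriate if repeating the list is actually the intended behaviour.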