Find the mean of the x-components of an array of coordinates, grouped by the y-component

Question

4

I have the following example array of x-y coordinate pairs:

import numpy as np

A = np.array([[0.33703753, 3.],
              [0.90115394, 5.],
              [0.91172016, 5.],
              [0.93230994, 3.],
              [0.08084283, 3.],
              [0.71531777, 2.],
              [0.07880787, 3.],
              [0.03501083, 4.],
              [0.69253184, 4.],
              [0.62214452, 3.],
              [0.26953094, 1.],
              [0.4617873 , 3.],
              [0.6495549 , 0.],
              [0.84531478, 4.],
              [0.08493308, 5.]])

My goal is to reduce this to an array with six rows by taking the mean of the x-values for each y-value, like so:

array([[0.6495549 , 0.        ],
       [0.26953094, 1.        ],
       [0.71531777, 2.        ],
       [0.41882167, 3.        ],
       [0.52428582, 4.        ],
       [0.63260239, 5.        ]])

Currently I do this by converting to a pandas DataFrame, performing the calculation, and then converting back to a numpy array:

>>> df = pd.DataFrame({'x':A[:, 0], 'y':A[:, 1]})
>>> df.groupby('y').mean().reset_index()
     y         x
0  0.0  0.649555
1  1.0  0.269531
2  2.0  0.715318
3  3.0  0.418822
4  4.0  0.524286
5  5.0  0.632602

Is there a way to perform this calculation using numpy, without having to resort to the pandas library?

- CDJB

Does this answer your question? Is there any numpy group by function? - Pranav Hosangadi

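Unfortunately not, @PranavHosangadi. The answers to that question only produce a list of the x-coordinates; they neither keep the y-coordinates nor compute the mean. - CDJB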

Clever np answers have been proposed. But is there any benefit to not using Pandas, which gives a readable one-liner? - user19077881

1

If this is the only reason to import pandas, then a numpy-only answer avoids the need for an extra library. - Pranav Hosangadi

1

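@user19077881 I added some timing comparisons between the pandas and numpy-only approaches to my answer below. The numpy-only approaches clearly win, so that is another reason to use them instead of going through pandas. - Pranav Hosangadi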

4 Answers

3

Use np.bincount and np.unique:

sums = np.bincount(A[:, 1].astype(np.int64), weights=A[:, 0])
values, counts = np.unique(A[:, 1], return_counts=True)
res = np.vstack((sums / counts, values)).T
print(res)

Output

[[0.6495549  0.        ]
 [0.26953094 1.        ]
 [0.71531777 2.        ]
 [0.41882167 3.        ]
 [0.52428582 4.        ]
 [0.63260239 5.        ]]
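
One caveat: np.bincount returns a bin for every integer from 0 up to the largest y-value, whereas np.unique returns only the values that actually occur, so the division above lines up only when the y-values are non-negative integers with no gaps. A minimal sketch of a variant that should not need that assumption bins on the inverse indices from np.unique instead. The helper name group_mean_bincount and the demo array are illustrative only:

```python
import numpy as np

def group_mean_bincount(A):
    """Mean of column 0 for each distinct value in column 1 (hypothetical helper)."""
    # inverse[i] is the position of A[i, 1] among the sorted unique values,
    # so it is always a gap-free integer label in 0 .. len(values) - 1.
    values, inverse, counts = np.unique(
        A[:, 1], return_inverse=True, return_counts=True
    )
    sums = np.bincount(inverse, weights=A[:, 0], minlength=len(values))
    return np.column_stack((sums / counts, values))

# Demo data with gaps between the y-values and one non-integer y-value.
demo = np.array([[1.0, 7.0],
                 [3.0, 7.0],
                 [10.0, 2.0],
                 [4.0, 7.5]])
result = group_mean_bincount(demo)
print(result)
```

Because the labels come from np.unique rather than from the y-values themselves, this version also accepts negative or fractional y-values, at the cost of the sort that np.unique performs internally.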

- Dani Mesejo

2

Here is a workaround that uses only numpy.

unique_ys, indices = np.unique(A[:, 1], return_inverse=True)
result = np.empty((unique_ys.shape[0], 2))

for i, y in enumerate(unique_ys):
    result[i, 0] = np.mean(A[indices == i, 0])
    result[i, 1] = y

print(result)

Alternative:
To make the code more Pythonic, you can build the result array with a list comprehension instead of a for loop.

unique_ys, indices = np.unique(A[:, 1], return_inverse=True)
result = np.array([[np.mean(A[indices == i, 0]), y] for i, y in enumerate(unique_ys)])

print(result)

Output:

[[0.6495549  0.        ]
 [0.26953094 1.        ]
 [0.71531777 2.        ]
 [0.41882167 3.        ]
 [0.52428582 4.        ]
 [0.63260239 5.        ]]

- Jamiu S.

0

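If you know the y-values in advance, you can match against each of them in turn.

For example:

A[(A[:,1]==1),0] gives you all the x-values whose y-value equals 1.

So you can loop over each y-value, sum A[:,1]==y[n] to get the number of matches, sum the matching x-values, divide by the count to get the mean, and store it in a new array: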

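B = np.zeros([6, 2])

for i in range(6):
    mask = A[:, 1] == i        # rows whose y-value equals i
    nmatch = mask.sum()        # number of matching rows
    nsum = A[mask, 0].sum()    # sum of the matching x-values

    B[i, 0] = nsum / nmatch    # mean x first, matching the column order in the question
    B[i, 1] = i                # y-value second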

There must be a more Pythonic way to do this...

- XaC

- Pranav Hosangadi · Accepted Answer

Here is a fully vectorized solution that uses only numpy methods and no Python iteration:

sort_indices = np.argsort(A[:, 1])
unique_y, unique_indices, group_count  = np.unique(A[sort_indices, 1], return_index=True, return_counts=True)

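Once we have the starting indices and counts of all the unique elements, we can use the np.ufunc.reduceat method to collect the result of np.add over each group, then divide by the group counts to get the means.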

group_sum = np.add.reduceat(A[sort_indices, :], unique_indices, axis=0)

group_mean = group_sum / group_count[:, None]
# array([[0.6495549 , 0.        ],
#        [0.26953094, 1.        ],
#        [0.71531777, 2.        ],
#        [0.41882167, 3.        ],
#        [0.52428582, 4.        ],
#        [0.63260239, 5.        ]])
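
To make the reduceat step more concrete, here is a small toy illustration on made-up data (the array x and the start indices are assumptions chosen for the example, not taken from the question). Each start index opens a segment that runs up to the next start index, which is why the rows have to be sorted by y first so that every group is contiguous:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
starts = np.array([0, 2, 5])  # each segment runs from one start index up to the next

# One call sums x[0:2], x[2:5] and x[5:].
segment_sums = np.add.reduceat(x, starts)

# Segment lengths: the gaps between consecutive starts, plus the tail of the array.
segment_sizes = np.diff(np.append(starts, len(x)))

segment_means = segment_sums / segment_sizes
print(segment_sums)
print(segment_means)
```

In the answer's code the same division is applied to the y-column as well: within a group every y is identical, so the sum of the y-values divided by the count simply gives back y.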

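Benchmarks:

I compared this solution with the other answers here (code on tio.run) in two scenarios:

A contains 10k rows and A[:, 1] contains N groups, with N varying from 1 to 10k. (Figure: Timing for different methods with 10k rows, N groups)

A contains N rows (N varying from 1 to 10k) and A[:, 1] contains at most min(N, 1000) groups. (Figure: Timing for different methods with N rows, 1k groups)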

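Observations: The numpy-only solutions (Dani's and mine) win easily. They are significantly faster than the pandas approach, possibly because creating the DataFrame is an overhead that the numpy-only solutions do not have.

The pandas solution is slower than the python+numpy solutions (Jamiu's and the earlier, loop-based version of mine noted below) on smaller arrays, because simply iterating in Python and getting it over with is faster than creating a DataFrame first. However, these solutions become much slower than pandas as the array size or the number of groups increases.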
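
The benchmark code itself lives on tio.run and is not reproduced here. As a rough sketch of how such a comparison could be set up, one might time a vectorized variant against a loop-based one with timeit. The data is synthetic, the sizes are arbitrary, the function names are hypothetical, and this is not the code behind the plots:

```python
import timeit
import numpy as np

def mean_reduceat(A):
    # Vectorized: sort by y, then sum each contiguous group with reduceat.
    order = np.argsort(A[:, 1])
    _, starts, counts = np.unique(A[order, 1], return_index=True, return_counts=True)
    return np.add.reduceat(A[order], starts, axis=0) / counts[:, None]

def mean_loop(A):
    # Python-level loop: one boolean mask and one mean per distinct y-value.
    ys, inverse = np.unique(A[:, 1], return_inverse=True)
    return np.array([[A[inverse == i, 0].mean(), y] for i, y in enumerate(ys)])

# Synthetic data: random x-values, integer-valued y labels stored as floats.
rng = np.random.default_rng(0)
n_rows, n_groups = 10_000, 100
A_big = np.column_stack(
    (rng.random(n_rows), rng.integers(0, n_groups, n_rows).astype(float))
)

# Both variants should agree before their timings are compared.
assert np.allclose(mean_reduceat(A_big), mean_loop(A_big))

for fn in (mean_reduceat, mean_loop):
    seconds = timeit.timeit(lambda: fn(A_big), number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e3:.3f} ms per call")
```

Absolute timings depend on the machine and the numpy version, so only the relative trend as n_rows and n_groups change is meaningful.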

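Note: The previous version of this answer iterated over the groups returned by the accepted answer to Is there any numpy group by function? and computed the mean of each group individually: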

First, we need to sort the array by the column you want to group by.

A_s = A[A[:, 1].argsort(), :]

Then, run that snippet. np.split splits its first argument at the indices given by its second argument.

unique_elems, unique_indices = np.unique(A_s[:, 1], return_index=True) 
# (array([0., 1., 2., 3., 4., 5.]), array([ 0,  1,  2,  3,  9, 12])) 

split_indices = unique_indices[1:] # No need to split at the first index

groups = np.split(A_s, split_indices)
# [array([[0.6495549, 0.       ]]),
#  array([[0.26953094, 1.        ]]),
#  array([[0.71531777, 2.        ]]),
#  array([[0.33703753, 3.        ],
#         [0.93230994, 3.        ],
#         [0.08084283, 3.        ],
#         [0.07880787, 3.        ],
#         [0.62214452, 3.        ],
#         [0.4617873 , 3.        ]]),
#  array([[0.03501083, 4.        ],
#         [0.69253184, 4.        ],
#         [0.84531478, 4.        ]]),
#  array([[0.90115394, 5.        ],
#         [0.91172016, 5.        ],
#         [0.08493308, 5.        ]])]

Now, groups is a list containing several np.arrays. Iterate over the list and take the mean of each array:

means = np.zeros((len(groups), groups[0].shape[1]))
for i, grp in enumerate(groups):
    means[i, :] = grp.mean(axis=0)

# array([[0.6495549 , 0.        ],
#        [0.26953094, 1.        ],
#        [0.71531777, 2.        ],
#        [0.41882167, 3.        ],
#        [0.52428582, 4.        ],
#        [0.63260239, 5.        ]])