Grouping and averaging a numpy matrix

Question

15

Suppose I have an arbitrary numpy matrix that looks like this:

arr = [[  6.0   12.0   1.0]
       [  7.0   9.0   1.0]
       [  8.0   7.0   1.0]
       [  4.0   3.0   2.0]
       [  6.0   1.0   2.0]
       [  2.0   5.0   2.0]
       [  9.0   4.0   3.0]
       [  2.0   1.0   4.0]
       [  8.0   4.0   4.0]
       [  3.0   5.0   4.0]]

How can I efficiently average the rows, grouped by the number in the third column?

The expected output is:

result = [[  7.0  9.33  1.0]
          [  4.0  3.0  2.0]
          [  9.0  4.0  3.0]
          [  4.33  3.33  4.0]]

- Algorithm

Using only numpy and no loops: https://dev59.com/i6zka4cB1Zd3GeqP6U2N#66871328 - Marco Cerliani

4 Answers

6

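You can do:

import numpy as np

# arr is the 10x3 array from the question; results collects one row per group
results = []
for x in sorted(np.unique(arr[...,2])):
    results.append([np.average(arr[np.where(arr[...,2]==x)][...,0]),
                    np.average(arr[np.where(arr[...,2]==x)][...,1]),
                    x])

Test: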

>>> arr
array([[  6.,  12.,   1.],
       [  7.,   9.,   1.],
       [  8.,   7.,   1.],
       [  4.,   3.,   2.],
       [  6.,   1.,   2.],
       [  2.,   5.,   2.],
       [  9.,   4.,   3.],
       [  2.,   1.,   4.],
       [  8.,   4.,   4.],
       [  3.,   5.,   4.]])
>>> results=[]
>>> for x in sorted(np.unique(arr[...,2])):
...     results.append([np.average(arr[np.where(arr[...,2]==x)][...,0]), 
...                     np.average(arr[np.where(arr[...,2]==x)][...,1]),
...                     x])
... 
>>> results
[[7.0, 9.3333333333333339, 1.0], [4.0, 3.0, 2.0], [9.0, 4.0, 3.0], [4.333333333333333, 3.3333333333333335, 4.0]]

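The array arr does not need to be sorted. Strictly speaking, arr[np.where(...)] uses advanced indexing and therefore copies the selected rows; only the column slice that follows ([...,0] or [...,1]) is a view of that copy. The copies are small, and the averages are computed directly from them.

Alternatively, for a pure numpy solution: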
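The two points in the paragraph above can be checked with a short sketch. The helper name `group_mean_loop` and the fixed shuffle seed are illustrative assumptions, not part of the original answer; the helper uses a boolean mask, which selects the same rows as the `np.where` form.

```python
import numpy as np

arr = np.array([[6.0, 12.0, 1.0], [7.0, 9.0, 1.0], [8.0, 7.0, 1.0],
                [4.0, 3.0, 2.0], [6.0, 1.0, 2.0], [2.0, 5.0, 2.0],
                [9.0, 4.0, 3.0],
                [2.0, 1.0, 4.0], [8.0, 4.0, 4.0], [3.0, 5.0, 4.0]])

def group_mean_loop(a, key_col=2):
    """Hypothetical helper: mean of every column, per distinct value in key_col."""
    rows = []
    for key in np.unique(a[:, key_col]):      # np.unique returns sorted keys
        mask = a[:, key_col] == key           # boolean mask of the group's rows
        rows.append(a[mask].mean(axis=0))     # the key column averages to the key itself
    return np.array(rows)

# The row order does not matter for this approach.
rng = np.random.default_rng(0)
shuffled = arr[rng.permutation(len(arr))]
means_sorted = group_mean_loop(arr)
means_shuffled = group_mean_loop(shuffled)

# Inspect memory sharing of the intermediate arrays.
selected = arr[np.where(arr[..., 2] == 1.0)]
selection_shares_memory = np.shares_memory(selected, arr)           # advanced indexing
column_is_view = np.shares_memory(selected[..., 0], selected)       # basic slicing

print(means_sorted)
print(selection_shares_memory, column_is_view)
```

Calling `np.asarray` on the result is unnecessary here because the helper already returns an array.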

groups = arr[:,2].copy()

_ndx = np.argsort(groups)
_id, _pos, grp_count  = np.unique(groups[_ndx], 
                return_index=True, 
                return_counts=True)

grp_sum = np.add.reduceat(arr[_ndx], _pos, axis=0)
grp_mean = grp_sum / grp_count[:,None]  

>>> grp_mean
array([[7.        , 9.33333333, 1.        ],
       [4.        , 3.        , 2.        ],
       [9.        , 4.        , 3.        ],
       [4.33333333, 3.33333333, 4.        ]])
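To unpack the `reduceat` step: `np.add.reduceat(a, starts, axis=0)` sums the rows from each start index up to the next one, and the last segment runs to the end. The sketch below first shows this on a tiny 1-D array, then wraps the same steps in a function and applies it to rows in scrambled order. The function name `group_mean_reduceat`, the fixed permutation, and the choice of `kind="stable"` are assumptions made for illustration.

```python
import numpy as np

# Segments are [1, 2], [3] and [4, 5].
segment_sums = np.add.reduceat(np.array([1, 2, 3, 4, 5]), [0, 2, 3])

def group_mean_reduceat(a, key_col=2):
    """Hypothetical wrapper around the argsort / unique / reduceat steps."""
    order = np.argsort(a[:, key_col], kind="stable")   # bring equal keys together
    sorted_rows = a[order]
    _, starts, counts = np.unique(sorted_rows[:, key_col],
                                  return_index=True,
                                  return_counts=True)
    sums = np.add.reduceat(sorted_rows, starts, axis=0)  # per-group column sums
    return sums / counts[:, None]                        # broadcast counts over columns

arr = np.array([[6.0, 12.0, 1.0], [7.0, 9.0, 1.0], [8.0, 7.0, 1.0],
                [4.0, 3.0, 2.0], [6.0, 1.0, 2.0], [2.0, 5.0, 2.0],
                [9.0, 4.0, 3.0],
                [2.0, 1.0, 4.0], [8.0, 4.0, 4.0], [3.0, 5.0, 4.0]])

# Same rows in an arbitrary fixed order, to show that sorting is handled inside.
shuffled = arr[[7, 0, 4, 9, 2, 6, 1, 5, 8, 3]]
means = group_mean_reduceat(shuffled)
print(segment_sums)
print(means)
```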

- dawg

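I like it, very clean. How would I store the results in a numpy.array? - Algorithm

I found the simplest way is just to convert the results afterwards to whatever type is needed. So I have results = np.asarray(results), and the output is perfect. - Algorithm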

3

Solution

from itertools import groupby
from operator import itemgetter

arr = [[6.0, 12.0, 1.0],
       [7.0, 9.0, 1.0],
       [8.0, 7.0, 1.0],
       [4.0, 3.0, 2.0],
       [6.0, 1.0, 2.0],
       [2.0, 5.0, 2.0],
       [9.0, 4.0, 3.0],
       [2.0, 1.0, 4.0],
       [8.0, 4.0, 4.0],
       [3.0, 5.0, 4.0]]

result = []

for groupByID, rows in groupby(arr, key=itemgetter(2)):
    position1, position2, counter = 0, 0, 0
    for row in rows:
        position1+=row[0]
        position2+=row[1]
        counter+=1
    result.append([position1/counter, position2/counter, groupByID])

print(result)

This will output:

[[7.0, 9.333333333333334, 1.0], [4.0, 3.0, 2.0], [9.0, 4.0, 3.0], [4.333333333333333, 3.3333333333333335, 4.0]]
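One caveat worth illustrating: `itertools.groupby` only merges consecutive items with equal keys, so the loop above relies on the rows already being ordered by the third column. A minimal sketch for unordered rows sorts by the key first; the function name `group_mean_rows` and the scrambled `rows` list are assumptions for this example.

```python
from itertools import groupby
from operator import itemgetter

# Same ten rows as in the question, in scrambled order.
rows = [[2.0, 1.0, 4.0], [6.0, 12.0, 1.0], [4.0, 3.0, 2.0], [9.0, 4.0, 3.0],
        [7.0, 9.0, 1.0], [8.0, 4.0, 4.0], [6.0, 1.0, 2.0], [8.0, 7.0, 1.0],
        [3.0, 5.0, 4.0], [2.0, 5.0, 2.0]]

# Without sorting, every run of equal keys becomes its own group.
unsorted_runs = sum(1 for _ in groupby(rows, key=itemgetter(2)))

def group_mean_rows(rows, key_index=2):
    """Hypothetical helper: sort by the key, then average the first two columns per group."""
    key = itemgetter(key_index)
    result = []
    for group_id, group in groupby(sorted(rows, key=key), key=key):
        members = list(group)          # groupby yields a one-shot iterator
        n = len(members)
        result.append([sum(r[0] for r in members) / n,
                       sum(r[1] for r in members) / n,
                       group_id])
    return result

result = group_mean_rows(rows)
print(unsorted_runs)
print(result)
```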

- DmitrySemenov

3

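import numpy as np

arr = np.array(
[[  6.0,   12.0,   1.0],
 [  7.0,   9.0,   1.0],
 [  8.0,   7.0,   1.0],
 [  4.0,   3.0,   2.0],
 [  6.0,   1.0,   2.0],
 [  2.0,   5.0,   2.0],
 [  9.0,   4.0,   3.0],
 [  2.0,   1.0,   4.0],
 [  8.0,   4.0,   4.0],
 [  3.0,   5.0,   4.0]])
# Split wherever the third column changes, then average each piece.
# Rows of the same group must be adjacent; flatnonzero gives 1-D split positions.
np.array([a.mean(0) for a in np.split(arr, np.flatnonzero(np.diff(arr[:, 2])) + 1)])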

- HYRY

- Eelco Hoogendoorn · Accepted Answer

A concise solution is to use numpy_indexed (disclaimer: I am its author), which implements a fully vectorized solution:

import numpy_indexed as npi
npi.group_by(arr[:, 2]).mean(arr)
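For cases where installing numpy_indexed is not an option, a comparable grouped mean can be sketched in plain NumPy. This uses `np.unique` with `return_inverse=True` together with `np.bincount`; it is a different technique picked here for illustration, not a description of how numpy_indexed works internally.

```python
import numpy as np

arr = np.array([[6.0, 12.0, 1.0], [7.0, 9.0, 1.0], [8.0, 7.0, 1.0],
                [4.0, 3.0, 2.0], [6.0, 1.0, 2.0], [2.0, 5.0, 2.0],
                [9.0, 4.0, 3.0],
                [2.0, 1.0, 4.0], [8.0, 4.0, 4.0], [3.0, 5.0, 4.0]])

# inverse[i] is the group number (0 .. n_groups-1) of row i; rows need not be sorted.
keys, inverse = np.unique(arr[:, 2], return_inverse=True)
counts = np.bincount(inverse)                      # rows per group

# One weighted bincount per column yields the per-group column sums.
sums = np.column_stack([np.bincount(inverse, weights=arr[:, j])
                        for j in range(arr.shape[1])])
means = sums / counts[:, None]
print(means)
```

The loop runs over columns rather than rows, so its cost stays small when the matrix is tall and narrow, as in the question.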