Sum reduction of a NumPy array

Question

3

I have a numpy array with three columns of the following form:

x1 y1 f1
x2 y2 f2
...
xn yn fn

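The (x,y) pairs may repeat. I need another array in which each (x,y) pair appears only once, and the corresponding third column is the sum of all the f values that appear next to that (x,y) pair.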

For example, given the array

would give

The order of the rows doesn't matter. What is the fastest way to do this in Python?

Thank you!

- Botond

1

Take a look at ufunc.at: http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.at.html - hpaulj

I would use a Pandas DataFrame. - reptilicus

@hpaulj, this works nicely! - Botond

4 Answers

1

Thanks to @hpaulj, I finally found the simplest solution. If d contains the three columns of data:

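import numpy as np
# d is assumed to have shape (3, n): its rows are the x, y and f columns
ind = d[0:2].astype(int)
x = np.zeros(shape=(N, N))
np.add.at(x, tuple(ind), d[2])  # unbuffered add, so repeated (x, y) pairs accumulate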

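This solution assumes that the (x,y) indices in the first and second columns are integers smaller than N. That is what I need, and I should have mentioned it in the post.

Edit: Note that the above solution produces a sparse matrix holding the summed values at position (x,y) in the matrix.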
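As a rough sketch of how the (x, y, sum) rows might be read back from such a grid: the array `A`, the bound `N = 3` and the variable names below are illustrative assumptions, with one (x, y, f) record per row rather than the transposed layout used above.

```python
import numpy as np

# Illustrative data, one (x, y, f) record per row.
A = np.array([[1, 2, 4.0],
              [1, 1, 5.0],
              [1, 2, 3.0],
              [0, 1, 9.0]])
N = 3  # assumed bound: every x and y index is smaller than N

# Accumulate f into an N x N grid; np.add.at handles repeated indices.
ix = A[:, 0].astype(int)
iy = A[:, 1].astype(int)
grid = np.zeros((N, N))
np.add.at(grid, (ix, iy), A[:, 2])

# Read the occupied cells back as (x, y, sum) rows.
# Caveat: a pair whose f values sum to exactly 0 is dropped by nonzero().
rows, cols = np.nonzero(grid)
out = np.column_stack([rows, cols, grid[rows, cols]])
print(out)
```

The grid needs N*N cells regardless of how many distinct pairs exist, so this is only reasonable when N is small.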

- Botond

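It might be interesting to run some runtime tests, wouldn't it? - Divakar

Are you sure this works? I tried it with the input listed in the question and got some unexpected values. I assumed N = 3. - Divakar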

2

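This is far from the output asked for in the question. It generates a sparse matrix containing the sums at the x, y coordinates. - user688635

@Colt45, you are right. But recovering the desired output form from the sparse matrix is easy, although it may not be optimal for large N. I fully admit that my question was not clearly stated. - Botond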

0

This is certainly easy to do in plain Python:

arr = np.array([[1,2,4.0],
                [1,1,5.0],
                [1,2,3.0],
                [0,1,9.0]])
d={}                
for x, y, z in arr:
    d.setdefault((x,y), 0)
    d[x,y]+=z     

>>> d
{(1.0, 2.0): 7.0, (0.0, 1.0): 9.0, (1.0, 1.0): 5.0}

Then convert it back to numpy:

>>> np.array([[x,y,d[(x,y)]] for x,y in d.keys()]) 
array([[ 1.,  2.,  7.],
       [ 0.,  1.,  9.],
       [ 1.,  1.,  5.]])

- dawg

Nice dictionary solution, but it involves a bunch of for loops. - Botond

0

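If you have scipy installed, you can use the sparse module to perform this kind of addition, for arrays whose 1st and 2nd columns are integers (i.e. indices).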

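import numpy as np
from scipy import sparse
# data is the f column; the row and column indices are the integer x and y columns
M = sparse.csr_matrix((d[:,2], (d[:,0].astype(int), d[:,1].astype(int))))
M = M.tocoo() # there may be a short cut to this csr coo round trip
x = np.column_stack([M.row, M.col, M.data]) # needs testing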

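To make it convenient to build certain linear algebra matrices, the csr sparse format sums the values that share duplicate indices. It is implemented in compiled code, so it should be fairly fast. But getting the data into M and back out again may slow it down. (P.S. I wrote this on a machine without scipy, so I have not tested it.)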
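One possible shortcut for the csr/coo round trip, sketched here under the assumption that scipy is available: build a `coo_matrix` directly and call its `sum_duplicates()` method, which merges entries sharing the same (row, col) in place. The array `A` and the variable names are illustrative, not taken from the answer.

```python
import numpy as np
from scipy import sparse

# Illustrative data, one (x, y, f) record per row.
A = np.array([[1, 2, 4.0],
              [1, 1, 5.0],
              [1, 2, 3.0],
              [0, 1, 9.0]])

rows = A[:, 0].astype(int)
cols = A[:, 1].astype(int)

# Build a COO matrix from (data, (row, col)); duplicates are kept at first.
M = sparse.coo_matrix((A[:, 2], (rows, cols)))
M.sum_duplicates()  # in place: merges entries that share (row, col)

# One (x, y, sum) row per distinct pair; the row order is not guaranteed here.
out = np.column_stack([M.row, M.col, M.data])
print(out)
```

Since the question states that row order does not matter, the result is left in whatever order scipy produces.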

- hpaulj

- Divakar · Accepted Answer

Here is one way to solve it -

import numpy as np

# Input array
A = np.array([[1,2,4.0],
             [1,1,5.0],
             [1,2,3.0],
             [0,1,9.0]])

# Extract xy columns            
xy = A[:,0:2]

# Perform lex sort and get the sorted indices and xy pairs
sorted_idx = np.lexsort(xy.T)
sorted_xy =  xy[sorted_idx,:]

# Differentiation along rows for sorted array
df1 = np.diff(sorted_xy,axis=0)
df2 = np.append([True],np.any(df1!=0,1),0)
# OR df2 = np.append([True],np.logical_or(df1[:,0]!=0,df1[:,1]!=0),0)
# OR df2 = np.append([True],np.dot(df1!=0,[True,True]),0)

# Get unique sorted labels
sorted_labels = df2.cumsum(0)-1

# Get labels
labels = np.zeros_like(sorted_idx)
labels[sorted_idx] = sorted_labels

# Get unique indices
unq_idx  = sorted_idx[df2]

# Get counts and unique rows and setup output array
counts = np.bincount(labels, weights=A[:,2])
unq_rows = xy[unq_idx,:]
out = np.append(unq_rows,counts.ravel()[:,None],1)

Input and output -

In [169]: A
Out[169]: 
array([[ 1.,  2.,  4.],
       [ 1.,  1.,  5.],
       [ 1.,  2.,  3.],
       [ 0.,  1.,  9.]])

In [170]: out
Out[170]: 
array([[ 0.,  1.,  9.],
       [ 1.,  1.,  5.],
       [ 1.,  2.,  7.]])
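On NumPy versions where `np.unique` accepts an `axis` argument, the lexsort, diff and label steps above could probably be condensed as in the sketch below. This is a swapped-in variant, not the code from the answer; the variable names are illustrative.

```python
import numpy as np

# Same input array as above.
A = np.array([[1, 2, 4.0],
              [1, 1, 5.0],
              [1, 2, 3.0],
              [0, 1, 9.0]])

# Unique (x, y) rows plus, for every input row, the index of its unique row.
unq_xy, inverse = np.unique(A[:, :2], axis=0, return_inverse=True)

# Sum the f column per unique row; ravel() guards against NumPy versions
# that return the inverse with an extra dimension.
sums = np.bincount(inverse.ravel(), weights=A[:, 2])

out = np.column_stack([unq_xy, sums])
print(out)
```

For this input, `out` holds the same three rows as the output shown above.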