Replacing NumPy array entries with their frequencies / values from a dictionary

Question

4

Problem: from two input arrays, I want to output an array with the frequency of True values (from input_2) corresponding to each value of input_1.

import numpy as np
# itemfreq is deprecated since SciPy 1.0; np.unique(..., return_counts=True) is the suggested replacement
from scipy.stats import itemfreq
input_1 = np.array([3,6,6,3,6,4])
input_2 = np.array([False, True, True, False, False, True])

The output for this example would be:

output_1 = np.array([0,2,2,0,2,1])

My current approach involves editing input_1 so that only the values corresponding to True remain:

locs=np.where(input_2==True,input_1,0)

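and then counting the frequency of each value, creating a dictionary, and replacing the matching keys in input_1 with the values (the True frequencies).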

loc_freq = itemfreq(locs)
dic = {}
for key,val in loc_freq:
    dic[key]=val
print(dic)
for k, v in dic.items():
    input_1[input_1==k]=v

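There are two problems with this: 1) it still does nothing with keys that are not in the dictionary (and should therefore be changed to 0). For example, how can I get the 3s turned into 0s? 2) It seems very inelegant/inefficient. Is there a better way to approach this?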

The output of this is [3,2,2,3,2,1].

- Janis Strods

3 Answers

3

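@memecs's solution is correct, +1. However, it will be very slow and take a lot of memory if the values in input_1 are really large, i.e. if they are not indices of an array but, say, seconds or some other integer data that can take very large values.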

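In that case, np.bincount(input_1[input_2]).size is one more than the largest integer in input_1 that has a True value in input_2, so the count array grows with the magnitude of the values rather than with the number of elements.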
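
To make the memory concern concrete, here is a small sketch with made-up timestamp-like values (the arrays big_input_1 and big_input_2 are illustrative and not from the question). It only shows how the length of the np.bincount result follows the largest selected value, not the number of elements.

```python
import numpy as np

# Hypothetical data: three integer values, two of them very large.
big_input_1 = np.array([1_000_000, 5, 1_000_000])
big_input_2 = np.array([True, False, True])

counts = np.bincount(big_input_1[big_input_2])
# bincount allocates one slot per integer from 0 up to the largest selected value.
print(counts.size)        # 1000001
print(counts[1_000_000])  # 2
```

Three input elements already produce a count array with about a million entries, which is the cost the unique-based approach avoids.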

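It is much faster to use unique and bincount. We use the first to extract the indices of the unique elements of input_1, and then use bincount to count how often these indices appear in that same array, weighting each one by 1 or 0 according to the value of input_2 (True or False):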

# extract unique elements and the indices to reconstruct the array
unq, idx = np.unique(input_1, return_inverse=True)
# calculate the weighted frequencies of these indices
freqs_idx = np.bincount(idx, weights=input_2)
# reconstruct the array of frequencies of the elements
frequencies = freqs_idx[idx]
print(frequencies)
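
For the sample arrays from the question, the intermediate values can be traced as in the sketch below. It repeats the three steps above and casts the result back to integers at the end; the cast is an addition here, since np.bincount returns floating-point counts when weights is given.

```python
import numpy as np

input_1 = np.array([3, 6, 6, 3, 6, 4])
input_2 = np.array([False, True, True, False, False, True])

unq, idx = np.unique(input_1, return_inverse=True)
# unq -> [3 4 6]; idx -> [0 2 2 0 2 1], the position of each element in unq
freqs_idx = np.bincount(idx, weights=input_2)
# freqs_idx -> [0. 1. 2.], the number of True entries for 3, 4 and 6
frequencies = freqs_idx[idx].astype(int)
print(frequencies)  # [0 2 2 0 2 1]
```

Because idx only ranges over the number of unique values, the size of freqs_idx does not depend on how large the values in input_1 are.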

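This solution is really fast and has a minimal memory impact. Credit goes to @Jaime, see his comment below. Below I report my original answer, which uses unique in a different manner.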

Another possibility

It may be faster to go for a different solution, using unique:

import numpy as np
input_1 = np.array([3, 6, 6, 3, 6, 4])
input_2 = np.array([False, True, True, False, False, True])

non_zero_hits, counts = np.unique(input_1[input_2], return_counts=True)
all_hits, idx = np.unique(input_1, return_inverse=True)
frequencies = np.zeros_like(all_hits)

#2nd step, with broadcasting
idx_non_zero_hits_in_all_hits = np.where(non_zero_hits[:, np.newaxis] - all_hits == 0)[1]
frequencies[idx_non_zero_hits_in_all_hits] = counts
print(frequencies[idx])

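The drawback of this approach is that it needs a lot of memory when input_1 has many unique elements and many of them have True values, because a 2D array is created and passed to where. To reduce the memory footprint, you can use a for loop for the second step of the algorithm instead: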

#2nd step, but with a for loop.
for j, val in enumerate(non_zero_hits):
    index = np.where(val == all_hits)[0]
    frequencies[index] = counts[j]
print(frequencies[idx])

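This second solution has a very small memory footprint, but requires a for loop. Which solution is best depends on the typical size and values of your input data.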
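
As a possible middle ground between the two variants, the lookup in the second step could also be done with np.searchsorted, which avoids both the 2D array and the Python loop. This is a sketch that is not part of the original answer; it relies on np.unique returning sorted arrays and on every value of non_zero_hits also being present in all_hits.

```python
import numpy as np

input_1 = np.array([3, 6, 6, 3, 6, 4])
input_2 = np.array([False, True, True, False, False, True])

non_zero_hits, counts = np.unique(input_1[input_2], return_counts=True)
all_hits, idx = np.unique(input_1, return_inverse=True)
frequencies = np.zeros_like(all_hits)

# 2nd step with a binary search: all_hits is sorted and contains every
# value of non_zero_hits, so searchsorted returns their exact positions.
positions = np.searchsorted(all_hits, non_zero_hits)
frequencies[positions] = counts
print(frequencies[idx])  # [0 2 2 0 2 1]
```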

- gg349

1

Good point! The memory footprint could indeed be large. Relabeling the data before using np.bincount, so that min=0 and max < len(input_1), is also an option (see skimage.segmentation.relabel_sequential(..)). - memecs

1

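Your concern is very legitimate, but the right way to go is not to build a 2D array but to use the weights argument of np.bincount: unq, idx = np.unique(array_1, return_inverse=True); freqs = np.bincount(idx, weights=array2)[idx] will give you a fast, compact and memory-efficient implementation. - Jaime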

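@Jaime, I was sure there had to be a better way to approach this, and I look forward to you posting a better answer :-) If you have time, please post it; otherwise I will post it myself later as a community answer. - gg349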

1

Feel free to use my comment and add it to your answer as you see fit: it is just a variation on the same theme. - Jaime

0

The currently accepted bincount solution is quite elegant, but the numpy_indexed package provides more general solutions to problems of this kind:

import numpy_indexed as npi
idx = npi.as_index(input_1)
unique_labels, true_count_per_label = npi.group_by(idx).sum(input_2)
print(true_count_per_label[idx.inverse])

- Eelco Hoogendoorn


- memecs · Accepted Answer

np.bincount is what you are looking for.

output_1 = np.bincount(input_1[input_2])[input_1]
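
One caveat worth noting: the one-liner indexes the count array with every value of input_1, so it can raise an IndexError when the largest value in input_1 has no True in input_2. A hedged variant that uses the minlength argument of np.bincount could look like the sketch below (the arrays a and mask are made up for illustration).

```python
import numpy as np

# Hypothetical data: the largest value (7) is never flagged True.
a = np.array([3, 6, 6, 3, 6, 7])
mask = np.array([False, True, True, False, False, False])

# minlength guarantees a count slot for every value that occurs in a.
counts = np.bincount(a[mask], minlength=a.max() + 1)
output = counts[a]
print(output)  # [0 2 2 0 2 0]
```

The memory caveat about very large values still applies, since the count array has a.max() + 1 entries.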