Find the most frequent number in a NumPy array

Question

183

Suppose I have the following NumPy array:

a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])

How can I find the most frequent number in this array?

- JustInTime

For Python lists, see Find the most common element in a list and Find the item with maximum occurrences in a list. - Georgy

14 Answers

137

You can use

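import numpy as np

values, counts = np.unique(a, return_counts=True)

ind = np.argmax(counts)
print(values[ind])  # prints the most frequent element

# argpartition raises ValueError if kth is out of bounds, so cap k at the number of unique values
k = min(10, len(counts))
ind = np.argpartition(-counts, kth=k - 1)[:k]
print(values[ind])  # prints the (up to) 10 most frequent elements, in no particular order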

If several elements are tied for most frequent, this code returns only the first of them.

- Apogentus

9

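I think this is the most helpful answer, because it is general and concise, and it allows elements to be pulled from either values or counts by some derived index. - ryanjdillon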

7

If there are several most frequent values, values[counts.argmax()] returns the first one. To get all of them, we can use values[counts == counts.max()]. - W. Zhu
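The tie handling described in the comment above can be sketched as follows. The array here is a made-up example containing a tie, not the one from the question, and the names first_mode and all_modes are only illustrative.

```python
import numpy as np

# Hypothetical input in which 1 and 2 both occur three times
a = np.array([1, 2, 3, 1, 2, 1, 2])

values, counts = np.unique(a, return_counts=True)

# argmax stops at the first maximum, so only one of the tied values is returned
first_mode = values[counts.argmax()]

# A boolean mask keeps every value whose count equals the maximum count
all_modes = values[counts == counts.max()]

print(first_mode)  # 1
print(all_modes)   # [1 2]
```

Because np.unique returns its values sorted, the "first" tied value is the smallest one.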

53

If you're willing to use SciPy:

>>> from scipy.stats import mode
>>> mode([1,2,3,1,2,1,1,1,3,2,2,1])
(array([ 1.]), array([ 6.]))
>>> most_frequent = mode([1,2,3,1,2,1,1,1,3,2,2,1])[0][0]
>>> most_frequent
1.0

- Fred Foo

36

Performance tests of some of the solutions found here, using IPython:

>>> # small array
>>> a = [12,3,65,33,12,3,123,888000]
>>> 
>>> import collections
>>> collections.Counter(a).most_common()[0][0]
3
>>> %timeit collections.Counter(a).most_common()[0][0]
100000 loops, best of 3: 11.3 µs per loop
>>> 
>>> import numpy
>>> numpy.bincount(a).argmax()
3
>>> %timeit numpy.bincount(a).argmax()
100 loops, best of 3: 2.84 ms per loop
>>> 
>>> import scipy.stats
>>> scipy.stats.mode(a)[0][0]
3.0
>>> %timeit scipy.stats.mode(a)[0][0]
10000 loops, best of 3: 172 µs per loop
>>> 
>>> from collections import defaultdict
>>> def jjc(l):
...     d = defaultdict(int)
...     for i in l:  # iterate over the argument rather than the global a
...         d[i] += 1
...     return sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
... 
>>> jjc(a)[0]
3
>>> %timeit jjc(a)[0]
100000 loops, best of 3: 5.58 µs per loop
>>> 
>>> max(map(lambda val: (a.count(val), val), set(a)))[1]
12
>>> %timeit max(map(lambda val: (a.count(val), val), set(a)))[1]
100000 loops, best of 3: 4.11 µs per loop
>>>

For small arrays like the one in the question, using "max" with "set" gives the best performance.

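According to @David Sanders, if you increase the array size to something like 100,000 elements, the "max w/set" algorithm ends up being the worst by far, whereas the "numpy bincount" method is the best.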

- iuridiniz

1

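@IuliusCurt To point out the best approach, we would need to test it against multiple cases: small arrays, large arrays, random arrays, real-world arrays (as timsort does for sorting), ... But I agree with you. - iuridiniz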

4

Your test uses only a small array, which does not do a good job of distinguishing between the different algorithms. - David Sanders

12

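If you increase the size of the test list to 100000 (a = (np.random.rand(100000) * 1000).round().astype('int'); a_list = list(a)), your "max w/set" algorithm ends up being the worst by far, whereas the "numpy bincount" method is the best. I ran this test using a_list for the native Python code and a for the NumPy code, to avoid marshalling costs skewing the results. - David Sanders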
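A rough sketch of the larger benchmark described in the comment above, using timeit instead of IPython's %timeit. The helper names (mode_bincount, mode_counter, mode_max_set) are hypothetical, and a.tolist() is used here instead of list(a) so the pure-Python candidates see plain Python ints; that is an assumption, not what the comment did. No timings are quoted because they depend on the machine.

```python
import timeit
from collections import Counter

import numpy as np

# Test data as in the comment: 100000 integers in the range 0..1000
a = (np.random.rand(100000) * 1000).round().astype('int')
a_list = a.tolist()  # plain Python ints for the pure-Python candidates


def mode_bincount(arr):
    # Counts occurrences of each non-negative integer, then takes the largest bin
    return int(np.bincount(arr).argmax())


def mode_counter(lst):
    # Single pass over the list to build the counts
    return Counter(lst).most_common(1)[0][0]


def mode_max_set(lst):
    # Rescans the whole list once per unique value
    return max(set(lst), key=lst.count)


for name, func, data in [
    ("numpy bincount", mode_bincount, a),
    ("collections.Counter", mode_counter, a_list),
    ("max w/ set", mode_max_set, a_list),
]:
    seconds = timeit.timeit(lambda: func(data), number=1)
    print(f"{name}: {seconds:.4f} s")
```

When several values are tied, the three functions may return different (equally frequent) values, so compare counts rather than the values themselves when checking that they agree.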

9

Starting in Python 3.4, the standard library includes the statistics.mode function, which returns the single most common data point.

from statistics import mode

mode([1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1])
# 1

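If there are multiple modes with the same frequency, statistics.mode returns the first one encountered (as of Python 3.8; earlier versions raise StatisticsError in that case).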

Starting in Python 3.8, the statistics.multimode function returns a list of the most frequently occurring values, in the order they were first encountered:

from statistics import multimode

multimode([1, 2, 3, 1, 2])
# [1, 2]

- Xavier Guihot

5

If you want to get the most frequent value (positive or negative) without loading any modules, you can use the following code:

lVals = [1,2,3,1,2,1,1,1,3,2,2,1]
print(max(map(lambda val: (lVals.count(val), val), set(lVals))))  # prints (6, 1), i.e. (count, value)

- Artsiom Rudzenka

2

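This is from a while ago, but for posterity: this is equivalent to the easier-to-read max(set(lVals), key=lVals.count), which does an O(n) count for each unique element of lVals, for approximately O(n^2) overall (assuming O(n) unique elements). Using collections.Counter(lVals).most_common(1)[0][0] from the standard library, as suggested by JoshAdel, is only O(n). - Danica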
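The two approaches compared in the comment above can be put side by side as a small sketch. The variable names mode_quadratic and mode_linear are only illustrative.

```python
from collections import Counter

lVals = [1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1]

# lVals.count rescans the whole list for every unique element,
# which is roughly O(n^2) when most elements are distinct
mode_quadratic = max(set(lVals), key=lVals.count)

# Counter builds all counts in a single pass over the list
mode_linear = Counter(lVals).most_common(1)[0][0]

print(mode_quadratic, mode_linear)  # 1 1
```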

4

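While most of the answers above are useful, in case you: 1) need to support non-positive-integer values (e.g. floats or negative integers ;-)), 2) aren't on Python 2.7 (which collections.Counter requires), and 3) prefer not to add a dependency on scipy (or even numpy) to your code, then a pure Python 2.6 solution that is O(n log n) (i.e., efficient) is just this: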

from collections import defaultdict

a = [1,2,3,1,2,1,1,1,3,2,2,1]

d = defaultdict(int)
for i in a:
  d[i] += 1
most_frequent = sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]

- JJC

4

In Python 3, the following should work:

max(set(a), key=lambda x: a.count(x))

- Yury Kliachko

2

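I like the solution proposed by JoshAdel.

But there is one small catch.

The np.bincount() solution only works on numbers.

If you have strings, the collections.Counter solution will work for you.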
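As a minimal sketch of the point above, here is collections.Counter applied to a made-up list of strings. The np.unique variant shown alongside it is an addition for comparison, not something this answer proposes.

```python
from collections import Counter

import numpy as np

# Hypothetical string data; np.bincount would reject this input
words = ["apple", "pear", "apple", "fig", "pear", "apple"]

# Counter works on any hashable items, including strings
most_common_word, occurrences = Counter(words).most_common(1)[0]
print(most_common_word, occurrences)  # apple 3

# np.unique also handles string arrays, since it only needs sortable values
values, counts = np.unique(np.array(words), return_counts=True)
print(values[counts.argmax()])  # apple
```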

- Vikas

2

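Here is a general solution that can be applied along an axis, regardless of the values, using only numpy. I've also found that it is much faster than scipy.stats.mode if there are many unique values.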

import numpy

def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except:
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

    # If array is 1-D and numpy version is >= 1.9 numpy.unique will suffice
    # (versions are compared as tuples so that e.g. 2.0 is not rejected for its minor number)
    if all([ndim == 1,
            tuple(int(part) for part in numpy.__version__.split('.')[:2]) >= (1, 9)]):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]

    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts (a multi-dimensional index must be a tuple, not a list)
    counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1

    # Find maximum counts and return modals/counts
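    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    # Convert explicitly: ogrid may return a tuple, and indexing requires a tuple
    index = list(numpy.ogrid[tuple(slices)])
    index.insert(axis, numpy.argmax(counts, axis=axis))
    return sort[tuple(index)], counts[tuple(index)]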

- Devin Cairns


- JoshAdel · Accepted Answer

If your list contains only non-negative integers, you should take a look at numpy.bincount:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html

Then probably use np.argmax:

import numpy as np

a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
counts = np.bincount(a)
print(np.argmax(counts))

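For a more complicated list (one that may contain negative numbers or non-integer values), you can use np.histogram in a similar way. Alternatively, if you just want to work in Python without NumPy, collections.Counter is a good way of handling this sort of data.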

from collections import Counter
a = [1,2,3,1,2,1,1,1,3,2,2,1]
b = Counter(a)
print(b.most_common(1))
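For integer data that includes negative values, one possible workaround is to shift the values before calling np.bincount. This is a sketch under that assumption, not the np.histogram approach mentioned above, and the array and the offset variable are made up for illustration.

```python
import numpy as np

# Hypothetical integer data containing negative values
a = np.array([-3, -1, -3, 2, 0, -3, 2])

# np.bincount only accepts non-negative integers, so shift the data
# so that the smallest value maps to index 0
offset = a.min()
counts = np.bincount(a - offset)

# Undo the shift to recover the original value
most_frequent = counts.argmax() + offset
print(most_frequent)  # -3
```

The counts array has one slot per integer between the minimum and maximum, so this only makes sense when that range is reasonably small. For floats or widely spread values, np.unique with return_counts=True avoids the problem.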