Finding the indices of repeated elements in an array (Python, NumPy)

Question

9

Suppose I have a NumPy array of integers, for example:

[34,2,3,22,22,22,22,22,22,18,90,5,-55,-19,22,6,6,6,6,6,6,6,6,23,53,1,5,-42,82]

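I want to find the start and end indices of any value that is repeated consecutively more than x times (say 5). In the example above, those values are 22 and 6. The run of 22s starts at index 3 and ends at index 8, and the same goes for the run of 6s. Is there a special tool in Python that helps with this? Otherwise I would have to loop over every index in the array and compare each value with the previous one.

Thanks.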

- mcatis

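@Evan: I don't think that applies: mode works on any values in the array, not necessarily consecutive ones. - Prune

This question is relevant to any sequence container in Python, not just NumPy arrays. - Liam Bohl

6 Answers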

2

Here is a solution using itertools from Python's standard library.

Code:

import itertools as it


def find_ranges(lst, n=2):
    """Return ranges for `n` or more repeated values."""
    groups = ((k, tuple(g)) for k, g in it.groupby(enumerate(lst), lambda x: x[-1]))
    repeated = (idx_g for k, idx_g in groups if len(idx_g) >=n)
    return ((sub[0][0], sub[-1][0]) for sub in repeated)

lst = [34,2,3,22,22,22,22,22,22,18,90,5,-55,-19,22,6,6,6,6,6,6,6,6,23,53,1,5,-42,82]    
list(find_ranges(lst, 5))
# [(3, 8), (15, 22)]

Tests

def test_ranges(f):
    """Verify list results identifying ranges."""
    # Plain asserts stand in for nose.tools.eq_, since nose is no longer maintained
    assert list(f([])) == []
    assert list(f([0, 1,1,1,1,1,1, 2], 5)) == [(1, 6)]
    assert list(f([1,1,1,1,1,1, 2,2, 1, 3, 1,1,1,1,1,1], 5)) == [(0, 5), (10, 15)]
    assert list(f([1,1, 2, 1,1,1,1, 2, 1,1,1], 3)) == [(3, 6), (8, 10)]
    assert list(f([1,1,1,1, 2, 1,1,1, 2, 1,1,1,1], 3)) == [(0, 3), (5, 7), (9, 12)]

test_ranges(find_ranges)

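This example takes the (index, element) pairs from lst and groups consecutive pairs by element. Only groups with at least n pairs are kept. Finally, the first and last pair of each remaining group give its (start, end) indices.

See also this post for a way to find index ranges with itertools.groupby.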

- pylang

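Hi, could you explain in more detail what lambda x: x[-1] does? Thanks. - undefined

enumerate(lst) passes (index, element_from_lst) pairs to groupby(). lambda x: x[-1] tells groupby(): "group by the element in each pair". It is equivalent to a regular function such as def use_elem(x): return x[-1]. - undefined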
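To make the grouping step concrete, here is a small sketch of what groupby() yields for a short list. The list `sample` is made up for illustration, and operator.itemgetter(-1) is used as a stand-in for the lambda; neither comes from the answer itself.

```python
import itertools as it
from operator import itemgetter

sample = [7, 7, 7, 4, 7]  # hypothetical input, not from the question

# Each group is one run of consecutive (index, element) pairs
# that share the same element; itemgetter(-1) picks that element.
groups = [(key, list(pairs))
          for key, pairs in it.groupby(enumerate(sample), key=itemgetter(-1))]

for key, pairs in groups:
    print(key, pairs)
# 7 [(0, 7), (1, 7), (2, 7)]
# 4 [(3, 4)]
# 7 [(4, 7)]
```

Note that the trailing 7 forms its own group: groupby() only merges neighbours, which is what the question needs.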

1

There really is no good shortcut for this. You could try something like the following:

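mult = 5
found_at = None
for i, elem in enumerate(val_list):  # assumes val_list is a plain Python list
    target = [elem] * mult
    # list.index() cannot search for a sub-list, so compare a slice instead
    if val_list[i:i + mult] == target:
        found_at = i
        break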

I leave handling of the not-found case and detection of longer runs to you.

- Prune

0

If you want to find value repeated n times in a list L, you can do something like this:

def find_repeat(value, n, L):
    look_for = [value for _ in range(n)]
    for i in range(len(L)):
        if L[i] == value and L[i:i+n] == look_for:
            return i, i+n

- KAL

0

Here is a relatively fast, error-free solution that also tells you how many copies are in the run. Some of the code is borrowed from KAL's solution.

# Return the start and (1-past-the-end) indices of the first instance of
# at least min_count copies of element value in container l
def find_repeat(value, min_count, l):
  for i in range(len(l)):
    count = 0
    # The bounds check keeps a run at the end of l from raising IndexError
    while i + count < len(l) and l[i + count] == value:
      count += 1
    if count >= min_count:
      return i, i + count

- Liam Bohl

0

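I had a similar need. This is what I came up with using a comprehension:

import numpy as np

A=[34,2,3,22,22,22,22,22,22,18,90,5,-55,-19,22,6,6,6,6,6,6,6,6,23,53,1,5,-42,82]

Find the unique values and return their indices

_, ind = np.unique(A,return_index=True)

np.unique sorts the array, so sort the indices to get them back in original order

ind = np.sort(ind)

ind holds the index of the first element of each repeated group, which shows up as a gap between consecutive indices. Their diff gives the number of elements in each group. Filtering with np.diff(ind)>5 yields a boolean array that is True at the position of each group's start index. The entry of ind that follows each True, minus one, is the end index of that group.

Create a dict with the repeated element as key and a tuple of the group's start and end indices as value.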

rep_groups = dict((A[ind[i]], (ind[i], ind[i+1]-1)) for i,v in enumerate(np.diff(ind)>5) if v)

- Richard Macwan
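One caveat with the approach above: np.unique reports only the first occurrence of each value, so a value that reappears later (for example [1,2,2,2,2,2,2,1,3], where the second 1 is not a first occurrence) stretches the apparent group, and a run at the very end of the array has no following entry in ind. As a hedged alternative, the sketch below marks a run boundary wherever two neighbours differ. The function name run_bounds and its min_len parameter are illustrative assumptions, not part of the answer.

```python
import numpy as np

def run_bounds(values, min_len):
    """Return (value, start, end) for each run of at least min_len equal items.

    `end` is inclusive. Name and signature are illustrative assumptions.
    """
    arr = np.asarray(values)
    if arr.size == 0:
        return []
    # A new run begins at index 0 and wherever a value differs from its predecessor.
    starts = np.concatenate(([0], np.flatnonzero(arr[1:] != arr[:-1]) + 1))
    # Each run ends just before the next one starts; the last ends at size - 1.
    ends = np.concatenate((starts[1:] - 1, [arr.size - 1]))
    keep = (ends - starts + 1) >= min_len
    return [(arr[s].item(), int(s), int(e))
            for s, e in zip(starts[keep], ends[keep])]

A = [34,2,3,22,22,22,22,22,22,18,90,5,-55,-19,22,6,6,6,6,6,6,6,6,23,53,1,5,-42,82]
print(run_bounds(A, 5))
# [(22, 3, 8), (6, 15, 22)]
```

Returning a list of tuples rather than a dict keeps both runs when the same value forms more than one long run.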

- EFT · Accepted Answer

Using np.diff together with the method given here by @WarrenWeckesser for finding runs of consecutive zeros in an array:

import numpy as np

def zero_runs(a):  # from link
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges

a = [34,2,3,22,22,22,22,22,22,18,90,5,-55,-19,22,6,6,6,6,6,6,6,6,23,53,1,5,-42,82]

zero_runs(np.diff(a))
Out[87]: 
array([[ 3,  8],
       [15, 22]], dtype=int32)

The runs can then be filtered by the difference between each run's start and end:

runs = zero_runs(np.diff(a))

runs[runs[:, 1]-runs[:, 0]>5]  # runs of 7 or more, to illustrate filter
Out[96]: array([[15, 22]], dtype=int32)
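Why this works: a run of k equal values in a turns into k-1 consecutive zeros in np.diff(a), so locating zero runs in the differences locates the repeats. The sketch below traces the intermediate arrays on a small made-up input; the names small and min_len are assumptions for illustration, and the steps are inlined rather than calling zero_runs so the block runs on its own.

```python
import numpy as np

small = np.array([5, 9, 9, 9, 4])  # hypothetical input for illustration

d = np.diff(small)                 # [ 4  0  0 -5]: zeros mark equal neighbours
iszero = np.concatenate(([0], (d == 0).astype(np.int8), [0]))  # [0 0 1 1 0 0]
edges = np.flatnonzero(np.abs(np.diff(iszero)) == 1)           # [1 3]
runs = edges.reshape(-1, 2)        # [[1 3]]: start and inclusive end in `small`

min_len = 3                        # assumed threshold: at least 3 equal values
long_runs = runs[runs[:, 1] - runs[:, 0] + 1 >= min_len]
print(long_runs.tolist())
# [[1, 3]]
```

The end of a zero run in the differences is exclusive, but because d[i] == 0 means small[i] == small[i + 1], that same number is the inclusive end index in the original array.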