在两个二维数组中查找匹配行的索引。

Question

在两个二维数组中查找匹配行的索引。

12

假设我有两个2-D数组如下：

array([[3, 3, 1, 0],
       [2, 3, 1, 3],
       [0, 2, 3, 1],
       [1, 0, 2, 3],
       [3, 1, 0, 2]], dtype=int8)

array([[0, 3, 3, 1],
       [0, 2, 3, 1],
       [1, 0, 2, 3],
       [3, 1, 0, 2],
       [3, 3, 1, 0]], dtype=int8)

每个数组中的一些行与另一个数组中相应的行按值（但不一定是按索引）匹配，而另一些则不匹配。

我希望找到一种高效的方法来返回两个数组中对应匹配行的索引对。如果它们是元组，我将期望返回

(0,4)
(2,1)
(3,2)
(4,3)

- llevar

3个回答

6

我想不到一种numpy特定的方法来做这件事，但以下是我使用普通列表的做法:

>>> L1= [[3, 3, 1, 0],
...        [2, 3, 1, 3],
...        [0, 2, 3, 1],
...        [1, 0, 2, 3],
...        [3, 1, 0, 2]]
>>> L2 = [[0, 3, 3, 1],
...        [0, 2, 3, 1],
...        [1, 0, 2, 3],
...        [3, 1, 0, 2],
...        [3, 3, 1, 0]]
>>> L1 = {tuple(row):i for i,row in enumerate(L1)}
>>> answer = []
>>> for i,row in enumerate(L2):
...   if tuple(row) in L1:
...     answer.append((L1[tuple(row)], i))
... 
>>> answer
[(2, 1), (3, 2), (4, 3), (0, 4)]

- inspectorG4dget

O(n)! 很好。但是没有 numpy 的方法吗？ - slider

@slider：我想不出一种使用Numpy的方法来完成这件事，主要是因为我没有经常使用Numpy（它已经在我的待办清单上待了太长时间，让我感到有些自豪）。 - inspectorG4dget

这能推广到 L2 只有一行的情况，并且我们想要获取L1中匹配行的“行索引”，而L1中的行不一定是唯一的吗？ - sodiumnitrate

6

您可以使用void数据类型技巧在两个数组的行上使用1D函数。a_view和b_view是1D向量，每个条目代表一个完整的行。然后，我选择对一个数组进行排序，并使用np.searchsorted在其中查找另一个数组的项目。如果我们要排序的数组长度为m，另一个数组的长度为n，则排序需要时间为m * log(m)，np.searchsorted执行的二进制搜索需要时间为n * log(m)，总共需要时间为(n + m) * log(m)。因此，您需要对两个数组中最短的那个进行排序：

def find_rows(a, b):
    dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))

    a_view = np.ascontiguousarray(a).view(dt).ravel()
    b_view = np.ascontiguousarray(b).view(dt).ravel()

    sort_b = np.argsort(b_view)
    where_in_b = np.searchsorted(b_view, a_view,
                                 sorter=sort_b)
    where_in_b = np.take(sort_b, where_in_b)
    which_in_a = np.take(b_view, where_in_b) == a_view
    where_in_b = where_in_b[which_in_a]
    which_in_a = np.nonzero(which_in_a)[0]
    return np.column_stack((which_in_a, where_in_b))

使用a和b作为示例数组:

In [14]: find_rows(a, b)
Out[14]: 
array([[0, 4],
       [2, 1],
       [3, 2],
       [4, 3]], dtype=int64)

In [15]: %timeit find_rows(a, b)
10000 loops, best of 3: 29.7 us per loop

在我的系统上，针对您的测试数据，使用字典方法的速度约为22微秒，但是对于1000x4的数组，这种numpy方法比纯Python方法快了约6倍（483微秒对2.54毫秒）。

- Jaime

这太棒了。我花了整整一个小时才明白你在做什么。虽然有一个小错误，因为searchsorted有可能返回应该将项目插入到末尾的结果，这会导致索引超出范围的错误。 - Dalupus

只需将数组a的最后一行更改为[3,3,3,3]，您就会得到“IndexError：index 5 is out of bounds for size 5”。 - Dalupus

2

这真的加快了我的代码速度，非常感谢。对于10^4或更多行，仅在行上使用dict()是无法满足需求的。 - Kurt

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- hpaulj · Accepted Answer

这是一个全为numpy的解决方案 - 并不一定比迭代式的Python更好。它仍然必须查看所有组合。

In [53]: np.array(np.all((x[:,None,:]==y[None,:,:]),axis=-1).nonzero()).T.tolist()
Out[53]: [[0, 4], [2, 1], [3, 2], [4, 3]]

中间数组是(5,5,4)。使用np.all将其简化为：

array([[False, False, False, False,  True],
       [False, False, False, False, False],
       [False,  True, False, False, False],
       [False, False,  True, False, False],
       [False, False, False,  True, False]], dtype=bool)

剩下的工作就是提取这个条件为“True”的索引。

简单测试结果显示，本方法耗时47.8微秒；而使用“L1”字典的另一个答案则为38.3微秒；第三种方法采用双重循环，耗时496微秒。