Python: fastest way to compare array elements

Question

5

I am looking for the fastest way in Python to output the index of the first difference between two arrays. For example, take the following arrays:

test1 = [1, 3, 5, 8]
test2 = [1]
test3 = [1, 3]

Comparing test1 and test2 should output 1, while comparing test1 and test3 should output 2.

In other words, I am looking for an equivalent of the following statement:

import numpy as np
np.where(np.where(test1 == test2, test1, 0) == '0')[0][0]

that also works when the arrays have different lengths.

Thanks a lot for your help.

- Andy

No, the elements are not sorted. So far I have tried various numpy statements. - Andy

Are they actually numpy arrays or lists? - Padraic Cunningham

Originally they are lists. - Andy

7 Answers

4

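With NumPy arrays (which will be faster for large arrays) you can check the lengths of the lists and then compare the overlapping part, something like the following (obviously slicing the longer one down to the shorter length):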

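import numpy as np

# test1 and test2 must already be NumPy arrays (e.g. np.array(list1))
n = min(len(test1), len(test2))
x = np.where(test1[:n] != test2[:n])[0]  # mismatch positions within the overlap
if len(x) > 0:
  ans = x[0]
elif len(test1) != len(test2):
  ans = n     # overlap matches; the difference starts where the shorter array ends
else:
  ans = None  # arrays are identical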
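
As a rough sketch, the snippet above can be wrapped in a function that also accepts plain lists. The name `first_diff_numpy` and the `np.asarray` conversion are assumptions added for illustration; they are not part of the original answer.

```python
import numpy as np

def first_diff_numpy(seq1, seq2):
    """Return the first index where seq1 and seq2 differ, or None if they are identical."""
    # np.asarray leaves arrays untouched and converts lists; with plain lists,
    # `!=` would compare the lists as a whole and yield a single bool.
    a1, a2 = np.asarray(seq1), np.asarray(seq2)
    n = min(len(a1), len(a2))
    x = np.where(a1[:n] != a2[:n])[0]  # mismatch positions inside the overlap
    if len(x) > 0:
        return int(x[0])
    if len(a1) != len(a2):
        return n  # overlap matches, so the difference starts where the shorter input ends
    return None

test1 = [1, 3, 5, 8]
test2 = [1]
test3 = [1, 3]
print(first_diff_numpy(test1, test2))  # 1
print(first_diff_numpy(test1, test3))  # 2
```

Note that the conversion itself costs time, which matters for the timings discussed in this answer.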

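Edit - although this answer was downvoted, I will leave it here in case someone else needs to do something similar.

If the starting arrays are large and already NumPy arrays, this is the fastest method. Also, I had to modify Andy's code to get it to work. In order: 1. my suggestion, 2. Padraic's (now deleted, but the most elegant), 3. Andy's accepted answer, 4. zip, non-NumPy, 5. vanilla Python without zip, as per @leekaiinthesky.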

0.1ms，9.6ms，0.6ms，2.8ms，2.3ms

If the conversion to ndarray is included in the timeit run, the non-NumPy, no-zip method is fastest:

7.1ms，17.1ms，7.7ms，2.8ms，2.3ms

and even more so if the difference between the two lists is at an index of around 1,000 rather than 10,000:

7.1ms，17.1ms，7.7ms，0.3ms，0.2ms

import timeit

setup = """
import numpy as np
from itertools import zip_longest
list1 = [1 for i in range(10000)] + [4, 5, 7]
list2 = [1 for i in range(10000)] + [4, 4]
test1 = np.array(list1)
test2 = np.array(list2)

def find_first_diff(l1, l2):
    for index, (x, y) in enumerate(zip_longest(l1, l2, fillvalue=object())):
        if x != y:
            return index

def findFirstDifference(list1, list2):
  minLength = min(len(list1), len(list2))
  for index in range(minLength):
    if list1[index] != list2[index]:
      return index
  return minLength
"""

fn = ["""
n = min(len(test1), len(test2))
x = np.where(test1[:n] != test2[:n])[0]
if len(x) > 0:
  ans = x[0]
elif len(test1) != len(test2):
  ans = n
else:
  ans = None""",
"""
x = np.where(np.in1d(list1, list2) == False)[0]
if len(x) > 0:
  ans = x[0]
else:
  ans = None""",
"""
x = test1
y = np.resize(test2, x.shape)
x = np.where(np.where(x == y, x, 0) == 0)[0]
if len(x) > 0:
  ans = x[0]
else:
  ans = None""",
"""
ans = find_first_diff(list1, list2)""",
"""
ans = findFirstDifference(list1, list2)"""]

for f in fn:
  print(timeit.timeit(f, setup, number = 1000))

- paddyg

Thanks! +1 for the timing comparison. - leekaiinthesky

1

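The fastest algorithm compares the elements one by one until it reaches the first difference. Iterating over the two lists pairwise like this gives you exactly that: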

def findFirstDifference(list1, list2):
  minLength = min(len(list1), len(list2))
  for index in range(minLength):  # use xrange on Python 2
    if list1[index] != list2[index]:
      return index
  return minLength # the two lists agree where they both have values, so return the next index

This outputs the result you want:

print(findFirstDifference(test1, test3))
> 2

- leekaiinthesky

This does not always work if the first list is shorter than the second. - L3viathan

1

Here is one way to do it:

try:
    from itertools import izip as zip  # Python 2: lazy zip
except ImportError:
    pass  # Python 3: the built-in zip is already lazy

def compare_lists(lista, listb):
    """
    Compare two lists and return the first index where they differ. If
    no difference is found, return the length of the shorter list.
    """
    for position, (a, b) in enumerate(zip(lista, listb)):
        if a != b:
            return position
    return min(len(lista), len(listb))

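The algorithm is simple: zip the two lists (or use izip, the more efficient variant on Python 2), then compare them element by element.
The enumerate function supplies the index position, which we return as soon as a difference is found.
If we leave the for loop without returning, one of two things has happened:
1. The two lists are identical. In this case, we want to return the length of either list.
2. The lists have different lengths and are equal up to the length of the shorter one. In this case, we want to return the length of the shorter list.
In both cases, the min(...) expression gives us what we want.
This function has a bug: if you compare two empty lists, it returns 0, which seems wrong. I leave fixing that as an exercise for you.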
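
One possible way to approach that exercise, as a sketch only: return None when the lists are identical (which includes two empty lists) and keep returning the shorter length otherwise. The name `compare_lists_strict` and the choice of `None` as the "no difference" value are assumptions, not part of the original answer.

```python
def compare_lists_strict(lista, listb):
    """Return the first index where the lists differ, or None if they are identical."""
    for position, (a, b) in enumerate(zip(lista, listb)):
        if a != b:
            return position
    if len(lista) == len(listb):
        return None  # no difference at all, which covers two empty lists
    return min(len(lista), len(listb))  # one list is a prefix of the other

print(compare_lists_strict([], []))             # None
print(compare_lists_strict([1, 3, 5, 8], [1]))  # 1
```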

- Hai Vu

0

Thanks for all the suggestions. I just found a simpler solution:

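import numpy as np

x = np.array(test1)
y = np.resize(np.array(test2), x.shape)  # repeats test2 until it has test1's shape
np.where(np.where(x == y, x, 0) == 0)[0][0]  # compare against the integer 0, not the string '0'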

- Andy

Try your method with test1 = [3, 4, 3, 5] and test2 = [3, 4]. Also check test1 = [3, 4, 3, 4] and test2 = [3, 4]. - Mike Müller
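
To see what this comment is pointing at, here is a small sketch that runs the `np.resize` approach on those two inputs. The helper name `resize_first_diff` is hypothetical, and the comparison uses the integer `0` rather than the string `'0'`.

```python
import numpy as np

def resize_first_diff(test1, test2):
    x = np.array(test1)
    y = np.resize(np.array(test2), x.shape)  # repeats test2 to fill the longer shape
    return np.where(np.where(x == y, x, 0) == 0)[0]  # all indices flagged as different

# np.resize turns [3, 4] into [3, 4, 3, 4], so index 2 looks equal.
print(resize_first_diff([3, 4, 3, 5], [3, 4]))  # [3], although the lists first differ at index 2
# Here the repeated copy matches everywhere, so nothing is flagged
# and taking [0] of the result would raise IndexError.
print(resize_first_diff([3, 4, 3, 4], [3, 4]))  # []
```

The approach would also flag any position where both arrays hold the value 0, since 0 doubles as the "different" marker.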

0

Here is a not very Pythonic attempt that does not use NumPy:

b = list(zip(test1, test2))  # list() so the pairs can be sliced on Python 3
c = 0
while b:
    if b[0][0] != b[0][1]:  # check the current pair before dropping it
        break
    b = b[1:]
    c = c + 1
print(c)

- Paulo Mendes

0

For Python 3.x:

def first_diff_index(ls1, ls2):
    l = min(len(ls1), len(ls2))
    return next((i for i in range(l) if ls1[i] != ls2[i]), l)

(For Python 2.7, replace range with xrange.)
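
For illustration, the function relies on `next(iterator, default)`, which returns the first item the generator yields, or the default when it yields nothing. A small usage sketch, with the function repeated so the block is self-contained:

```python
def first_diff_index(ls1, ls2):
    l = min(len(ls1), len(ls2))
    # The generator yields mismatching indices lazily; next() takes the first
    # one, or falls back to l when the overlap is identical.
    return next((i for i in range(l) if ls1[i] != ls2[i]), l)

print(first_diff_index([1, 3, 5, 8], [1]))        # 1
print(first_diff_index([1, 3, 5, 8], [1, 3]))     # 2
print(first_diff_index([1, 3, 5, 8], [1, 4, 5]))  # 1
print(first_diff_index([1, 3], [1, 3]))           # 2
```

The last call shows that identical lists return their common length, so the caller cannot tell "identical" apart from "one list is a prefix of the other" without also comparing lengths.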

- pjm


- Mike Müller · Accepted Answer

For lists, this works:

from itertools import zip_longest

def find_first_diff(list1, list2):
    for index, (x, y) in enumerate(zip_longest(list1, list2, 
                                               fillvalue=object())):
        if x != y:
            return index

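zip_longest pads the shorter list with None or with a fill value you provide. The standard zip stops at the end of the shorter list, so it cannot detect a difference caused by unequal lengths; zip_longest handles that case.

In Python 2, use izip_longest.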
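
A short sketch of the difference between the two functions, using the lists from the question:

```python
from itertools import zip_longest

test1 = [1, 3, 5, 8]
test2 = [1]

# zip stops at the end of the shorter list, so the extra elements are never seen.
print(list(zip(test1, test2)))          # [(1, 1)]
# zip_longest keeps going and pads the shorter list with the fill value (None by default).
print(list(zip_longest(test1, test2)))  # [(1, 1), (3, None), (5, None), (8, None)]
```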

Update: to avoid problems when None is itself a list value, the fill value is a unique object created with object().

>>> o1 = object()
>>> o2 = object()
>>> o1 == o2
False
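
As a sketch of the problem the unique fill value avoids: if a list really contains None, padding with None makes the extra element indistinguishable from the padding. The helper `first_diff` with an explicit `fillvalue` parameter is hypothetical and exists only to compare the two choices.

```python
from itertools import zip_longest

def first_diff(list1, list2, fillvalue):
    for index, (x, y) in enumerate(zip_longest(list1, list2, fillvalue=fillvalue)):
        if x != y:
            return index
    return None  # no difference found

a = [1, None]
b = [1]
print(first_diff(a, b, fillvalue=None))      # None: the padding hides the extra element
print(first_diff(a, b, fillvalue=object()))  # 1
```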

This pure-Python approach may be faster than the NumPy solutions. It depends on the actual data and other circumstances.

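Converting the lists to NumPy arrays also takes time, possibly longer than finding the index with the function above. If you do not use the NumPy arrays for other computations, the conversion can add considerable overhead.
NumPy always searches the whole array. If the difference comes early, you do more work than necessary.
NumPy creates a bunch of intermediate arrays, which costs memory and time.
NumPy needs to construct intermediate arrays of the maximum length, which is a disadvantage when comparing many small arrays with one very large array.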
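
The early-exit point can be illustrated by counting how many pairs a plain loop actually compares. This is only a sketch: `first_diff_counting` is a hypothetical helper that scans the overlap of the two lists and ignores length differences. An elementwise array comparison such as `a != b` evaluates every position regardless of where the first difference is.

```python
def first_diff_counting(list1, list2):
    """Return (index of first difference or None, number of pairs compared)."""
    compared = 0
    for index, (x, y) in enumerate(zip(list1, list2)):
        compared += 1
        if x != y:
            return index, compared
    return None, compared

early = [1, 1, 9] + [1] * 10000  # differs from base at index 2
late = [1] * 10002 + [9]         # differs from base at the last index
base = [1] * 10003

print(first_diff_counting(early, base))  # (2, 3)
print(first_diff_counting(late, base))   # (10002, 10003)
```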

In general, NumPy is faster than pure-Python solutions in many cases. But each case is different, and there are situations where pure Python is faster.