二进制numpy数组之间快速汉明距离计算

Question

二进制numpy数组之间快速汉明距离计算

12

我有两个numpy数组，它们长度相同且包含二进制值。

import numpy as np
a=np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b=np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

我希望能够尽可能快地计算它们之间的汉明距离，因为我需要进行数百万次距离计算。

一个简单但速度较慢的选项如下（摘自维基百科）：

我希望尽可能快速地计算它们之间的汉明距离，因为我有数百万个这样的距离计算任务。

一种简单但慢的选择是以下方法（来自维基百科）：

%timeit sum(ch1 != ch2 for ch1, ch2 in zip(a, b))
10000 loops, best of 3: 79 us per loop

我受到stack overflow上一些答案的启发，想出了更快的选择。

%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 6.94 us per loop

%timeit len(np.bitwise_xor(a,b).nonzero()[0])
100000 loops, best of 3: 2.43 us per loop

我在想是否有更快的方法来计算这个值，可能可以使用Cython来实现？

- benbo

示例数组 a 和 b 的长度是否与您的真实数据长度相同？ - Warren Weckesser

2

你是在计算一个数组中的所有成对距离，还是在两个数组之间计算距离？你可以尝试使用 scipy.spatial.distance.cdist 或 scipy.spatial.distance.pdist。 - user2034412

@WarrenWeckesser 是的，它们是相同的顺序。根据一些参数设置，它们的长度将在20到100之间。 - benbo

scipy/spatial/distance.py文件中的hamming(u, v)函数：...return (u != v).mean()。另请参见bitarray。 - denis

5个回答

8

在我的平台上，对于a!=b的np.count_nonzero(a!=b)，它需要1.07微秒，而将每个数组转换为mpz(多精度整数)后，gmpy2.hamdist可以将其减少到约143纳秒:

import numpy as np
from gmpy2 import mpz, hamdist, pack

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

根据@casevh的提示，使用gmpy2.pack(list(reversed(list(array))),1)可以相对高效地将一维的0和1数组转换为gmpy2 mpz对象。

# gmpy2.pack reverses bit order but that does not affect
# hamdist since both its arguments are reversed
ampz = pack(list(a),1) # takes about 4.29µs
bmpz = pack(list(b),1)

hamdist(ampz,bmpz)
Out[8]: 7

%timeit hamdist(ampz,bmpz)
10000000 loops, best of 3: 143 ns per loop

相对比较而言，在我的平台上：

%timeit np.count_nonzero(a!=b)
1000000 loops, best of 3: 1.07 µs per loop

%timeit len((a != b).nonzero()[0])
1000000 loops, best of 3: 1.55 µs per loop

%timeit len(np.bitwise_xor(a,b).nonzero()[0])
1000000 loops, best of 3: 1.7 µs per loop

%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 5.8 µs per loop

- user4322779

3

公平起见，您可能需要考虑将输入数组转换为mpz格式所需的时间。 - Warren Weckesser

3

你可以使用gmpy2.pack(list(a),1)将numpy数组转换为mpz类型。这比使用convert2mpz()更快。如果你考虑转换时间，它仍然比numpy解决方案慢。 - casevh

1

如果您想构建与原始代码相同的mpz，则确实需要使用reversed()。但是，汉明距离不取决于位的顺序（即从高到低还是从低到高）。只要两个数组具有相同的长度，以便将相同的位位置相互比较，汉明距离就会相同。 - casevh

1

有人知道自此发布以来是否有任何更改吗？当我尝试在此帖子中复制并粘贴a和b的导入和定义后使用pack时，我会收到以下错误： TypeError: pack()要求列表元素为正整数<2 ^ n位 - Daniel Crane

1

对我来说，将 list(a) 更改为 a.tolist() 有效。 - Carlo Bono

显示剩余4条评论

6

使用pythran可以在这里带来额外的好处:

$ cat hamm.py
#pythran export hamm(int[], int[])
from numpy import nonzero
def hamm(a,b):
    return len(nonzero(a != b)[0])

作为参考（不使用pythran）：

$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
100000 loops, best of 3: 4.66 usec per loop

在使用Pythran编译后：

$ python -m pythran.run hamm.py
$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
1000000 loops, best of 3: 0.745 usec per loop

这大致是numpy实现的6倍速度提升，因为pythran在评估逐元素比较时跳过了中间数组的创建。

我还进行了测量：

def hamm(a,b):
    return count_nonzero(a != b)

对于Python版本，我得到了每次循环需要3.11微秒，而使用Pythran时仅需要0.427微秒。

声明：我是Pythran的开发者之一。

- serge-sans-paille

0

我建议您使用np.packbits将numpy位数组转换为numpy uint8数组。

可以查看scipy的spatial.distance.hamming： https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html

否则，这里有一个小片段，只需要numpy的灵感，受到Fast way of counting non-zero bits in positive integer的启发：

bit_counts = np.array([int(bin(x).count("1")) for x in range(256)]).astype(np.uint8)
def hamming_dist(a,b,axis=None):
    return np.sum(bit_counts[np.bitwise_xor(a,b)],axis=axis)

使用axis=-1，可以计算一个条目与一个大数组之间的汉明距离；例如：

inp = np.uint8(np.random.random((512,8))*255) #512 entries of 8 byte
hd = hamming_dist(inp, inp[123], axis=-1) #results in 512 hamming distances to entry 123
idx_best = np.argmin(hd)    # should point to identity 123
hd[123] = 255 #mask out identity
idx_nearest= np.argmin(hd)    # should point entry in list with shortest distance to entry 123
dist_hist = np.bincount(np.uint8(hd)) # distribution of hamming distances; for me this started at 18bits to 44bits with a maximum at 31

- Oliver Zendel

0

对于字符串而言，它的工作速度更快。

def Hamm(a, b):
    c = 0
    for i in range(a.shape[0]):
        if a[i] != b[i]:
            c += 1
    return c

- Micra

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- yevgeniy · Accepted Answer

有一个现成的NumPy函数可以代替len((a != b).nonzero()[0]) ;)

np.count_nonzero(a!=b)