Generate unique values based on rows in a numpy array

Question


python arrays numpy dictionary hash

3

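I have a 3D numpy array, arr, with shape m*n*k.

For every set of values along the m axis (e.g. arr[:, 0, 0]) I want to generate a single value to represent that set, so that I end up with a 2D matrix of shape n*k. If a set of values along the m axis is repeated, the same value should be generated each time.

In other words, this is a hashing problem.

I created a solution using a dictionary, but it drastically reduces performance. For each set of values, I call this function: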

 def getCellId(self, valueSet):

     # Turn the set of values (a numpy vector) to a tuple so it can be hashed
     key = tuple(valueSet)

     # Try and simply return an existing ID for this key
     try:
       return self.attributeDict[key]
     except KeyError:

       # If the key was new (and didn't exist), try and generate a new Id by adding one to the max of all current Ids. This will fail the very first time we do this (as there will be no Ids yet), so in that case, just assign the value 1 to the new Id
       try:
         newId = max(self.attributeDict.values()) +1
       except ValueError:
         newId = 1
       self.attributeDict[key] = newId
       return newId

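The array is typically 30*256*256, so each set of values has 30 elements. I need to process hundreds of these arrays at a time. Currently, for a block of 100 arrays, all the processing up to the point of calculating the hash takes 1.3 s. Including the hashing, that rises to 75 s.

Is there a faster way to generate the single representative value?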

- jramm

1

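Does the representative value have to look nice, or can it be "anything"? - plonser

@divakar, yes, always. - jramm

I am wondering whether there is a solution based on numpy.cross? That could give very good performance. - jramm

@jramm "cross" between what? Not sure how "cross" would help you. - Divakar

Will the contents of a 3D array change, and if so, does the solution need to generate a new key? - deinonychusaur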


3 Answers

1

This could be one approach using basic numpy functions -

import numpy as np

# Random input for demo
arr = np.random.randint(0,3,[2,5,4])

# Get dimensions for later usage
m,n,k = arr.shape

# Reshape arr to a 2D array that has each slice arr[:, n, k] in each row
arr2d = np.transpose(arr,(1,2,0)).reshape([-1,m])

# Perform lexsort & get corresponding indices and sorted array 
sorted_idx = np.lexsort(arr2d.T)
sorted_arr2d =  arr2d[sorted_idx,:]

# Differentiation along rows for sorted array
df1 = np.diff(sorted_arr2d,axis=0)

# Look for changes along df1 that represent new labels to be put there
df2 = np.append([False],np.any(df1!=0,1),0)

# Get unique labels
labels = df2.cumsum(0)

# Store those unique labels in a n x k shaped 2D array
pos_labels = np.zeros_like(labels)
pos_labels[sorted_idx] = labels
out = pos_labels.reshape([n,k])

Sample run -

In [216]: arr
Out[216]: 
array([[[2, 1, 2, 1],
        [1, 0, 2, 1],
        [2, 0, 1, 1],
        [0, 0, 1, 1],
        [1, 0, 0, 2]],

       [[2, 1, 2, 2],
        [0, 0, 2, 1],
        [2, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 1, 0]]])

In [217]: out
Out[217]: 
array([[6, 4, 6, 5],
       [1, 0, 6, 4],
       [6, 3, 1, 1],
       [3, 0, 4, 1],
       [1, 3, 3, 2]], dtype=int32)
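For comparison, the same grouping can be sketched with np.unique, which accepts an axis argument and can return the inverse indices. This is an alternative under stated assumptions, not the method used above: label_slices is a hypothetical helper name, and because np.unique orders rows differently from the lexsort call, the label numbers need not match the sample run. Only the grouping (equal slices get equal labels) is the same.

```python
import numpy as np

def label_slices(arr):
    """Return an (n, k) array of integer labels, one per slice arr[:, i, j].

    Hypothetical helper for illustration; identical slices share a label.
    """
    m, n, k = arr.shape
    # One row per slice arr[:, i, j], in row-major (i, j) order
    arr2d = arr.reshape(m, -1).T
    # inverse[r] is the position of row r among the sorted unique rows
    _, inverse = np.unique(arr2d, axis=0, return_inverse=True)
    # reshape works whether the inverse comes back 1-D or with an extra axis
    return inverse.reshape(n, k)

# Seeded random input, same shape as the demo above
rng = np.random.default_rng(0)
arr = rng.integers(0, 3, size=(2, 5, 4))
out = label_slices(arr)
```

The labels are consecutive integers starting at 0, so they are "nice looking" values rather than raw hashes, and there is no possibility of a collision because whole rows are compared.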

- Divakar

0

If it is just about hashing, try this.

import numpy as np

# create random data
a = np.random.randint(10,size=(5,3,3))

# create some identical 0-axis data
a[:,0,0] = np.arange(5)
a[:,0,1] = np.arange(5)

# create matrix with the hash values
h = np.apply_along_axis(lambda x: hash(tuple(x)),0,a)

h[0,0]==h[0,1]
# Output: True

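However, use this with care and test it against your own code first. ...All I can say is that it works for this simple example.

Also, two different sets of values may end up with the same hash value. That is always possible when using a hash function, but it is very unlikely.

Edit: for comparison with the other solutions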
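The collision caveat can be checked on real data before relying on the hashes. The following is a minimal sketch, assuming CPython's built-in hash; count_hash_collisions is a hypothetical helper name and is not part of the answer above.

```python
import numpy as np

def count_hash_collisions(a):
    """Count how many distinct slices a[:, i, j] are lost to hash collisions.

    Hypothetical helper: returns 0 when every distinct slice has its own hash.
    """
    m = a.shape[0]
    # One row per slice a[:, i, j]
    rows = a.reshape(m, -1).T
    # Distinct slices, compared by value
    distinct_slices = {tuple(r.tolist()) for r in rows}
    # Distinct hashes of those slices
    distinct_hashes = {hash(t) for t in distinct_slices}
    return len(distinct_slices) - len(distinct_hashes)

a = np.random.randint(10, size=(5, 3, 3))
collisions = count_hash_collisions(a)
```

A result of 0 means the hashes separate the slices in that particular array. Collisions are not purely theoretical: in CPython, hash(-1) equals hash(-2), so the one-element slices (-1,) and (-2,) hash to the same value.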

timeit(np.apply_along_axis(lambda x: hash(tuple(x)),0,a))
# output: 1 loops, best of 3: 677 ms per loop

- plonser

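Try my hashlib.md5 and tostring solution; you should be able to save some time. - deinonychusaur

1

@deinonychusaur: I completely agree that Python's built-in hash is slow ... but I did not want to steal ideas from the other solutions ;) ... Apart from that, I still wonder whether he wants "nice looking" integers in the matrix or just some "ugly" hash integers. - plonser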


- deinonychusaur · Accepted Answer

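Depending on how many new keys versus old keys need to be generated, it is hard to say what would be optimal. But following your logic, the following should be fairly fast: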

import collections
import hashlib

_key = 0

def _get_new_key():
    global _key
    _key += 1
    return _key

attributes = collections.defaultdict(_get_new_key)

def get_cell_id(series):
    # tobytes() is the current name for tostring(), which this answer originally used
    return attributes[hashlib.md5(series.tobytes()).digest()]
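To show how the lookup above could be applied to a whole m*n*k array, here is a minimal sketch. The names cell_id_matrix, ids and hash_name are assumptions made for illustration, itertools.count stands in for the global counter, and tobytes() is used as the non-deprecated spelling of tostring().

```python
import collections
import hashlib
import itertools

import numpy as np

def cell_id_matrix(arr, ids=None, hash_name="md5"):
    """Map every slice arr[:, i, j] to an integer id (hypothetical helper).

    Pass the returned ids mapping back in to keep ids stable across arrays.
    """
    if ids is None:
        # Each unseen digest gets the next integer: 1, 2, 3, ...
        ids = collections.defaultdict(itertools.count(1).__next__)
    m, n, k = arr.shape
    # One contiguous row per slice arr[:, i, j]
    rows = np.ascontiguousarray(arr.reshape(m, -1).T)
    out = np.empty(n * k, dtype=np.int64)
    for i, row in enumerate(rows):
        digest = hashlib.new(hash_name, row.tobytes()).digest()
        out[i] = ids[digest]
    return out.reshape(n, k), ids

# Two distinct slices, each appearing twice
arr = np.zeros((3, 2, 2), dtype=np.int64)
arr[:, 0, 1] = [1, 2, 3]
arr[:, 1, 1] = [1, 2, 3]
id_matrix, ids = cell_id_matrix(arr)
```

Because the digest is computed from raw bytes, arrays with equal values but different dtypes produce different ids, so it is worth keeping the dtype fixed across all arrays that share one ids mapping.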

Edit:

Following your question, I have now updated this to loop over all the data series, using strides:

In [99]: import numpy as np

In [100]: A = np.random.random((30, 256, 256))

In [101]: A_strided = np.lib.stride_tricks.as_strided(A, (A.shape[1] * A.shape[2], A.shape[0]), (A.itemsize, A.itemsize * A.shape[1] * A.shape[2]))

In [102]: %timeit tuple(get_cell_id(S) for S in A_strided)
10 loops, best of 3: 169 ms per loop

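The code above performs 256x256 lookups/assignments per run, each on a series of 30 elements. Of course, there is no guarantee that md5 hashes will not collide. If that is a concern, you can switch to another hash algorithm from the same library.

Edit 2:

Given that you seem to do most of the expensive operations along the first axis of the 3D array, I suggest reorganizing the array.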

In [254]: A2 = np.random.random((256, 256, 30))

In [255]: A2_strided = np.lib.stride_tricks.as_strided(A2, (A2.shape[0] * A2.shape[1], A2.shape[2]), (A2.itemsize * A2.shape[2], A2.itemsize))

In [256]: %timeit tuple(get_cell_id(S) for S in A2_strided)
10 loops, best of 3: 126 ms per loop

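Not having to jump long distances in memory gives a speed-up of about 25%.

Edit 3:

If there is no actual need to cache the hash into an int lookup, and you only need the hashes themselves, and the 3D array is of type int8, then with the A2 and A2_strided layout the time can be reduced further. Of that time, about 15 ms is the tuple loop.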

In [9]: from hashlib import md5

In [10]: %timeit tuple(md5(series.tobytes()).digest() for series in A2_strided)
10 loops, best of 3: 72.2 ms per loop
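The strided views used in this answer can also be obtained with plain reshape and transpose, which avoids passing hand-computed strides to as_strided. A minimal sketch on a small array follows; the variable names are illustrative, and the shape (3, 4, 5) is an arbitrary stand-in for (30, 256, 256).

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

A = np.arange(3 * 4 * 5, dtype=np.float64).reshape(3, 4, 5)
m, n, k = A.shape

# The view built as in the first edit: one row per series A[:, i, j]
strided = as_strided(A, (n * k, m), (A.itemsize, A.itemsize * n * k))

# The same view without explicit strides
plain = A.reshape(m, -1).T

# A copy with each series stored contiguously, comparable to the A2 layout
contiguous = np.ascontiguousarray(plain)
```

Here plain shares memory with A and has the same strides as the as_strided result, while contiguous pays for one copy up front so that each series occupies adjacent memory, which is the property Edit 2 relies on.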