这里是一种方法 -
def groupby_uniqueness_dict(a):
sidx = a.argsort()
b = a[sidx]
cut_idx = np.flatnonzero(b[1:] != b[:-1])+1
parts = np.split(sidx, cut_idx)
out = dict(zip(b[np.r_[0,cut_idx]], parts))
return out
更高效的方法是避免使用
np.split
-
def groupby_uniqueness_dict_v2(a):
sidx = a.argsort()
b = a[sidx]
cut_idx = np.flatnonzero(b[1:] != b[:-1])+1
idxs = np.r_[0,cut_idx, len(b)+1]
out = {b[i]:sidx[i:j] for i,j in zip(idxs[:-1], idxs[1:])}
return out
示例运行 -
In [161]: a
Out[161]: array([1, 6, 6, 4, 1, 1, 4])
In [162]: groupby_uniqueness_dict(a)
Out[162]: {1: array([0, 4, 5]), 4: array([3, 6]), 6: array([1, 2])}
运行时测试
其他方法 -
from collections import defaultdict
def defaultdict_app(a):
d = defaultdict(list)
for ix, val in enumerate(a):
d[val].append(ix)
return d
计时 -
案例#1:字典值作为数组
In [226]: a = np.random.randint(0,1000, 10000)
In [227]: %timeit defaultdict_app(a)
...: %timeit groupby_uniqueness_dict(a)
...: %timeit groupby_uniqueness_dict_v2(a)
100 loops, best of 3: 4.06 ms per loop
100 loops, best of 3: 3.06 ms per loop
100 loops, best of 3: 2.02 ms per loop
In [228]: a = np.random.randint(0,10000, 100000)
In [229]: %timeit defaultdict_app(a)
...: %timeit groupby_uniqueness_dict(a)
...: %timeit groupby_uniqueness_dict_v2(a)
10 loops, best of 3: 43.5 ms per loop
10 loops, best of 3: 29.1 ms per loop
100 loops, best of 3: 19.9 ms per loop
案例#2:字典值为列表
In [238]: a = np.random.randint(0,1000, 10000)
In [239]: %timeit defaultdict_app(a)
...: %timeit groupby_uniqueness_dict(a)
...: %timeit groupby_uniqueness_dict_v2(a)
100 loops, best of 3: 4.15 ms per loop
100 loops, best of 3: 4.5 ms per loop
100 loops, best of 3: 2.44 ms per loop
In [240]: a = np.random.randint(0,10000, 100000)
In [241]: %timeit defaultdict_app(a)
...: %timeit groupby_uniqueness_dict(a)
...: %timeit groupby_uniqueness_dict_v2(a)
10 loops, best of 3: 57.5 ms per loop
10 loops, best of 3: 54.6 ms per loop
10 loops, best of 3: 34 ms per loop