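Here are two NumPy approaches that use np.cumsum to create these ramp arrays. Both expect the IDs to be sorted, so that equal IDs sit next to each other -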
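import numpy as np

def id_ramp(a):
    # Start with all ones so that a plain cumsum would count 1, 2, 3, ...
    out = np.ones(a.size, dtype=int)
    # Positions where a new ID starts (position 0 always starts a group).
    idx = np.nonzero(np.append(True, a[1:] > a[:-1]))[0]
    # At each later group start, step back by the previous group's length.
    out[idx[1:]] = -idx[1:] + idx[:-1] + 1
    return out.cumsum()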
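def id_ramp2(a):
    out = np.ones(a.size, dtype=int)
    # Start positions of every group except the first.
    idx = np.nonzero(a[1:] > a[:-1])[0] + 1
    if idx.size == 0:
        # Only one distinct ID (or empty input): the ramp is simply 1..n.
        return out.cumsum()
    out[idx[0]] = -idx[0] + 1
    out[idx[1:]] = idx[:-1] - idx[1:] + 1
    return out.cumsum()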
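To make the cumsum trick concrete, here is a minimal sketch that walks through the same steps as id_ramp on a small hand-made array. The data and the variable names starts, steps and ramp are illustrative only and not part of the functions above -

```python
import numpy as np

a = np.array([1, 1, 1, 3, 3, 7])  # sorted IDs (made-up sample data)

# Indices where a new group begins; index 0 always begins a group.
starts = np.nonzero(np.append(True, a[1:] > a[:-1]))[0]   # [0, 3, 5]

# All ones: a plain cumsum of this would just count 1, 2, 3, ...
steps = np.ones(a.size, dtype=int)

# At each later group start, replace the 1 with (1 - previous group length),
# so the running sum falls back to 1 at that position.
steps[starts[1:]] = starts[:-1] - starts[1:] + 1          # [1, 1, 1, -2, 1, -1]

ramp = steps.cumsum()                                     # [1, 2, 3, 1, 2, 1]
print(ramp)
```

The only non-unit entries in steps are at the group boundaries, which is why a single cumsum is enough to restart the counter for every ID.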
Runtime test -
In [381]: a = np.sort(np.random.randint(1,100,(1000)))
In [382]: df = pd.DataFrame(a, columns=['ID'])
In [383]: %timeit df['SEQ'] = df.groupby('ID').cumcount()+1
100 loops, best of 3: 2.01 ms per loop
In [384]: %timeit df['SEQ'] = id_ramp(df.ID.values)
1000 loops, best of 3: 315 µs per loop
In [385]: %timeit df['SEQ'] = id_ramp2(df.ID.values)
1000 loops, best of 3: 304 µs per loop
If you are working with an ID column that is not always sorted, you would need to add some argsort there, like so -
a = df.ID.values
sidx = a.argsort(kind='mergesort')
df['SEQ'] = id_ramp2(a[sidx])[sidx.argsort()]
Let's look at a sample run to see how it works -
In [447]: df
Out[447]:
    ID
0    1
1    1
2    7
3    5
4    3
5    8
6    1
7    3
8    7
9    2
10   5
11   7
In [448]: a = df.ID.values
     ...: sidx = a.argsort(kind='mergesort')
     ...: df['SEQ'] = id_ramp2(a[sidx])[sidx.argsort()]
     ...:
In [449]: df
Out[449]:
    ID  SEQ
0    1    1
1    1    2
2    7    1
3    5    1
4    3    1
5    8    1
6    1    3
7    3    2
8    7    2
9    2    1
10   5    2
11   7    3
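To see what the two argsort calls contribute, the sketch below reproduces the sample IDs above step by step with plain NumPy. The helper name ramp_sorted and the variables inverse and seq_scatter are made up for this illustration; the helper follows the same idea as id_ramp -

```python
import numpy as np

def ramp_sorted(a):
    # Hypothetical helper for this sketch: same logic as id_ramp, sorted input.
    out = np.ones(a.size, dtype=int)
    idx = np.nonzero(np.append(True, a[1:] > a[:-1]))[0]
    out[idx[1:]] = idx[:-1] - idx[1:] + 1
    return out.cumsum()

a = np.array([1, 1, 7, 5, 3, 8, 1, 3, 7, 2, 5, 7])  # IDs from the sample

# A stable sort keeps equal IDs in their original relative order,
# so the counter within each ID follows the row order.
sidx = a.argsort(kind='mergesort')

# The argsort of a permutation is its inverse: it maps each original
# row to its position in the sorted array.
inverse = sidx.argsort()

# Build the ramp on the sorted IDs, then put it back in row order.
seq = ramp_sorted(a[sidx])[inverse]
print(seq)  # [1 2 1 1 1 1 3 2 2 1 2 3]

# Equivalent scatter form that avoids the second sort.
seq_scatter = np.empty_like(seq)
seq_scatter[sidx] = ramp_sorted(a[sidx])
```

The scatter form writes each ramp value straight to its original row, so it should give the same SEQ column while skipping the second argsort.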