NumPy equivalent of pandas nunique

Question

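Is there a function in NumPy similar to pandas' nunique that counts the unique values in each row? I tried np.unique with the return_counts parameter, but it does not seem to return what I want. For example: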

import numpy as np

a = np.array([[120.52971, 75.02052, 128.12627], [119.82573, 73.86636, 125.792],
       [119.16805, 73.89428, 125.38216],  [118.38071, 73.35443, 125.30198],
       [118.02871, 73.689514, 124.82088]])
uniqueColumns, occurCount = np.unique(a, axis=0, return_counts=True)  # axis=0, intended as row-wise

Result:

>>> occurCount
array([1, 1, 1, 1, 1], dtype=int64)

I expected all 3s rather than all 1s.

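A workaround is of course to convert to pandas and call nunique, but that is slow, and I would like to explore a pure NumPy implementation to speed things up. I am working with large DataFrames, so I am looking for speedups wherever I can find them. I am also open to other solutions that improve speed.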

- user1234440

That would be pd.DataFrame(a).nunique(axis=1). - user1234440

How do you expect to get 4 from 3 columns? - Mad Physicist

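@MadPhysicist I am doing a row-wise unique count. There are 5 rows, so I expect an array of length 5 in which the n-th element is the number of unique values in row n. Let me know if the question could be phrased better. Thanks. - user1234440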

np.unique(a, axis=0) gives you the unique rows, not the unique elements of each row. - scleronomic

Does this answer your question? Number of unique elements per row in a NumPy array - anishtain4

1 Answer

- Divakar · Accepted Answer

We can sort each row and then look at the differences between consecutive elements -

a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
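
To make the one-liner easier to follow, here is a minimal sketch that spells out its intermediate steps on a small array. The variable names (a_sorted, gaps, n_dups, n_unique) are illustrative and not part of the original answer.

```python
import numpy as np

a = np.array([[120.52971, 120.52971, 128.12627],
              [119.82573,  73.86636, 125.792  ],
              [118.38071, 118.38071, 118.38071]])

a_sorted = np.sort(a, axis=1)        # within each row, equal values become neighbours
gaps = np.diff(a_sorted, axis=1)     # shape (n_rows, n_cols - 1)
n_dups = (gaps == 0).sum(axis=1)     # a zero gap means the value repeats its left neighbour
n_unique = a.shape[1] - n_dups       # columns minus repeats = distinct values per row

print(n_unique)  # [2 3 1]
```

Each zero gap marks one repeated value, so subtracting the number of zero gaps from the column count leaves the number of distinct values in that row.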

For better performance, we can use slicing in place of np.diff -

a_s = np.sort(a,axis=1)
out = a.shape[1]-(a_s[:,:-1] == a_s[:,1:]).sum(1)
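
As a sanity check, the slicing variant can be wrapped in a function and compared with a plain Python loop that calls np.unique on each row. This is only a sketch: nunique_per_row is a hypothetical helper name, and the random integer test data is an assumption chosen so that duplicates are common.

```python
import numpy as np

def nunique_per_row(a):
    """Hypothetical helper: count the distinct values in each row of a 2-D array."""
    a_s = np.sort(a, axis=1)
    # Compare every sorted element with its right-hand neighbour.
    return a.shape[1] - (a_s[:, :-1] == a_s[:, 1:]).sum(axis=1)

rng = np.random.default_rng(0)
a = rng.integers(0, 4, size=(1000, 6))   # small value range, so rows contain repeats

# Slow reference: one np.unique call per row.
expected = np.array([len(np.unique(row)) for row in a])
assert np.array_equal(nunique_per_row(a), expected)
```

The loop is much slower on large arrays, but it is a convenient reference when checking the vectorised version.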

If you want to allow some tolerance when checking for uniqueness, we can use np.isclose -

a.shape[1]-(np.isclose(np.diff(np.sort(a,axis=1),axis=1),0)).sum(1)
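
The following sketch contrasts the exact and the tolerant check on a row where floating-point rounding produces two values that differ only in the last bit. The example data is an assumption; atol=1e-8 is np.isclose's default absolute tolerance, written out here only to make the threshold visible.

```python
import numpy as np

a = np.array([[0.1 + 0.2, 0.3, 1.0],    # 0.1 + 0.2 is 0.30000000000000004, not 0.3
              [1.0,       2.0, 3.0]])

gaps = np.diff(np.sort(a, axis=1), axis=1)

exact = a.shape[1] - (gaps == 0).sum(axis=1)
tolerant = a.shape[1] - np.isclose(gaps, 0, atol=1e-8).sum(axis=1)

print(exact)     # [3 3]
print(tolerant)  # [2 3]
```

Because only neighbouring values are compared, a run of values that are each within the tolerance of the next is counted as a single value, even if the first and last of the run are further apart than the tolerance.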

Sample run -

In [51]: import pandas as pd

In [48]: a
Out[48]: 
array([[120.52971 , 120.52971 , 128.12627 ],
       [119.82573 ,  73.86636 , 125.792   ],
       [119.16805 ,  73.89428 , 125.38216 ],
       [118.38071 , 118.38071 , 118.38071 ],
       [118.02871 ,  73.689514, 124.82088 ]])

In [49]: pd.DataFrame(a).nunique(axis=1).values
Out[49]: array([2, 3, 3, 1, 3])

In [50]: a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
Out[50]: array([2, 3, 3, 1, 3])
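
One possible difference from pandas is missing data: nunique drops NaN by default (dropna=True), whereas in the sort-based count every NaN is treated as its own distinct value, because NaN never compares equal to NaN. The sketch below is an assumption about how one might mimic the pandas default and is not part of the original answer; the variable names are illustrative.

```python
import numpy as np

a = np.array([[1.0, 1.0,    np.nan],
              [2.0, np.nan, np.nan],
              [3.0, 4.0,    5.0]])

a_s = np.sort(a, axis=1)                           # np.sort places NaNs at the end of each row
dups = (a_s[:, :-1] == a_s[:, 1:]).sum(axis=1)     # NaN == NaN is False, so NaNs never count as repeats

counts_nan_distinct = a.shape[1] - dups            # every NaN counted as a separate value
counts_dropna = counts_nan_distinct - np.isnan(a).sum(axis=1)   # ignore NaNs entirely

print(counts_nan_distinct)  # [2 3 3]
print(counts_dropna)        # [1 1 3]
```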

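Timings on a simple case with random numbers and at least 2 unique numbers per row (the last column is a copy of the first, so every row also contains a duplicate) -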

In [41]: np.random.seed(0)
    ...: a = np.random.rand(10000,5)
    ...: a[:,-1] = a[:,0]

In [42]: %timeit pd.DataFrame(a).nunique(axis=1).values
    ...: %timeit a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
1.31 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
758 µs ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [43]: %%timeit
    ...: a_s = np.sort(a,axis=1)
    ...: out = a.shape[1]-(a_s[:,:-1] == a_s[:,1:]).sum(1)
694 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)