Finding the indices of non-NaN values in a Pandas DataFrame

Question


4

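I have a very large dataset (roughly 200000x400), but I have filtered it down so that only a few hundred values remain and everything else is NaN. I would like to build a list of the indices of those remaining values, and I can't find a simple way to do it.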

    0     1     2
0   NaN   NaN   1.2
1   NaN   NaN   NaN   
2   NaN   1.1   NaN   
3   NaN   NaN   NaN
4   1.4   NaN   1.01

For example, I would like the list [(0, 2), (2, 1), (4, 0), (4, 2)].

- pbell

Please accept Nickil Maveli's answer as the correct one: it is faster and more idiomatic. - MaxU - stand with Ukraine

2 Answers

1

Assuming your column names are of int dtype:

In [73]: df
Out[73]:
     0    1     2
0  NaN  NaN  1.20
1  NaN  NaN   NaN
2  NaN  1.1   NaN
3  NaN  NaN   NaN
4  1.4  NaN  1.01

In [74]: df.columns.dtype
Out[74]: dtype('int64')

In [75]: df.stack().reset_index().drop(columns=0).apply(tuple, axis=1).tolist()
Out[75]: [(0, 2), (2, 1), (4, 0), (4, 2)]

If your column names are of object dtype:

In [81]: df.columns.dtype
Out[81]: dtype('O')

In [83]: df.stack().reset_index().astype(int).drop(columns=0).apply(tuple, axis=1).tolist()
Out[83]: [(0, 2), (2, 1), (4, 0), (4, 2)]
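
As a rough sketch of the same stack-based idea in a shorter form: once the NaNs are gone from the stacked Series, its MultiIndex already holds the (row, column) label pairs, so the reset_index/drop/apply chain is not strictly needed. The sample frame below is rebuilt from the question, the names `stacked` and `index_pairs` are illustrative only, and the explicit `dropna()` is an assumption added so the result does not depend on whether `stack()` drops NaNs in your pandas version.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (integer column labels).
df = pd.DataFrame({
    0: [np.nan, np.nan, np.nan, np.nan, 1.4],
    1: [np.nan, np.nan, 1.1, np.nan, np.nan],
    2: [1.2, np.nan, np.nan, np.nan, 1.01],
})

# Stack to a Series indexed by (row label, column label), then drop NaNs
# explicitly in case stack() kept them.
stacked = df.stack().dropna()

# The MultiIndex converts directly to a list of (row, column) tuples.
index_pairs = stacked.index.tolist()
print(index_pairs)
```

This returns index and column labels rather than integer positions, which is the same thing for the frame in the question.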

Timing on a 50K-row DataFrame:

In [89]: df = pd.concat([df] * 10**4, ignore_index=True)

In [90]: df.shape
Out[90]: (50000, 3)

In [91]: %timeit list(map(tuple, np.argwhere(~np.isnan(df.values))))
10 loops, best of 3: 144 ms per loop

In [92]: %timeit df.stack().reset_index().drop(columns=0).apply(tuple, axis=1).tolist()
1 loop, best of 3: 1.67 s per loop

Conclusion: Nickil Maveli's solution is about 12 times faster on this test DataFrame.

- MaxU - stand with Ukraine


- Nickil Maveli · Accepted Answer

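Convert the DataFrame to its equivalent NumPy array and test it for NaNs. Negate that boolean mask so that it marks the non-null entries, then pass it to numpy.argwhere to get their indices. Since the required output is a list of tuples, you can use map to apply tuple to each row of the resulting array.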

>>> list(map(tuple, np.argwhere(~np.isnan(df.values))))
[(0, 2), (2, 1), (4, 0), (4, 2)]
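
For reference, a self-contained sketch of the approach above, with the sample frame rebuilt from the question. Two things here are assumptions rather than part of the answer: `df.notna().to_numpy()` is swapped in for `~np.isnan(df.values)` because it also copes with non-float columns, and the `labels` step is added for frames whose index or column labels differ from their integer positions. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    0: [np.nan, np.nan, np.nan, np.nan, 1.4],
    1: [np.nan, np.nan, 1.1, np.nan, np.nan],
    2: [1.2, np.nan, np.nan, np.nan, 1.01],
})

# Boolean mask that is True wherever a value is present.
mask = df.notna().to_numpy()

# argwhere returns an (n, 2) array of [row, column] positions in row-major order.
positions = np.argwhere(mask)

# Convert to plain Python ints so the tuples print cleanly on any NumPy version.
pairs = [(int(r), int(c)) for r, c in positions]

# Optional: translate integer positions into index/column labels.
labels = [(df.index[r], df.columns[c]) for r, c in positions]

print(pairs)
```

Note that `np.argwhere` yields integer positions, not labels; for the default integer index and columns used in the question the two coincide.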