什么是检查Pandas数据框中元素的最快方法？

Question

什么是检查Pandas数据框中元素的最快方法？

3

我对检查Pandas数据帧列中的项目的最佳方法有些困惑。

我正在编写一个程序，如果数据帧在某个列中有不允许的元素，就会引发错误。

以下是一个示例：

import pandas as pd

raw_data = {'first_name': ['Jay', 'Jason', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Jones', 'Miller', 'Ali', 'Milner', 'Cooze'], 
        'age': [47, 42, 36, 24, 73], 
        'preTestScore': [4, 4, 31, 2, 3],
        'postTestScore': [27, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
print(df)

输出结果

  first_name last_name  age  preTestScore  postTestScore
0      Jay       Jones   47             4             27
1      Jason    Miller   42             4             25
2       Tina       Ali   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70

如果列last_name中除了Jones、Miller、Ali、Milner或Cooze之外还包含其他内容，则发出警告。可能可以使用pandas.DataFrame.isin，但我不确定这是否是最有效的方法。类似以下代码：

if df.isin('last_name':{'Jones', 'Miller', 'Ali', 'Milner', 'Cooze'}).any() == False:
    raise:
        ValueError("Column `last_name` includes ill-formed elements.")

- EB2127

2个回答

2

你可以使用 loc：

if df.loc[~df['last_name'].isin({'Jones', 'Miller', 'Ali', 'Milner', 'Cooze'}), 'last_name'].any():
    raise ValueError("Column `last_name` includes ill-formed elements.")

这个检查是否在last_name中指定的值之外还有其他值。

- zipa

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- jezrael · Accepted Answer

我认为您可以使用all来检查是否匹配所有值：

if not df['last_name'].isin(['Jones', 'Miller', 'Ali', 'Milner', 'Cooze']).all():
    raise ValueError("Column `last_name` includes ill-formed elements.")

使用issubset的另一种解决方案：

if not set(['Jones', 'Miller', 'Ali', 'Milner', 'Cooze']).issubset(df['last_name']):
    raise ValueError("Column `last_name` includes ill-formed elements.")

时间:

np.random.seed(123)
N = 10000
L = list('abcdefghijklmno') 

df = pd.DataFrame({'last_name': np.random.choice(L, N)})
print (df)

In [245]: %timeit df['last_name'].isin(L).all()
The slowest run took 4.73 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 421 µs per loop

In [247]: %timeit set(L).issubset(df['last_name'])
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop

In [248]: %timeit df.loc[~df['last_name'].isin(L), 'last_name'].any()
1000 loops, best of 3: 562 µs per loop

注意：

性能确实取决于数据-行数和非匹配值的数量。