如何使用字典对数据框进行子集筛选?

8

假设有一个DataFrame,其中大部分列是分类数据。

> data.head()
  age risk     sex smoking
0  28   no    male      no
1  58   no  female      no
2  27   no    male     yes
3  26   no    male      no
4  29  yes  female     yes

我希望可以通过一个由键值对组成的字典来对这些分类变量进行子集筛选。

tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}

因此,我希望有以下子集。
data[ (data.risk == 'no') & (data.smoking == 'yes') & (data.sex == 'female')]

我想要做的是:

data[tmp]

什么是最符合Python/Pandas方法的做法?
最小示例:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

x = Series(random.randint(0,2,50), dtype='category')
x.cat.categories = ['no', 'yes']

y = Series(random.randint(0,2,50), dtype='category')
y.cat.categories = ['no', 'yes']

z = Series(random.randint(0,2,50), dtype='category')
z.cat.categories = ['male', 'female']

a = Series(random.randint(20,60,50), dtype='category')

data = DataFrame({'risk':x, 'smoking':y, 'sex':z, 'age':a})

tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}
5个回答

6

我会使用.query()方法来完成此任务:

qry = ' and '.join(["{} == '{}'".format(k,v) for k,v in tmp.items()])    

data.query(qry)

输出:

   age risk     sex smoking
7   24   no  female     yes
22  43   no  female     yes
23  42   no  female     yes
25  24   no  female     yes
32  29   no  female     yes
40  34   no  female     yes
43  35   no  female     yes

查询字符串:

print(qry)
"sex == 'female' and risk == 'no' and smoking == 'yes'"

4
您可以使用列表推导式和 concatall 进行操作:
import numpy as np
import pandas as pd

np.random.seed(123)
x = pd.Series(np.random.randint(0,2,10), dtype='category')
x.cat.categories = ['no', 'yes']
y = pd.Series(np.random.randint(0,2,10), dtype='category')
y.cat.categories = ['no', 'yes']
z = pd.Series(np.random.randint(0,2,10), dtype='category')
z.cat.categories = ['male', 'female']

a = pd.Series(np.random.randint(20,60,10), dtype='category')

data = pd.DataFrame({'risk':x, 'smoking':y, 'sex':z, 'age':a})
print (data)
  age risk     sex smoking
0  24   no    male     yes
1  23  yes    male     yes
2  22   no  female      no
3  40   no  female     yes
4  59   no  female      no
5  22   no    male     yes
6  40   no  female      no
7  27  yes    male     yes
8  55  yes    male     yes
9  48   no    male      no

tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}
mask = pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1).all(axis=1)
print (mask)
0    False
1    False
2    False
3     True
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

df1 = data[mask]
print (df1)
 age risk     sex smoking
3  40   no  female     yes

L = [(x[0], x[1]) for x in tmp.items()]
print (L)
[('smoking', 'yes'), ('sex', 'female'), ('risk', 'no')]

L = pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1)
print (L)
  smoking    sex   risk
0    True  False   True
1    True  False  False
2   False   True   True
3    True   True   True
4   False   True   True
5    True  False   True
6   False   True   True
7    True  False  False
8    True  False  False
9   False  False   True

Timings:

len(data)=1M.

N = 1000000
np.random.seed(123)
x = pd.Series(np.random.randint(0,2,N), dtype='category')
x.cat.categories = ['no', 'yes']
y = pd.Series(np.random.randint(0,2,N), dtype='category')
y.cat.categories = ['no', 'yes']
z = pd.Series(np.random.randint(0,2,N), dtype='category')
z.cat.categories = ['male', 'female']

a = pd.Series(np.random.randint(20,60,N), dtype='category')

data = pd.DataFrame({'risk':x, 'smoking':y, 'sex':z, 'age':a})

#[1000000 rows x 4 columns]
print (data)


tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}


In [133]: %timeit (data[pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1).all(axis=1)])
10 loops, best of 3: 89.1 ms per loop

In [134]: %timeit (data.query(' and '.join(["{} == '{}'".format(k,v) for k,v in tmp.items()])))
1 loop, best of 3: 237 ms per loop

In [135]: %timeit (pd.merge(pd.DataFrame(tmp, index =[0]), data.reset_index()).set_index('index'))
1 loop, best of 3: 256 ms per loop

3
您可以从字典中创建一个查找数据帧,然后与data进行内连接,这将产生与query相同的效果。
from pandas import merge, DataFrame
merge(DataFrame(tmp, index =[0]), data)

enter image description here


如果你想要保留索引呢? - lanery
@Ianery merge(DataFrame(tmp, index =[0]), data.reset_index()).set_index('index') 可以运行,但不是最理想的方法。 - Psidom

2
你可以构建一个布尔向量来检查这些属性,但可能有更好的方法:
df[risk == 'no' and smoking == 'yes' and sex == 'female' for (age, risk, sex, smoking) in df.itertuples()]

无效语法:抱怨“for” - Rho

1

我认为你可以在数据框上使用to_dict方法,然后使用列表推导式进行过滤:

df = pd.DataFrame(data={'age':[28, 29], 'sex':["M", "F"], 'smoking':['y', 'n']})
print df
tmp = {'age': 28, 'smoking': 'y', 'sex': 'M'}

print pd.DataFrame([i for i in df.to_dict('records') if i == tmp])


>>>    age sex smoking
0   28   M       y
1   29   F       n

   age sex smoking
0   28   M       y

您可以将tmp转换为一个系列:
ts = pd.Series(tmp)

print pd.DataFrame([i[1] for i in df.iterrows() if i[1].equals(ts)])

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接