Is there a Python Pandas equivalent of the SQL statement "SELECT GROUP BY a HAVING COUNT(1) > 1"?

Question

15

I am having a hard time filtering groupby results in pandas. What I want to do is:

select email, count(1) as cnt 
from customers 
group by email 
having count(email) > 1 
order by cnt desc

I tried:

customers.groupby('Email')['CustomerID'].size()

This correctly gives me the list of emails with their respective counts, but I cannot work out the having count(email) > 1 part.

email_cnt[email_cnt.size > 1]

returns 1.

email_cnt = customers.groupby('Email')
email_dup = email_cnt.filter(lambda x:len(x) > 2)

gives me all customer records with email count > 1, but I need the aggregated table.

- tangkk

2 Answers

6

Two more solutions in the modern "method chaining" style:

Using callable selection:

customers.groupby('Email').size().loc[lambda x: x>1].sort_values()

Using the query method:

(customers.groupby('Email')['CustomerID'].
    agg([len]).query('len > 1').sort_values('len'))
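
For reference, here is a minimal runnable sketch of both chained approaches. It assumes a small sample `customers` frame, a descending sort to match `order by cnt desc`, and a column name `cnt` introduced through `reset_index(name=...)`. The names `dup_counts` and `dup_table` are illustrative only and are not part of the original answer.

```python
import pandas as pd

# Sample data (an assumption for illustration)
customers = pd.DataFrame({
    'Email': ['foo', 'bar', 'foo', 'foo', 'baz', 'bar'],
    'CustomerID': [1, 2, 1, 2, 1, 1],
})

# Callable selection: .loc receives the counts Series and keeps values > 1
dup_counts = (
    customers.groupby('Email')
    .size()
    .loc[lambda s: s > 1]
    .sort_values(ascending=False)  # order by cnt desc
)

# query method: name the count column 'cnt' first, then filter on it
dup_table = (
    customers.groupby('Email')
    .size()
    .reset_index(name='cnt')
    .query('cnt > 1')
    .sort_values('cnt', ascending=False)
)

print(dup_counts)
print(dup_table)
```

The first form returns a Series indexed by email. The second returns a DataFrame with `Email` and `cnt` columns, which is closer in shape to the SQL result set.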

- Ilya V. Schurov

- Alex Riley · Accepted Answer

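Instead of writing email_cnt[email_cnt.size > 1], just write email_cnt[email_cnt > 1] (there is no need to call .size again). This uses the boolean Series email_cnt > 1 to return only the relevant values of email_cnt.

For example: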

>>> customers = pd.DataFrame({'Email':['foo','bar','foo','foo','baz','bar'],
                              'CustomerID':[1,2,1,2,1,1]})
>>> email_cnt = customers.groupby('Email')['CustomerID'].size()
>>> email_cnt[email_cnt > 1]
Email
bar      2
foo      3
dtype: int64
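
To see why the `.size` version does not filter per group, the sketch below contrasts the two expressions on the same sample data. It assumes the script form rather than the interactive prompt, and the names `n_groups`, `mask`, and `result` are illustrative only. The descending sort is added to mirror `order by cnt desc` from the question.

```python
import pandas as pd

customers = pd.DataFrame({
    'Email': ['foo', 'bar', 'foo', 'foo', 'baz', 'bar'],
    'CustomerID': [1, 2, 1, 2, 1, 1],
})
email_cnt = customers.groupby('Email')['CustomerID'].size()

# On a Series, .size is an attribute: the total number of elements
# (here, the number of distinct emails), so comparing it yields one bool.
n_groups = email_cnt.size
print(n_groups, email_cnt.size > 1)

# Comparing the Series itself yields one boolean per email,
# which is what boolean indexing needs.
mask = email_cnt > 1
result = email_cnt[mask].sort_values(ascending=False)
print(result)
```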