Pandas equivalent of SQL update where group by

3

I have been searching for a while, but I cannot find the right way to implement this query in pandas.

update product
  set maxrating = (select max(rating)
                   from rating
                   where source = 'customer'
                     and product.sku = rating.sku
                   group by sku)
  where maxrating is null;

Pandas

product = pd.DataFrame({'sku':[1,2,3],'maxrating':[0,0,1]})
rating = pd.DataFrame({'sku':[1,1,2,3,3],'rating':[2,5,3,5,4],'source':['retailer','customer','customer','retailer','customer']})
expected_result = pd.DataFrame({'sku':[1,2,3],'maxrating':[5,3,1]})

SQL

drop table if exists product;
create table product(sku integer primary key, maxrating int);
insert into product(maxrating) values(null),(null),(1);
drop table if exists rating; create table rating(sku int, rating int, source text);
insert into rating values(1,2,'retailer'),(1,5,'customer'),(2,3,'customer'),(2,5,'retailer'),(3,3,'retailer'),(3,4,'customer');
update product
  set maxrating = (select max(rating)
                   from rating
                   where source = 'customer'
                     and product.sku = rating.sku
                   group by sku)
  where maxrating is null;
select *
from product;
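
For reference, the SQL above can be checked from Python with the standard-library sqlite3 module. This is only a sketch that assumes SQLite semantics (where `integer primary key` fills in sku automatically); the names `conn` and `rows` are chosen here for illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(':memory:')

# Same statements as in the question, run as one script.
conn.executescript("""
create table product(sku integer primary key, maxrating int);
insert into product(maxrating) values(null),(null),(1);
create table rating(sku int, rating int, source text);
insert into rating values(1,2,'retailer'),(1,5,'customer'),(2,3,'customer'),
                         (2,5,'retailer'),(3,3,'retailer'),(3,4,'customer');
update product
  set maxrating = (select max(rating)
                   from rating
                   where source = 'customer'
                     and product.sku = rating.sku
                   group by sku)
  where maxrating is null;
""")

# Read the updated table back as a list of (sku, maxrating) tuples.
rows = conn.execute("select sku, maxrating from product order by sku").fetchall()
conn.close()
print(rows)
```

Only the rows whose maxrating was null are touched; sku 3 keeps its existing value.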

How can this be done?
3 Answers

4

Try this:

In [220]: product.loc[product.maxrating == 0, 'maxrating'] = product.sku.map(rating.groupby('sku')['rating'].max())

In [221]: product
Out[221]:
   maxrating  sku
0          5    1
1          3    2
2          1    3

Or using an ordinary boolean mask:

In [222]: mask = (product.maxrating == 0)

In [223]: product.loc[mask, 'maxrating'] = product.loc[mask, 'sku'].map(rating.groupby('sku')['rating'].max())

In [224]: product
Out[224]:
   maxrating  sku
0          5    1
1          3    2
2          1    3
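
The snippets above take the maximum over all rating sources, which happens to match the expected result for this sample data. A minimal sketch that also applies the SQL's `source = 'customer'` condition could look like the following; it assumes, as in the question's sample, that 0 stands in for NULL, and the name `customer_max` is introduced here for illustration.

```python
import pandas as pd

product = pd.DataFrame({'sku': [1, 2, 3], 'maxrating': [0, 0, 1]})
rating = pd.DataFrame({
    'sku': [1, 1, 2, 3, 3],
    'rating': [2, 5, 3, 5, 4],
    'source': ['retailer', 'customer', 'customer', 'retailer', 'customer'],
})

# Max rating per sku, restricted to customer ratings
# (the SQL "where source = 'customer'" plus "group by sku").
customer_max = (
    rating.loc[rating['source'] == 'customer']
    .groupby('sku')['rating']
    .max()
)

# Rows to update: 0 plays the role of SQL NULL in this sample data.
mask = product['maxrating'] == 0

# Look up each masked sku in customer_max and write the result back.
product.loc[mask, 'maxrating'] = product.loc[mask, 'sku'].map(customer_max)
print(product)
```

A sku with no customer rating would map to NaN here, which mirrors the SQL subquery returning NULL for that row.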

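Wow!!! This is exactly what I was looking for, thank you very much! Now I just need to find out whether the map approach can be used with multiple columns rather than a single Series; otherwise I will just use a computed column. - Arthur Burkhardt
@ArthurBurkhardt, you are welcome! I suggest you open a new question and provide a sample and the desired data set. You did a very good job when asking this one: when we have code that generates the input and desired data sets, it is much easier for the SO community to answer. - MaxU - stand with Ukraine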

2

All together

product.maxrating = product.maxrating.replace(0, np.nan)
missing = product.loc[product.maxrating.isnull(), 'sku']
missingmax = rating.groupby(missing, as_index=False).rating.agg({'maxrating': 'max'})
product.update(missingmax)

First, let's start with nulls instead of zeros

product.maxrating = product.maxrating.replace(0, np.nan)
product


Next, identify the 'sku' values that are missing a rating and use them in a groupby to compute missingmax

missing = product.loc[product.maxrating.isnull(), 'sku']
missingmax = rating.groupby(missing, as_index=False).rating.agg({'maxrating': 'max'})

missingmax


Use the update method.

product.update(missingmax)
product
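
Two things are worth keeping in mind with the steps above: DataFrame.update aligns on the index rather than on the 'sku' column, and the dict-renaming form of agg on a single column was deprecated and, as far as I know, raises an error in pandas 1.0 and later. A hedged sketch of the same idea for recent pandas, using named aggregation and indexing by sku before the update, might be the following (the customer filter is added here to match the SQL and is not part of the steps above):

```python
import numpy as np
import pandas as pd

product = pd.DataFrame({'sku': [1, 2, 3], 'maxrating': [0, 0, 1]})
rating = pd.DataFrame({
    'sku': [1, 1, 2, 3, 3],
    'rating': [2, 5, 3, 5, 4],
    'source': ['retailer', 'customer', 'customer', 'retailer', 'customer'],
})

# Treat 0 as missing, as in the steps above (the column becomes float).
product['maxrating'] = product['maxrating'].replace(0, np.nan)

# skus whose maxrating is missing
missing = product.loc[product['maxrating'].isnull(), 'sku']

# Max customer rating for those skus, using named aggregation.
# The result is indexed by sku with a single column 'maxrating'.
missingmax = (
    rating[rating['sku'].isin(missing) & (rating['source'] == 'customer')]
    .groupby('sku')
    .agg(maxrating=('rating', 'max'))
)

# update aligns on the index, so index product by sku first.
product = product.set_index('sku')
product.update(missingmax)
product = product.reset_index()
print(product)
```

Because of the NaN step, maxrating ends up as a float column; an astype(int) at the end would restore integers if no missing values remain.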



1
You can do the following:

In [127]: df = pd.merge(rating, product, on='sku')

In [128]: df1 = df[df['maxrating'] == 0].groupby('sku').agg({'rating': np.max}).reset_index().rename(columns={'rating': 'maxrating'})

In [129]: df2 = df[df['maxrating'] != 0][['sku', 'maxrating']].drop_duplicates(keep='first')

In [131]: pd.concat([df1, df2])
Out[131]: 
   sku  maxrating
0    1          5
1    2          3
3    3          1

In [132]: expected_result
Out[132]: 
   sku  maxrating
0    1          5
1    2          3
2    3          1

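Basically, I merge the two data frames, then extract the rows that need processing (those without a max rating) and find the actual max rating for them. Once that is done, I concatenate the result with the rows I excluded (those that already have a max rating), which gives the expected result.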
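
A variation on the merge idea that keeps product's row order and index, sketched under the assumption that 0 marks a missing rating and that only customer ratings should count (as in the SQL). The column name `customer_max` and the variables `merged` and `result` are introduced here for illustration.

```python
import pandas as pd

product = pd.DataFrame({'sku': [1, 2, 3], 'maxrating': [0, 0, 1]})
rating = pd.DataFrame({
    'sku': [1, 1, 2, 3, 3],
    'rating': [2, 5, 3, 5, 4],
    'source': ['retailer', 'customer', 'customer', 'retailer', 'customer'],
})

# One row per sku with the max customer rating.
customer_max = (
    rating[rating['source'] == 'customer']
    .groupby('sku', as_index=False)['rating']
    .max()
    .rename(columns={'rating': 'customer_max'})
)

# Left merge keeps every product row, even if it has no customer rating.
merged = product.merge(customer_max, on='sku', how='left')

# Keep the existing maxrating where it is non-zero; otherwise use the customer max.
merged['maxrating'] = merged['maxrating'].where(
    merged['maxrating'] != 0, merged['customer_max']
)

result = merged[['sku', 'maxrating']]
print(result)
```

This avoids splitting the frame in two and concatenating, so the index of the result runs 0, 1, 2 like expected_result.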
