Pyspark: get percentage result after groupBy

Question

3

For example, here is my test data.

test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
])\
  .toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()

I can get summary data like this.

from pyspark.sql import functions as F
test.groupBy("customerid", "location").agg(F.sum("price")).show()

But I also want the percentage of each group within its customer's total, something like this:

+----------+--------+----------+----------+
|customerid|location|sum(price)|percentage|
+----------+--------+----------+----------+
|         1|Region B|         2|       20%|
|         1|Region A|         8|       80%|
|         3|Region A|         1|      100%|
|         2|Region B|         1|      100%|
+----------+--------+----------+----------+

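I'd like to know:

How can I do this? Maybe with a window function?
Can I pivot the table into something like this (with percentage and sum columns)?

The only example I found was for pandas: "How to get percentage of counts of a column after groupby in Pandas".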

Update:

With help from @Gordon Linoff, I can get the percentage with:

from pyspark.sql import functions as F
from pyspark.sql.window import Window
test.groupBy("customerid", "location").agg(F.sum("price"))\
  .withColumn("percentage", F.col("sum(price)")/F.sum("sum(price)").over(Window.partitionBy("customerid"))).show()

- cqcn1991
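
To make the window expression in the update above concrete, here is a plain-Python sketch of the same arithmetic on the test rows: sum `price` per `(customerid, location)`, then divide each group sum by that customer's total. No Spark is involved, and the names `group_sums`, `customer_totals`, and `percentages` are illustrative only.

```python
from collections import defaultdict

# Same rows as the test DataFrame:
# (orderid, customerid, price, transactiondate, location)
rows = [
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
]

# Step 1: mirrors groupBy("customerid", "location").agg(sum("price")).
group_sums = defaultdict(int)
for _orderid, customerid, price, _date, location in rows:
    group_sums[(customerid, location)] += price

# Step 2: mirrors sum(...).over(Window.partitionBy("customerid")),
# i.e. one total per customer across all of that customer's groups.
customer_totals = defaultdict(int)
for (customerid, _location), total in group_sums.items():
    customer_totals[customerid] += total

# Step 3: ratio of each group sum to its customer's total.
percentages = {
    key: total / customer_totals[key[0]]
    for key, total in group_sums.items()
}

for (customerid, location), ratio in sorted(percentages.items()):
    print(customerid, location, group_sums[(customerid, location)], f"{ratio:.0%}")
```

For customer 1 this gives 8 / 10 for Region A and 2 / 10 for Region B, matching the 80% and 20% in the desired output.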

2 Answers

2

This answer addresses the original question.

In SQL, you can use window functions:

select customerid, location, sum(price),
       sum(price) / sum(sum(price)) over (partition by customerid) as ratio
from t
group by customerid, location;

- Gordon Linoff

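Hi, I have it working now. Thank you very much. Also, can I pivot the result? I have updated my question. - ZK Zhao

@cqcn1991... New questions should be asked as new questions, not added by editing an existing question in a way that invalidates the answers. - Gordon Linoff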

- Brandonnnn · Accepted Answer

Here is a clean version of the code for your problem:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

(test.groupby("customerid", "location")
      .agg(F.sum("price").alias("t_price"))
      .withColumn("perc", F.col("t_price") / F.sum("t_price").over(Window.partitionBy("customerid")))