PySpark: weighted average by column

Question

For example, I have a dataset like this:

test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
])\
  .toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()

I can get a customer-by-region matrix of order counts, normalized by each customer's total, like this:

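# Note: this import shadows Python's built-in sum with Spark's aggregate sum.
from pyspark.sql.functions import col, count, sum

overall_stat = test.groupBy("customerid").agg(count("orderid"))\
  .withColumnRenamed("count(orderid)", "overall_count")
temp_result = test.groupBy("customerid").pivot("location").agg(count("orderid"))\
  .na.fill(0).join(overall_stat, ["customerid"])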

for field in temp_result.schema.fields:
    if str(field.name) not in ['customerid', "overall_count", "overall_amount"]:
        name = str(field.name)
        temp_result = temp_result.withColumn(name, col(name)/col("overall_count"))
temp_result.show()

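The result has one row per customer: each region column holds the share of that customer's orders placed in that region, and overall_count holds the customer's total number of orders.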

Now I want to compute the average of each region column weighted by overall_count. How can I do that?

The result should be (0.66*3 + 1*1)/5 for Region A and (0.33*3 + 1*1)/5 for Region B, where 5 is the sum of overall_count over all customers.

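My thoughts:

I could certainly convert the data to Python/pandas and do the calculation there, but then when would we ever use PySpark?

I can get something close with the following: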

temp_result.agg(sum(col("Region A") * col("overall_count")), sum(col("Region B")*col("overall_count"))).show()

But this approach does not scale well, especially when there are many regions to compute.

- cqcn1991

You can refer to a question I asked earlier: https://dev59.com/Eqfja4cB1Zd3GeqPsj6_ - pissall

1 Answer

- shadow_dev · Accepted Answer

You can compute a weighted average by breaking the steps above into multiple stages.

Consider the following:

Dataframe Name: sales_table
[ total_sales, count_of_orders, location]
[     50     ,       9        ,    A    ]
[     80     ,       4        ,    A    ]
[     90     ,       7        ,    A    ]

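Computing a grouped weighted average of the data above takes three steps:

1. Multiply sales by importance (the weight, here count_of_orders) to get sales_x_count.
2. Aggregate (sum) the sales_x_count products within each group.
3. Divide the summed sales_x_count by the sum of the weights.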
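To make the arithmetic concrete, here is a minimal plain-Python sketch of those steps on the three sample rows. It assumes count_of_orders is the weight and that all rows belong to the single location A; the variable names are illustrative only.

```python
# Rows of the sample sales_table as (total_sales, count_of_orders) pairs,
# all for location A.
rows = [(50, 9), (80, 4), (90, 7)]

# Steps 1 and 2: multiply each sales value by its weight, then sum the products.
sum_sales_x_count = sum(sales * weight for sales, weight in rows)  # 450 + 320 + 630

# Step 3: divide by the sum of the weights.
sum_weights = sum(weight for _, weight in rows)  # 9 + 4 + 7
weighted_average = sum_sales_x_count / sum_weights

print(weighted_average)  # 70.0
```

The plain mean of the three sales values would be about 73.3; the weighted result is lower because the row with the most orders has the smallest sales value.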

Splitting that process into stages in PySpark code gives the following:

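from pyspark.sql import functions as sf
from pyspark.sql.functions import col

# Weight each row's sales by its order count, then divide the summed
# products by the summed weights within each location.
new_sales = sales_table \
    .withColumn("sales_x_count", col("total_sales") * col("count_of_orders")) \
    .groupBy("location") \
    .agg(sf.sum("count_of_orders").alias("sum_count_of_orders"),
         sf.sum("sales_x_count").alias("sum_sales_x_count")) \
    .withColumn("count_weighted_average", col("sum_sales_x_count") / col("sum_count_of_orders"))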

So... there is really no need for a fancy UDF here (which would likely slow you down).