Dividing a PySpark DataFrame column by a column of another PySpark DataFrame when the IDs match


I have a PySpark DataFrame, df1, which looks like this:

CustomerID  CustomerValue
12          .17
14          .15
14          .25
17          .50
17          .01
17          .35

I have a second PySpark DataFrame, df2, which is df1 grouped by CustomerID and aggregated with the sum function. It looks like this:

 CustomerID  CustomerValueSum
 12          .17
 14          .40
 17          .86
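
For reference, df2 was built from df1 roughly like this (a minimal sketch of the groupBy/sum aggregation, naming the aggregate CustomerValueSum):

import pyspark.sql.functions as F

# Group df1 by CustomerID and sum CustomerValue per customer.
df2 = (
    df1.groupBy('CustomerID')
       .agg(F.sum('CustomerValue').alias('CustomerValueSum'))
)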

I want to add a third column to df1 that is df1['CustomerValue'] divided by df2['CustomerValueSum'] for the matching CustomerIDs. It would look like this:

CustomerID  CustomerValue  NormalizedCustomerValue
12          .17            1.00
14          .15            .38
14          .25            .62
17          .50            .58
17          .01            .01
17          .35            .41

In other words, I'm trying to convert this Python/Pandas code to PySpark:

normalized_list = []
for idx, row in df1.iterrows():
    (
        normalized_list
        .append(
            row.CustomerValue / df2[df2.CustomerID == row.CustomerID].CustomerValueSum
        )
    )
df1['NormalizedCustomerValue'] = [val.values[0] for val in normalized_list]

How do I do this?

2 Answers


Code:

import pyspark.sql.functions as F

# Join the per-customer sums onto df1, compute the ratio, then drop the helper column.
df1 = df1\
    .join(df2, "CustomerID")\
    .withColumn("NormalizedCustomerValue", F.col("CustomerValue") / F.col("CustomerValueSum"))\
    .drop("CustomerValueSum")

Output:

df1.show()

+----------+-------------+-----------------------+
|CustomerID|CustomerValue|NormalizedCustomerValue|
+----------+-------------+-----------------------+
|        17|          0.5|     0.5813953488372093|
|        17|         0.01|   0.011627906976744186|
|        17|         0.35|     0.4069767441860465|
|        12|         0.17|                    1.0|
|        14|         0.15|    0.37499999999999994|
|        14|         0.25|                  0.625|
+----------+-------------+-----------------------+
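
Since the aggregated frame is usually much smaller than df1, you can optionally hint a broadcast join so the large side isn't shuffled; a sketch using pyspark.sql.functions.broadcast:

# Broadcasting the small aggregated frame keeps the join map-side.
df1 = df1\
    .join(F.broadcast(df2), "CustomerID")\
    .withColumn("NormalizedCustomerValue", F.col("CustomerValue") / F.col("CustomerValueSum"))\
    .drop("CustomerValueSum")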

This can also be achieved using Spark window functions, so you don't need to create a separate DataFrame (df2) with the aggregated values:

Creating the data for the input DataFrame:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)  # assumes an existing SparkContext named sc

data = [(12, 0.17), (14, 0.15), (14, 0.25), (17, 0.5), (17, 0.01), (17, 0.35)]
df1 = sqlContext.createDataFrame(data, ['CustomerID', 'CustomerValue'])
df1.show()
+----------+-------------+
|CustomerID|CustomerValue|
+----------+-------------+
|        12|         0.17|
|        14|         0.15|
|        14|         0.25|
|        17|          0.5|
|        17|         0.01|
|        17|         0.35|
+----------+-------------+

Defining a Window partitioned by CustomerID:

from pyspark.sql import Window
from pyspark.sql.functions import sum

# Partition by CustomerID so the sum is computed per customer.
w = Window.partitionBy('CustomerID')

# Divide each CustomerValue by its per-customer sum computed over the window.
df2 = df1.withColumn(
    'NormalizedCustomerValue',
    df1.CustomerValue / sum(df1.CustomerValue).over(w)
).orderBy('CustomerID')

df2.show()
+----------+-------------+-----------------------+
|CustomerID|CustomerValue|NormalizedCustomerValue|
+----------+-------------+-----------------------+
|        12|         0.17|                    1.0|
|        14|         0.15|    0.37499999999999994|
|        14|         0.25|                  0.625|
|        17|          0.5|     0.5813953488372093|
|        17|         0.01|   0.011627906976744186|
|        17|         0.35|     0.4069767441860465|
+----------+-------------+-----------------------+
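
On more recent Spark versions the same window approach can be written with a SparkSession instead of HiveContext; a minimal sketch, assuming a default local Spark setup and using pyspark.sql.functions to avoid shadowing Python's built-in sum:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()  # assumes a default/local Spark setup

data = [(12, 0.17), (14, 0.15), (14, 0.25), (17, 0.5), (17, 0.01), (17, 0.35)]
df1 = spark.createDataFrame(data, ['CustomerID', 'CustomerValue'])

# Per-customer sum over a window, then divide each value by it.
w = Window.partitionBy('CustomerID')
df2 = df1.withColumn(
    'NormalizedCustomerValue',
    F.col('CustomerValue') / F.sum('CustomerValue').over(w)
).orderBy('CustomerID')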

