How do I sum two columns of a DataFrame that contain null values in Spark/PySpark?

Question


5

I have a DataFrame in the following format:

Col1    |cnt_Test1     |cnt_Test2
_______________________________________
Stud1   | null        | 2
Stud2   | 3           | 4
Stud3   | 1           | null

I want to create a new column that adds cnt_Test1 and cnt_Test2, giving the following result:

Col1    |cnt_Test1     |cnt_Test2     | new_Count
____________________________________________________
Stud1   | null        | 2              | 2
Stud2   | 3           | 4              | 7
Stud3   | 1           | null           | 1

However, I get the following output instead, because adding null to a long yields null:

Col1    |cnt_Test1     |cnt_Test2     | new_Count
____________________________________________________
Stud1   | null        | 2              | null
Stud2   | 3           | 4              | 7
Stud3   | 1           | null           | null
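
The null values in new_Count come from SQL-style null propagation, which Spark applies to arithmetic: a `+` with a null operand evaluates to null. As a rough illustration only, the plain-Python sketch below models that rule, with `None` standing in for null. `sql_add` is a hypothetical helper written for this example, not a Spark API.

```python
from typing import Optional


def sql_add(a: Optional[int], b: Optional[int]) -> Optional[int]:
    """Model SQL-style addition: the result is null if either operand is null."""
    if a is None or b is None:
        return None
    return a + b


# Rows from the question as (Col1, cnt_Test1, cnt_Test2)
rows = [("Stud1", None, 2), ("Stud2", 3, 4), ("Stud3", 1, None)]

new_count = [sql_add(t1, t2) for _, t1, t2 in rows]
print(new_count)  # [None, 7, None]
```

Only the Stud2 row has two non-null operands, so it is the only row that gets a numeric sum.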

- Amit Pandey

Yes, it does. Thanks, I tried searching before posting but couldn't find it. - Amit Pandey

@AmitPandey If the answer given resolves your question, please accept and upvote it. - User12345

2 Answers

1

You can also do it in two steps:

df2 = df.na.fill(0)
df2.withColumn("new_Count", df2["cnt_Test1"] + df2["cnt_Test2"]).show()

- CyberPunk

1

It works, but it changes the nature of the original data. The nulls must be preserved so that we can compute averages. - Amit Pandey
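
To make this concern concrete: Spark's `avg` aggregate skips nulls, so replacing nulls with 0 in the stored column changes the mean. The plain-Python sketch below models that behavior on the cnt_Test1 values from the question. `avg_ignoring_nulls` is a hypothetical helper for illustration, with `None` standing in for null.

```python
from typing import List, Optional


def avg_ignoring_nulls(values: List[Optional[int]]) -> Optional[float]:
    """Average the non-null values; return None when there are none."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


cnt_test1 = [None, 3, 1]                             # column as stored
filled = [0 if v is None else v for v in cnt_test1]  # after filling nulls with 0

print(avg_ignoring_nulls(cnt_test1))  # 2.0 (only 3 and 1 are counted)
print(avg_ignoring_nulls(filled))     # about 1.33 (the 0 is counted too)
```

The filled column reports a lower average because the missing score is treated as a real score of 0.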


- User12345 · Accepted Answer

You need to use the coalesce function, as shown below.

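# Import functions
import pyspark.sql.functions as f

# Sample data matching the question (assumes an existing SparkSession named `spark`)
df = spark.createDataFrame(
    [
        ("Stud1", None, 2),
        ("Stud2", 3, 4),
        ("Stud3", 1, None),
    ],
    ("col1", "cnt_Test1", "cnt_Test2"),
)

# Treat null as 0 only inside the sum; the source columns keep their nulls
df1 = df.withColumn(
    "new_count",
    f.coalesce(f.col("cnt_Test1"), f.lit(0)) + f.coalesce(f.col("cnt_Test2"), f.lit(0)),
)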
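
For intuition, coalesce returns its first non-null argument, so each null count is treated as 0 only while the sum is computed. Below is a plain-Python model of the expression above, with `None` standing in for null. The `coalesce` and `null_safe_sum` functions here are hypothetical stand-ins written for this sketch, not the PySpark functions themselves.

```python
from typing import Optional


def coalesce(*values: Optional[int]) -> Optional[int]:
    """Return the first argument that is not None, or None if all are None."""
    for v in values:
        if v is not None:
            return v
    return None


def null_safe_sum(a: Optional[int], b: Optional[int]) -> int:
    """Sum two values, substituting 0 for a missing one."""
    return coalesce(a, 0) + coalesce(b, 0)


# Rows from the question as (Col1, cnt_Test1, cnt_Test2)
rows = [("Stud1", None, 2), ("Stud2", 3, 4), ("Stud3", 1, None)]

new_count = [null_safe_sum(t1, t2) for _, t1, t2 in rows]
print(new_count)                  # [2, 7, 1]
print(null_safe_sum(None, None))  # 0
```

One consequence of this approach is that a row where both counts are null gets 0 rather than null. If such rows should stay null, the expression needs an extra condition.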