How do I filter out null values in a PySpark DataFrame?

Question


6

Suppose we have a simple DataFrame:

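from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

# In the pyspark shell `spark` already exists; otherwise create it.
spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('id', LongType(), False),
    StructField('name', StringType(), False),
    StructField('count', LongType(), True),  # the only nullable column
])
df = spark.createDataFrame([(1, 'Alice', None), (2, 'Bob', 1)], schema)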

The question is how to detect the null values. I tried the following:

df.where(df.count == None).show()
df.where(df.count is 'null').show()
df.where(df.count == 'null').show()

Each attempt fails with the error:

condition should be string or Column

I know that the following works:

df.where("count is null").show()

But is there a way to do this without writing the whole condition as a string? Something like df.count ...?

- Miroslav Stola

2 Answers

8

You can use the Spark function isnull. It returns a boolean column that is true wherever the given column is null:

from pyspark.sql import functions as F
df.where(F.isnull(F.col("count"))).show()

Or use the Column method isNull directly:

df.where(F.col("count").isNull()).show()
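
A note on why the attempts in the question fail, since the error message does not make it obvious: count is also the name of a DataFrame method, and Python's normal attribute lookup finds that method before the column-by-attribute fallback (__getattr__) is ever consulted. So df.count is a bound method, df.count == None is just the Python value False, and where rejects it because it is neither a string nor a Column. The sketch below reproduces that mechanism in plain Python. ToyFrame is a hypothetical stand-in written for illustration only; it is not PySpark's class.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; not PySpark code."""

    def __init__(self, columns):
        self.columns = list(columns)

    def count(self):
        # A real method that happens to share its name with a column.
        return 0

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        if name in self.columns:
            return f"Column<{name}>"
        raise AttributeError(name)

    def __getitem__(self, name):
        # Bracket access never collides with method names.
        if name in self.columns:
            return f"Column<{name}>"
        raise KeyError(name)


toy = ToyFrame(["id", "name", "count"])

print(toy.id)               # the column: the __getattr__ fallback is used
print(callable(toy.count))  # the method shadows the column of the same name
print(toy.count == None)    # a plain Python bool, not a column expression
print(toy["count"])         # the column, reached unambiguously
```

With a real DataFrame the bracket form is the closest thing to the df.count ... syntax the question asks for: df.where(df["count"].isNull()).show().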

- Steven

2

For those who, like me, are not familiar with PySpark syntax: .isNotNull() gives you all the non-null values. - labyrinth

~F.col("count").isNull() also works as the negation. - Wassadamo


- Ramesh Maharjan · Accepted Answer

9

The same can be done with the filter API:

from pyspark.sql import functions as F
df.filter(F.isnull("count")).show()

- Ramesh Maharjan

Is there any significant difference between "where" and "filter"? I mean in general, not only in this case. - Miroslav Stola

5

@MiroslavStola, "where" is an alias for "filter". "filter" is the standard name in functional programming, while "where" suits people who prefer the SQL style. - Ramesh Maharjan