Comparison operator in PySpark (not equal / !=)

Question

sql apache-spark pyspark null apache-spark-sql

23

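I am trying to get all rows in a dataframe where both flags are set to '1', and then all rows where only one of the two flags is set to '1' and the other is not equal to '1'.

With the following schema (three columns),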

df = sqlContext.createDataFrame([('a',1,'null'),('b',1,1),('c',1,'null'),('d','null',1),('e',1,1)], #,('f',1,'NaN'),('g','bla',1)],
                            schema=('id', 'foo', 'bar')
                            )

I obtain the following dataframe:

+---+----+----+
| id| foo| bar|
+---+----+----+
|  a|   1|null|
|  b|   1|   1|
|  c|   1|null|
|  d|null|   1|
|  e|   1|   1|
+---+----+----+

When I apply the desired filters, the first one (foo=1 AND bar=1) works, but the second one (foo=1 AND NOT bar=1) does not.

foobar_df = df.filter( (df.foo==1) & (df.bar==1) )

yields:

+---+---+---+
| id|foo|bar|
+---+---+---+
|  b|  1|  1|
|  e|  1|  1|
+---+---+---+

Here is the filter that does not work:

foo_df = df.filter( (df.foo==1) & (df.bar!=1) )
foo_df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
+---+---+---+

Why is it not filtering? How can I get the rows where only foo is equal to '1'?

- Hendrik F

2 Answers

22

To filter on null values, try:

foo_df = df.filter( (df.foo==1) & (df.bar.isNull()) )

https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull
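
Note that an isNull-only filter matches rows where bar is missing, but not rows where bar holds some other value different from 1. A minimal sketch of that difference, using Python's built-in sqlite3 module as a stand-in for Spark (an assumption, so that it runs without a cluster); row 'f' with bar = 0 is hypothetical and not part of the question's data:

```python
import sqlite3

# In-memory SQLite table standing in for the Spark DataFrame (assumption).
# Row 'f' (bar = 0) is a hypothetical extra row added for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, foo INTEGER, bar INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("a", 1, None), ("b", 1, 1), ("c", 1, None),
     ("d", None, 1), ("e", 1, 1), ("f", 1, 0)],
)

# SQL counterpart of (df.foo == 1) & df.bar.isNull(): only missing bar values.
only_null = [row[0] for row in conn.execute(
    "SELECT id FROM t WHERE foo = 1 AND bar IS NULL ORDER BY id")]

# Also accepting non-NULL values other than 1 picks up row 'f' as well.
null_or_other = [row[0] for row in conn.execute(
    "SELECT id FROM t WHERE foo = 1 AND (bar IS NULL OR bar != 1) ORDER BY id")]

print(only_null)      # ['a', 'c']
print(null_or_other)  # ['a', 'c', 'f']
```

For the data in the question both filters return the same rows, because bar is only ever 1 or missing there.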

- johnaphun

- zero323 · Accepted Answer

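Why is it not filtering

Because this is SQL, and NULL indicates a missing value. Any comparison with NULL, other than IS NULL and IS NOT NULL, is therefore undefined: it evaluates to NULL rather than to true or false, and a filter keeps only the rows where the condition is true. You need to use one of the following: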

col("bar").isNull() | (col("bar") != 1)

or

coalesce(col("bar") != 1, lit(True))

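or (PySpark >= 2.3):

~col("bar").eqNullSafe(1)

if you want null-safe comparisons in PySpark. eqNullSafe treats NULL as an ordinary value, so it has to be negated here to express "not equal to 1".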
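
The behaviour described above can be reproduced in plain SQL. The following is a minimal sketch that uses Python's built-in sqlite3 module instead of Spark (an assumption, chosen so the example runs anywhere); SQLite's `IS NOT` operator plays the role of a negated null-safe equality here. The helper `ids` is hypothetical and exists only for this illustration.

```python
import sqlite3

# In-memory SQLite table with the question's data, using real NULLs (None).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, foo INTEGER, bar INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("a", 1, None), ("b", 1, 1), ("c", 1, None), ("d", None, 1), ("e", 1, 1)],
)

def ids(predicate):
    """Return ids of rows with foo = 1 that satisfy a SQL predicate on bar."""
    query = f"SELECT id FROM t WHERE foo = 1 AND {predicate} ORDER BY id"
    return [row[0] for row in conn.execute(query)]

# Comparing NULL with a value yields NULL (None in Python), not true or false.
null_vs_one = conn.execute("SELECT NULL != 1").fetchone()[0]

naive = ids("bar != 1")                       # NULL results are dropped
with_is_null = ids("(bar IS NULL OR bar != 1)")
with_coalesce = ids("COALESCE(bar != 1, 1)")  # replace a NULL result by true
with_null_safe = ids("bar IS NOT 1")          # null-safe "not equal" in SQLite

print(null_vs_one)     # None
print(naive)           # []
print(with_is_null)    # ['a', 'c']
print(with_coalesce)   # ['a', 'c']
print(with_null_safe)  # ['a', 'c']
```

All three rewritten predicates return the same rows; which one to prefer is mostly a matter of readability.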

Also, 'null' is not a valid way to introduce a NULL literal. You should use None to indicate a missing object.

from pyspark.sql.functions import col, coalesce, lit

df = spark.createDataFrame([
    ('a', 1, 1), ('a',1, None), ('b', 1, 1),
    ('c' ,1, None), ('d', None, 1),('e', 1, 1)
]).toDF('id', 'foo', 'bar')

df.where((col("foo") == 1) & (col("bar").isNull() | (col("bar") != 1))).show()

## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+

df.where((col("foo") == 1) & coalesce(col("bar") != 1, lit(True))).show()

## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+
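
On the point that the string 'null' is not a NULL: the difference is easy to check in any SQL engine. A small sketch, again using the built-in sqlite3 module rather than Spark (an assumption); the table and the id labels are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The bar column is left untyped so SQLite stores each value as given.
conn.execute("CREATE TABLE t (id TEXT, bar)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("string", "null"), ("none", None)],  # the text 'null' vs. a real NULL
)

# typeof() reports the storage class; IS NULL is true only for a real NULL.
rows = conn.execute(
    "SELECT id, typeof(bar), bar IS NULL FROM t ORDER BY id"
).fetchall()

print(rows)  # [('none', 'null', 1), ('string', 'text', 0)]
```

The row built from the string 'null' is ordinary text, so IS NULL (and isNull in PySpark) does not match it.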