PySpark DataFrame - how to pass a string variable to a df.where() condition

Question

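I am not sure whether this is possible in pyspark. I believe it should be, but I am just not getting anywhere :(.

Requirement: retrieve any records where FNAME and LNAME are null or 0

Expected result: the first two rows are returned.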

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load(fileName)
df.show()

+------+-------+------+
| FNAME|  LNAME|  CITY|
+------+-------+------+
|     0|   null|    NY|
|  null|      0|  null|
|   Joe|   null|    LA|
|  null|   Deon|    SA|
| Steve|   Mark|  null|
+------+-------+------+

colCondition = []
for col in df.columns:
    condition = '(df.'+col+'.isNull() | df.'+col+' == 0)'
    colCondition.append(condition)

dfWhereConditon = ' & '.join(colCondition)

I would like help getting the following to work:

df.where(dfWhereConditon)

This does not work because dfWhereConditon is treated as a string inside the where condition. How can I solve this, or is there a better way to achieve it?

Thanks

- just10minutes

1 Answer

- MaFF · Accepted Answer

If you want to use a string condition, you can use a SQL filter clause:

condition = ' AND '.join(['('+ col + ' IS NULL OR ' + col + ' = 0)' for col in df.columns])
df.filter(condition)
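
As a minimal sketch of the same idea, the condition string can be built by a small helper so it is easy to restrict to the columns named in the requirement (FNAME and LNAME) rather than every column in df.columns. The helper name build_null_or_zero_condition and the backtick quoting of column names are my own additions for illustration, not part of the answer above. The string is built in plain Python, so it can be inspected before it is passed to Spark.

```python
def build_null_or_zero_condition(columns):
    """Build a SQL filter string: every listed column is NULL or equal to 0."""
    # Backticks quote each identifier so unusual column names still parse in Spark SQL.
    clauses = ["(`{0}` IS NULL OR `{0}` = 0)".format(name) for name in columns]
    return " AND ".join(clauses)


# Only the two columns named in the requirement (an assumption about the intent).
condition = build_null_or_zero_condition(["FNAME", "LNAME"])
print(condition)

# With a DataFrame such as the question's df (assumed to exist here):
# df.filter(condition).show()
```

The = 0 comparison is kept from the answer. On columns read from CSV as strings, how that comparison behaves depends on Spark's type coercion settings, so comparing against '0' may be the safer choice there.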