I want to check whether an array contains a string in PySpark (Spark < 2.4).
Sample dataframe:
column_1 <Array> | column_2 <String>
--------------------------------------------
["2345","98756","8794"] | 8794
--------------------------------------------
["8756","45678","987563"] | 1234
--------------------------------------------
["3475","8956","45678"] | 3475
--------------------------------------------
I want to compare the two columns column_1 and column_2. If column_1 contains the value of column_2, that value should be dropped from column_1. I wrote a UDF to subtract column_2 from column_1, but it didn't work.
from pyspark.sql.functions import udf

def contains(x, y):
    try:
        sx, sy = set(x), set(y)
        if len(sx) == 0:
            return sx
        elif len(sy) == 0:
            return sx
        else:
            return sx - sy
    # in exception, for example `x` or `y` is None (not a list)
    except:
        return sx

udf_contains = udf(contains, 'string')
new_df = my_df.withColumn('column_1', udf_contains(my_df.column_1, my_df.column_2))
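One reason the UDF above misbehaves: `column_2` reaches the UDF as a plain string, and calling `set()` on a string yields a set of its individual characters rather than a one-element set. A plain-Python sketch of the difference:

```python
# column_2 arrives in the UDF as a plain Python string, e.g. "8794"
y = "8794"
x = ["2345", "98756", "8794"]

# set() on a string splits it into individual characters...
assert set(y) == {"8", "7", "9", "4"}

# ...so the subtraction never removes the full value:
assert "8794" in set(x) - set(y)

# subtracting a one-element set does what the question intends:
assert set(x) - {y} == {"2345", "98756"}
```

Note also that the UDF returns a Python `set` while being registered with return type `'string'`; for the expected output it would need to return a list and be registered as `'array<string>'`.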
Expected result:
column_1 <Array> | column_2 <String>
--------------------------------------------------
["2345","98756"] | 8794
--------------------------------------------------
["8756","45678","987563"] | 1234
--------------------------------------------------
["8956","45678"] | 3475
--------------------------------------------------
How should I handle the cases where column_1 is an empty array and column_2 is null? Thanks.
udf_contains = udf(lambda x,y: [e for e in x if e != y], 'array<string>')
- jxc
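The one-liner above raises a `TypeError` when `column_1` is null, because `x` is then `None`. A null-safe sketch of the same logic (the commented registration line assumes `pyspark.sql.functions.udf`, as used in the question):

```python
def remove_match(arr, value):
    """Drop every element equal to `value`; pass a null array through unchanged."""
    if arr is None:
        return None  # null array -> null result
    # comparing each element against a None value keeps everything,
    # which also covers the "column_2 is null" case from the question
    return [e for e in arr if e != value]

# registration, as in the comment above (requires pyspark):
# udf_contains = udf(remove_match, 'array<string>')

assert remove_match(["2345", "98756", "8794"], "8794") == ["2345", "98756"]
assert remove_match(None, "8794") is None
assert remove_match([], None) == []
```

An empty array simply yields an empty array, so only the null-array case needs an explicit guard.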