In PySpark, how do I filter rows whose array column contains any word from a list?

Question

I am working with a PySpark DataFrame. It has a column named words (array<string>) that looks like this:

+---+--------------------------------------------------------------------------------+
| id|                                                                           words|
+---+--------------------------------------------------------------------------------+
|012|[content, type, multipart, alternative, boundaries, nextpart, nextpart, drama,..|
|013|[received, from, am5eur02ht120, eop, eur02, prod, protection, outlook by, pro...|
|014|[data, care, much, important, information, summer, care, send, faraway, forget..|

I also have a list of words:

list = ["protection", "content", "received"]

I want to keep only the rows whose words array contains any of the values in the list.

Expected output:

+---+--------------------------------------------------------------------------------+
| id|                                                                           words|
+---+--------------------------------------------------------------------------------+
|012|[content, type, multipart, alternative, boundaries, nextpart, nextpart, drama,..|
|013|[received, from, am5eur02ht120, eop, eur02, prod, protection, outlook by, pro...|

- Samiksha

1 Answer

- Alex Ott · Accepted Answer

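I think you can use the array_intersect function together with the size function to get what you need, something like this (untested, and I'm not sure about the exact code for F.lit(list)):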

>>> import pyspark.sql.functions as F

>>> df.show()
+----------------+
|           words|
+----------------+
|[content, word2]|
|      [111, 222]|
+----------------+

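>>> # The word list from the question, renamed so it does not shadow the built-in `list`
>>> word_list = ["protection", "content", "received"]
>>> # Build a literal array column from the Python list, one F.lit per word
>>> list_col = F.array(*[F.lit(w) for w in word_list])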
>>> df.filter(F.size(F.array_intersect(F.col("words"), list_col)) > 0).show()
+----------------+
|           words|
+----------------+
|[content, word2]|
+----------------+