在PySpark中,我有一个由两列组成的数据框:
我希望添加一个列
但是我遇到了错误:
+-----------+----------------------+
| str1 | array_of_str |
+-----------+----------------------+
| John | [mango, apple, ... |
| Tom | [mango, orange, ... |
| Matteo | [apple, banana, ... |
我希望添加一个列
concat_result
,该列包含array_of_str
中每个元素与str1
列内字符串的拼接结果。+-----------+----------------------+----------------------------------+
| str1 | array_of_str | concat_result |
+-----------+----------------------+----------------------------------+
| John | [mango, apple, ... | [mangoJohn, appleJohn, ... |
| Tom | [mango, orange, ... | [mangoTom, orangeTom, ... |
| Matteo | [apple, banana, ... | [appleMatteo, bananaMatteo, ... |
我正在尝试使用 map
来迭代数组:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, ArrayType
# START EXTRACT OF CODE
ret = (df
.select(['str1', 'array_of_str'])
.withColumn('concat_result', F.udf(
map(lambda x: x + F.col('str1'), F.col('array_of_str')), ArrayType(StringType))
)
)
return ret
# END EXTRACT OF CODE
但是我遇到了错误:
TypeError: argument 2 to map() must support iteration
udf
函数 - (除非你在Spark 2.4+中)。 - pault