我正在尝试从一个Python列表的列表中删除一个元素:
+---------------+
| sources|
+---------------+
| [62]|
| [7, 32]|
| [62]|
| [18, 36, 62]|
|[7, 31, 36, 62]|
| [7, 32, 62]|
我希望能够从以上列表中的每个列表中删除一个元素rm
。我编写了一个可以对列表中的列表执行此操作的函数:
def asdf(df, rm):
temp = df
for n in range(len(df)):
temp[n] = [x for x in df[n] if x != rm]
return(temp)
这将去除rm = 1
:
a = [[1,2,3],[1,2,3,4],[1,2,3,4,5]]
In: asdf(a,1)
Out: [[2, 3], [2, 3, 4], [2, 3, 4, 5]]
但是我无法将其用于DataFrame:
asdfUDF = udf(asdf, ArrayType(IntegerType()))
In: df.withColumn("src_ex", asdfUDF("sources", 32))
Out: Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
期望的行为:
In: df.withColumn("src_ex", asdfUDF("sources", 32))
Out:
+---------------+
| src_ex|
+---------------+
| [62]|
| [7]|
| [62]|
| [18, 36, 62]|
|[7, 31, 36, 62]|
| [7, 62]|
除了将新列添加到 PySpark DataFrame(
df
)之外,您有任何建议或想法吗?