Here is one way to do it in Scala:
import org.apache.spark.sql.functions.{split, explode, array_contains}
import spark.implicits._

val df = Seq(
  (0, "A,B"),
  (1, "A,C"),
  (2, "B"))
  .toDF("id", "value")

// split the comma-separated string into an array column
val withArrayDF = df.withColumn("array", split($"value", ",")).drop("value")

// collect the distinct values across all rows, sorted
val distinctValues = withArrayDF.select(explode($"array")).distinct.collect.map(_.getString(0)).sorted.toList

// one boolean column per distinct value, joined back together on "id"
distinctValues.map { ncol =>
  withArrayDF.withColumn(ncol, array_contains($"array", ncol)).drop("array")
}.reduce(_.join(_, "id"))
  .select("id", distinctValues: _*)
  .show
which prints:
+---+-----+-----+-----+
| id|    A|    B|    C|
+---+-----+-----+-----+
|  0| true| true|false|
|  1| true|false| true|
|  2|false| true|false|
+---+-----+-----+-----+
"最初的回答":而且Python版本为:
from functools import reduce
from pyspark.sql.functions import array_contains, split, col, explode

df = spark.createDataFrame(
    [(0, "A,B"),
     (1, "A,C"),
     (2, "B")], ["id", "value"])

# split the comma-separated string into an array column
withArrayDF = df.withColumn("array", split(df["value"], ",")).drop("value")

# collect the distinct values across all rows, sorted
distinctValues = sorted(
    row[0] for row in withArrayDF.select(explode("array")).distinct().collect())

# one DataFrame per distinct value, each carrying a boolean column
mappedDFs = [
    withArrayDF
        .withColumn(ncol, array_contains(col("array"), ncol))
        .drop("array")
    for ncol in distinctValues
]

# join them all back together on "id"
finalDF = reduce(lambda x, y: x.join(y, "id"), mappedDFs)
finalDF.show()
Output:
+---+-----+-----+-----+
| id|    A|    B|    C|
+---+-----+-----+-----+
|  0| true| true|false|
|  1| true|false| true|
|  2|false| true|false|
+---+-----+-----+-----+
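To see the core transformation independent of Spark, here is a plain-Python sketch of the same one-hot expansion (split, collect distinct values, test membership per row). The variable names mirror the Spark version but the code is otherwise a standalone illustration, not part of either answer above.

```python
rows = [(0, "A,B"), (1, "A,C"), (2, "B")]

# split each value into a list, mirroring split($"value", ",")
with_array = [(i, v.split(",")) for i, v in rows]

# collect the distinct values across all rows, sorted,
# mirroring explode + distinct + sorted
distinct_values = sorted({v for _, arr in with_array for v in arr})

# one boolean field per distinct value, mirroring array_contains
result = [
    {"id": i, **{v: v in arr for v in distinct_values}}
    for i, arr in with_array
]

for row in result:
    print(row)
```

Each dict in `result` corresponds to one row of the final DataFrame: `{'id': 0, 'A': True, 'B': True, 'C': False}` and so on.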