Splitting a string column of a Spark dataframe into multiple boolean columns

We have a Spark dataframe that looks like this:
   id | value 
------+--------
    0 | A,B
    1 | A,C
    2 | B

We would like to transform it into:
   id | A     | B     | C
------+-------+-------+-------+
    0 | True  | True  | False | 
    1 | True  | False | True  | 
    2 | False | True  | False | 

What is the best way to do this transformation?
2 Answers

Assuming this is your input dataframe:
df = spark.createDataFrame([(0,["A","B"]),(1,["A","C"]),(2, ["B"])],["id","value"])

Then, using `explode` and `pivot`, you get a table of counts and nulls:

from pyspark.sql.functions import explode

df2 = df.withColumn("x", explode(df.value)).drop("value").groupBy("id").pivot("x").count()
df2.show()

+---+----+----+----+
| id|   A|   B|   C|
+---+----+----+----+
|  0|   1|   1|null|
|  1|   1|null|   1|
|  2|null|   1|null|
+---+----+----+----+
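To see what `explode` followed by `pivot("x").count()` is doing, here is a pure-Python sketch of the same steps, with plain lists and dicts standing in for the dataframe (variable names are illustrative only, not part of any API):

```python
# input rows: (id, list of values), as in the answer's input dataframe
rows = [(0, ["A", "B"]), (1, ["A", "C"]), (2, ["B"])]

# explode: one (id, value) pair per list element
exploded = [(i, v) for i, vals in rows for v in vals]

# pivot + count: distinct values become columns;
# each cell holds the count for that (id, value), or None (null) if absent
columns = sorted({v for _, v in exploded})
pivoted = {i: {c: None for c in columns} for i, _ in rows}
for i, v in exploded:
    pivoted[i][v] = (pivoted[i][v] or 0) + 1
```

Each `None` in `pivoted` corresponds to a `null` cell in the table above.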

Finally, you just need to cast the values to booleans, for example:
from pyspark.sql.functions import col

for col_name in df2.columns[1:]:
    df2 = df2.withColumn(col_name, col(col_name).isNotNull())

df2.show()

+---+-----+-----+-----+
| id|    A|    B|    C|
+---+-----+-----+-----+
|  0| true| true|false|
|  1| true|false| true|
|  2|false| true|false|
+---+-----+-----+-----+
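The `isNotNull` cast in the loop amounts to replacing each cell with "was a count present?". A pure-Python sketch of that final step, using the counts-and-nulls table from above as literal data:

```python
# the pivoted counts-and-nulls table, written as a dict of dicts
pivoted = {0: {"A": 1, "B": 1, "C": None},
           1: {"A": 1, "B": None, "C": 1},
           2: {"A": None, "B": 1, "C": None}}

# isNotNull: a cell becomes True exactly when a count is present
booleans = {i: {c: v is not None for c, v in row.items()}
            for i, row in pivoted.items()}
```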


Here is one way using Scala:

val df = Seq(
  (0, "A,B"),
  (1, "A,C"),
  (2, "B"))
  .toDF("id", "value")

// store the array from split
val withArrayDF = df.withColumn("array", split($"value", ",")).drop("value")

// get sorted unique values for the whole dataset
val distinctValues = withArrayDF.select(explode($"array")).distinct.collect.map{_.getString(0)}.sorted.toList

// for each of A, B, C create a new column called ncol;
// true when ncol is present in array(i), otherwise false
distinctValues.map { ncol =>
  withArrayDF.withColumn(ncol, array_contains($"array", ncol)).drop("array")
}.reduce(_.join(_, "id")) // join all of A, B, C
 .select("id", distinctValues: _*)
 .show

And the output is:

+---+-----+-----+-----+ 
| id|    A|    B|    C| 
+---+-----+-----+-----+ 
|  0| true| true|false| 
|  1| true|false| true| 
|  2|false| true|false| 
+---+-----+-----+-----+ 

And the Python version:
from pyspark.sql.functions import array_contains, split, col, explode
from functools import reduce

df = spark.createDataFrame(
    [(0, "A,B"),
     (1, "A,C"),
     (2, "B")], ["id", "value"])

# store array from split
withArrayDF = df.withColumn("array", split(df["value"], ",")).drop("value")

# get sorted unique values for the whole dataset
distinctValues = sorted(
    row[0] for row in withArrayDF.select(explode("array")).distinct().collect())

# for each of A, B, C create a new column called ncol;
# true when ncol is present in the array, otherwise false
mappedDFs = [
    withArrayDF
        .withColumn(ncol, array_contains(col("array"), ncol))
        .drop("array")
    for ncol in distinctValues]

finalDF = reduce(lambda x,y: x.join(y, "id"), mappedDFs)
finalDF.show()

Output:

+---+-----+-----+-----+ 
| id|    A|    B|    C| 
+---+-----+-----+-----+ 
|  0| true| true|false| 
|  1| true|false| true| 
|  2|false| true|false| 
+---+-----+-----+-----+ 
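The core of this second approach — split the string, collect the distinct values, then test membership per value — can be sketched in plain Python, with dicts standing in for dataframes (names are illustrative only):

```python
# input rows as in the question: comma-separated value strings
rows = [(0, "A,B"), (1, "A,C"), (2, "B")]

# split: id -> list of values (the "array" column)
arrays = {i: s.split(",") for i, s in rows}

# sorted unique values across the whole dataset
distinct = sorted({v for vals in arrays.values() for v in vals})

# one boolean column per distinct value:
# is that value contained in this row's array? (array_contains)
result = {i: {c: c in vals for c in distinct} for i, vals in arrays.items()}
```

Unlike the `pivot` approach, this never produces nulls, so no final cast is needed; the trade-off is one join per distinct value.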
