Creating an empty array column of a specific type in a PySpark DataFrame

Question

12

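I am trying to add a column to a df that holds an empty array of arrays of strings, but I end up with a column of arrays of strings instead.

I tried this: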

import pyspark.sql.functions as F
df = df.withColumn('newCol', F.array([]))

How can I do this in PySpark?

- David Taub

2 Answers

8

Here is one way to do it:

>>> import pyspark.sql.functions as F
>>> myList = [('Alice', 1)]
>>> df = spark.createDataFrame(myList)
>>> df.schema
StructType(List(StructField(_1,StringType,true),StructField(_2,LongType,true)))
>>> df = df.withColumn('temp', F.array()).withColumn("newCol", F.array("temp")).drop("temp")
>>> df.schema
StructType(List(StructField(_1,StringType,true),StructField(_2,LongType,true),StructField(newCol,ArrayType(ArrayType(StringType,false),false),false)))
>>> df
DataFrame[_1: string, _2: bigint, newCol: array<array<string>>]
>>> df.collect()
[Row(_1=u'Alice', _2=1, newCol=[[]])]

- moriarty007


- David Zhao · Accepted Answer

Another way to create the empty-array column:

import pyspark.sql.functions as F
df = df.withColumn('newCol', F.array(F.array()))

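Because F.array() defaults to an array of strings, newCol will have the type ArrayType(ArrayType(StringType,false),false). If you need the inner array to hold something other than strings, you can cast the inner F.array() directly, as follows.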

import pyspark.sql.functions as F
import pyspark.sql.types as T
int_array_type = T.ArrayType(T.IntegerType())  # "array<integer>" also works
df = df.withColumn('newCol', F.array(F.array().cast(int_array_type)))

In this example, newCol will have the type ArrayType(ArrayType(IntegerType,true),false).