I'm assuming that in your example varA, varB, varC and varD always have the same size.
scala> case class Input(user_id : Integer,someString : String, varA : Array[Integer],varB : Array[Integer],varC : Array[String], varD : Array[String])
defined class Input
scala> case class Result(user_id : Integer,someString : String , varA : Integer,varB : Integer,varC : String, varD : String)
defined class Result
scala> val obj1 = Input(1,"example1",Array(0,2,5),Array(1,2,9),Array("a","b","c"),Array("red","green","yellow"))
obj1: Input = Input(1,example1,[Ljava.lang.Integer;@77c43ec2,[Ljava.lang.Integer;@3a332d08,[Ljava.lang.String;@5c1222da,[Ljava.lang.String;@114e051a)
scala> val obj2 = Input(2,"example2",Array(1,20,5),Array(9,null,6),Array("d","e","f"),Array("white","black","cyan"))
obj2: Input = Input(2,example2,[Ljava.lang.Integer;@326db38,[Ljava.lang.Integer;@50914458,[Ljava.lang.String;@339b73ae,[Ljava.lang.String;@1567ee0a)
scala> val input_df = sc.parallelize(Seq(obj1,obj2)).toDS
input_df: org.apache.spark.sql.Dataset[Input] = [user_id: int, someString: string ... 4 more fields]
scala> input_df.show
+-------+----------+----------+------------+---------+--------------------+
|user_id|someString|      varA|        varB|     varC|                varD|
+-------+----------+----------+------------+---------+--------------------+
|      1|  example1| [0, 2, 5]|   [1, 2, 9]|[a, b, c]|[red, green, yellow]|
|      2|  example2|[1, 20, 5]|[9, null, 6]|[d, e, f]|[white, black, cyan]|
+-------+----------+----------+------------+---------+--------------------+
scala> def getResult(row : Input) : Iterable[Result] = {
| val user_id = row.user_id
| val someString = row.someString
| val varA = row.varA
| val varB = row.varB
| val varC = row.varC
| val varD = row.varD
| val seq = for( i <- 0 until varA.size) yield {Result(user_id,someString,varA(i),varB(i),varC(i),varD(i))}
| seq.toSeq
| }
getResult: (row: Input)Iterable[Result]
scala> val resdf = input_df.flatMap{row => getResult(row)}
resdf: org.apache.spark.sql.Dataset[Result] = [user_id: int, someString: string ... 4 more fields]
scala> resdf.show
+-------+----------+----+----+----+------+
|user_id|someString|varA|varB|varC|  varD|
+-------+----------+----+----+----+------+
|      1|  example1|   0|   1|   a|   red|
|      1|  example1|   2|   2|   b| green|
|      1|  example1|   5|   9|   c|yellow|
|      2|  example2|   1|   9|   d| white|
|      2|  example2|  20|null|   e| black|
|      2|  example2|   5|   6|   f|  cyan|
+-------+----------+----+----+----+------+
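If varA, varB, varC or varD can differ in size, you need to handle those cases yourself.
You can iterate up to the largest size and, by handling the out-of-bounds exception, output null wherever a column has no value at that index.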
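A rough sketch of that idea, for what it's worth. It uses an explicit bounds check rather than catching the exception, which gives the same null padding. The names `len`, `getOrNull` and `getResultPadded` are made up for this illustration and are not part of the code above; the two case classes are repeated only so the snippet stands on its own.

```scala
// Same case classes as in the transcript above.
case class Input(user_id: Integer, someString: String, varA: Array[Integer],
                 varB: Array[Integer], varC: Array[String], varD: Array[String])
case class Result(user_id: Integer, someString: String, varA: Integer,
                  varB: Integer, varC: String, varD: String)

// Hypothetical helper: length of an array, treating a null array as empty.
def len(arr: Array[_]): Int = if (arr == null) 0 else arr.length

// Hypothetical helper: arr(i) if it exists, otherwise null.
def getOrNull[T >: Null](arr: Array[T], i: Int): T =
  if (arr != null && i < arr.length) arr(i) else null

// Emits one Result per index up to the longest array, padding short arrays with null.
def getResultPadded(row: Input): Seq[Result] = {
  val maxSize = Seq(len(row.varA), len(row.varB), len(row.varC), len(row.varD)).max
  (0 until maxSize).map { i =>
    Result(row.user_id, row.someString,
      getOrNull(row.varA, i), getOrNull(row.varB, i),
      getOrNull(row.varC, i), getOrNull(row.varD, i))
  }
}
```

Assuming the same session as above, this should drop in where getResult was used, i.e. input_df.flatMap{row => getResultPadded(row)}. I have only reasoned about it on plain Scala collections, not run it on a cluster.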
What about the arrays created in
case class Input(user_id: Integer, someString: String, varA: Array[Integer], varB: Array[Integer], varC: Array[String], varD: Array[String])
? - Mohd Zoubi