I need to aggregate data in a Spark DataFrame, implemented in Scala.
I have two datasets.
Dataset 1 holds a value (val1, val2, ...) for each "t" type, spread across separate columns (t1, t2, ...).
// Assuming a SparkSession is in scope as `spark`; toDF comes from its implicits.
import spark.implicits._

val data1 = Seq(
  ("1","111",200,"221",100,"331",1000),
  ("2","112",400,"222",500,"332",1000),
  ("3","113",600,"223",1000,"333",1000)
).toDF("id1","t1","val1","t2","val2","t3","val3")
data1.show()
+---+---+----+---+----+---+----+
|id1| t1|val1| t2|val2| t3|val3|
+---+---+----+---+----+---+----+
|  1|111| 200|221| 100|331|1000|
|  2|112| 400|222| 500|332|1000|
|  3|113| 600|223|1000|333|1000|
+---+---+----+---+----+---+----+
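For reference, this wide layout can be unpivoted into one (t, val) pair per row with Spark's stack expression; a minimal sketch (the data1Long name is just for illustration, and it is reused in the sketch at the end):

import org.apache.spark.sql.functions.{col, expr}

// Emit one output row per (t, val) column pair: (id1, t, val).
val data1Long = data1.select(
  col("id1"),
  expr("stack(3, t1, val1, t2, val2, t3, val3) as (t, val)")
)
data1Long.show()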
Dataset 2 represents the same data, but with a separate row for each "t" type.
val data2 = Seq(
  ("1","111",200),("1","221",100),("1","331",1000),
  ("2","112",400),("2","222",500),("2","332",1000),
  ("3","113",600),("3","223",1000),("3","333",1000)
).toDF("id*","t*","val*")
data2.show()
+---+---+----+
|id*| t*|val*|
+---+---+----+
|  1|111| 200|
|  1|221| 100|
|  1|331|1000|
|  2|112| 400|
|  2|222| 500|
|  2|332|1000|
|  3|113| 600|
|  3|223|1000|
|  3|333|1000|
+---+---+----+
Now I need to group by the (id, t, t*) fields and print the sum(val) and sum(val*) balances as separate records. The two balances should be equal.
My output should look like the following:
+---+---+--------+---+---------+
|id1| t |sum(val)| t*|sum(val*)|
+---+---+--------+---+---------+
|  1|111|     200|111|      200|
|  1|221|     100|221|      100|
|  1|331|    1000|331|     1000|
|  2|112|     400|112|      400|
|  2|222|     500|222|      500|
|  2|332|    1000|332|     1000|
|  3|113|     600|113|      600|
|  3|223|    1000|223|     1000|
|  3|333|    1000|333|     1000|
+---+---+--------+---+---------+
My idea is to break dataset 1 into one record per "t" type and then join it with dataset 2, as sketched below. Can you suggest a better approach that avoids the performance hit once the datasets grow large?
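For clarity, here is a minimal sketch of that approach, building on the data1Long unpivot above (the agg1/agg2/result names are just for illustration; the string-based column APIs accept the * in the names, though backticks would be needed inside SQL expressions):

import org.apache.spark.sql.functions.sum

// Aggregate both sides, then join on (id, t) so the two balances line up.
val agg1 = data1Long
  .groupBy("id1", "t")
  .agg(sum("val").as("sum(val)"))

val agg2 = data2
  .groupBy("id*", "t*")
  .agg(sum("val*").as("sum(val*)"))

val result = agg1
  .join(agg2, agg1("id1") === agg2("id*") && agg1("t") === agg2("t*"))
  .select(agg1("id1"), agg1("t"), agg1("sum(val)"), agg2("t*"), agg2("sum(val*)"))
  .orderBy("id1", "t")
result.show()

This produces the expected table above, but the join shuffles both sides, which is what concerns me as the data grows.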