Partitioning a Spark DataFrame based on a column

Question

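I am trying to partition a Spark DataFrame by column "b" using groupByKey(), but I end up with different groups in the same partition.

Here is the DataFrame and the code I am using: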

df:
+---+---+
|  a|  b|
+---+---+
|  4|  2|
|  5|  1|
|  1|  4|
|  2|  2|
+---+---+


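 // Number of distinct values in column "b"; going through df.rdd keeps this on the RDD API
 val partitions = df.rdd.map(x => x.getLong(1)).distinct().count().toInt
 // Key each row by column "b", then group into `partitions` hash partitions
 val df2 = df.rdd.map(r => (r.getLong(1), r)).groupByKey(partitions)
 val gb = df2.mapPartitions(iterator => {
             val rows = iterator.toList   // materializes (and exhausts) the iterator
             println(rows)
             rows.iterator                // return a fresh iterator, not the consumed one
             })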

The printed rows are:
Partition 1: List((2,CompactBuffer([4,2], [2,2])))
Partition 2: List((4,CompactBuffer([1,4])), (1,CompactBuffer([5,1])))

Groups 4 and 1 are in the same partition (2). I would like them to be in separate partitions. Do you know how to do that?

Desired output:
Partition 1: List((2,CompactBuffer([4,2], [2,2])))
Partition 2: List((4,CompactBuffer([1,4])))
Partition 3: List((1,CompactBuffer([5,1])))
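
For context on why groups 4 and 1 collide: groupByKey(numPartitions) uses a hash partitioner, which places a key in partition hashCode modulo numPartitions. Having as many partitions as distinct keys therefore does not guarantee one key per partition. The sketch below reproduces that arithmetic in plain Scala, without Spark, and shows one possible alternative: a dense key-to-index map. The names `HashPartitionSketch`, `hashPartition`, and `densePartitions` are hypothetical helpers made up for illustration, not Spark APIs.

```scala
// Sketch only: mimics the arithmetic of hash partitioning outside Spark.
object HashPartitionSketch {
  // Non-negative modulo, so negative hash codes still map to a valid index.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Partition index a hash partitioner would pick for a key.
  def hashPartition(key: Any, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)

  // Alternative: give every distinct key its own index in 0 until n.
  def densePartitions[K: Ordering](keys: Seq[K]): Map[K, Int] =
    keys.distinct.sorted.zipWithIndex.toMap

  def main(args: Array[String]): Unit = {
    val keys = Seq(2L, 1L, 4L, 2L)   // column "b" from the question
    val n = keys.distinct.size       // 3 partitions, as in the question
    keys.distinct.foreach { k =>
      println(s"key $k -> hash partition ${hashPartition(k, n)}")
    }
    println(densePartitions(keys))
  }
}
```

With three partitions, keys 4 and 1 both map to index 1, while key 2 maps to index 2 and index 0 stays empty. To get one group per partition in Spark, one option would be to subclass org.apache.spark.Partitioner so that getPartition looks the key up in such a map, and pass that partitioner to groupByKey in place of the partition count.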

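To give some background: I need to update rows in a DataFrame using data from all the other rows that have the same value in a specific column. map() alone is therefore not enough, so I am currently trying mapPartitions(), where each partition would contain all the rows that share the same value in that column. If you know a better way to do this, don't hesitate to tell me :)

Thanks a lot!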

ClydeX

- ClydeX

1 Answer

- Andreas Ryge · Accepted Answer

It looks like what you are trying to do can be achieved with window functions: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
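
A minimal sketch of the window-function approach, assuming Spark 2.x or later with a local SparkSession. The question does not say which per-group computation is needed, so the sum of "a" within each "b" group is only a stand-in, and `WindowSketch`, `withGroupSum`, and the column name `a_sum_in_group` are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

object WindowSketch {
  // Adds a column holding the sum of "a" over all rows sharing the same "b".
  // The aggregate is an assumption; any window-capable function could be used.
  def withGroupSum(df: DataFrame): DataFrame = {
    val byB = Window.partitionBy("b")   // one window per distinct value of "b"
    df.withColumn("a_sum_in_group", sum(col("a")).over(byB))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("window-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same data as in the question
    val df = Seq((4L, 2L), (5L, 1L), (1L, 4L), (2L, 2L)).toDF("a", "b")
    withGroupSum(df).show()

    spark.stop()
  }
}
```

Each row keeps its own values and gains access to an aggregate over its whole group, so there is no need to control which physical partition a group lands in.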