Apache Spark上的不相交集合

Question

Apache Spark上的不相交集合

algorithmapache-sparkmapreducegraph-theorydisjoint-sets

9

我正在尝试找到使用Apache Spark在大量数据上搜索不相交集合(连接组件/并查集)的算法。问题在于数据量很大，即使是图形顶点的原始表示形式也无法适应单个计算机的内存，边缘也不能满足内存要求。

源数据是HDFS上的图形边缘文本文件: "id1 \ tid2"。

id以字符串值而非整数表示。

我发现的朴素解决方案是:

取边缘的rdd -> [id1:id2] [id3:id4] [id1:id3]
按键分组的边缘 -> [id1:[id2;id3]][id3:[id4]]
对于每个记录将最小id设置为每个组 -> (flatMap) [id1:id1][id2:id1][id3:id1][id3:id3][id4:id3]
从第3阶段反向rdd [id2:id1] -> [id1:id2]
来自第3和4个阶段的rdd的leftOuterJoin
在第3步中rdd的大小不变的情况下重复从第2个阶段开始

但这会导致各节点之间传输大量的数据（洗牌）。

有什么建议？

- Puh

我认为GraphX应该已经内置了您所需的功能（链接：http://spark.apache.org/graphx/）。 - James Tobin

2个回答

0

除了@Marsellus Wallace的答案之外，以下是使用GraphX从边的RDD获取不相交集的完整代码。

val edges:RDD[(Long,Long)] = ???

val g = Graph.fromEdgeTuples(edges,-1L)

val disjointSets:RDD[Iterable[Long]] = g.connectedComponents()
  //Get tuples with (vertexId,parent vertexId)
  .vertices
  //Group by parent vertex Id so it aggregates the disjoint set
  .groupBy(_._2)
  .values
  .map(_.map(_._1))

- mberchon

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Marsellus Wallace · Accepted Answer

如果您正在处理图形相关的工作，我建议您查看以下这些库之一:

它们都提供了连接组件算法的开箱即用功能。 GraphX:

val graph: Graph = ...
val cc = graph.connectedComponents().vertices

GraphFrames:

val graph: GraphFrame = ...
val cc = graph.connectedComponents.run()
cc.select("id", "component").orderBy("component").show()