How do I print only the elements of a specific partition (for example, the 5th)?
val distData = sc.parallelize(1 to 50, 10)
Using Spark/Scala:
val data = 1 to 50
val distData = sc.parallelize(data,10)
// Keep only the partition with index 5 (indices are zero-based), then print on the driver.
distData.mapPartitionsWithIndex((index: Int, it: Iterator[Int]) => if (index == 5) it else Iterator.empty).collect().foreach(println)
Output (partition indices are zero-based, so index 5 is the sixth partition):
26
27
28
29
30
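To see why index 5 yields 26 to 30 while index 4 yields 21 to 25, here is a minimal plain-Python sketch. It assumes the range is cut into contiguous, nearly equal slices, which is consistent with the outputs shown in this thread; it is not Spark's own code, and `slice_bounds` and `partition_of` are hypothetical names used only for illustration.

```python
def slice_bounds(length, num_slices):
    """Return (start, end) offsets for each slice; end is exclusive."""
    return [((i * length) // num_slices, ((i + 1) * length) // num_slices)
            for i in range(num_slices)]


def partition_of(data, num_slices, index):
    """Return the elements that land in the slice with the given zero-based index."""
    start, end = slice_bounds(len(data), num_slices)[index]
    return list(data[start:end])


data = range(1, 51)  # same values as `1 to 50` in the Scala example
print(partition_of(data, 10, 4))  # [21, 22, 23, 24, 25]
print(partition_of(data, 10, 5))  # [26, 27, 28, 29, 30]
```

So "the 5th partition" in everyday counting corresponds to index 4, not index 5.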
>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> rdd.glom().collect()
[[1, 2], [3, 4]]
>>> rdd.glom().collect()[1]
[3, 4]
Edit: Scala example:
scala> val distData = sc.parallelize(1 to 50, 10)
scala> distData.glom().collect()(4)
res2: Array[Int] = Array(21, 22, 23, 24, 25)
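Note that `glom().collect()` brings every partition back to the driver before one is picked. For a large RDD it may be preferable to drop the unwanted partitions on the executors first. The following is a rough sketch of that idea for PySpark's `mapPartitionsWithIndex`; `only_partition` and `keep` are hypothetical helper names, and the Spark call is shown only in comments because it assumes a live SparkContext `sc`.

```python
def only_partition(target):
    """Build a function for rdd.mapPartitionsWithIndex that keeps one partition."""
    def keep(index, iterator):
        # Pass the target partition through untouched; yield nothing for the rest.
        return iterator if index == target else iter(())
    return keep


# With a live SparkContext `sc` (assumed, not created here):
#   rdd = sc.parallelize(range(1, 51), 10)
#   rdd.mapPartitionsWithIndex(only_partition(4)).collect()

# Plain-Python check of the helper, simulating an RDD with two partitions:
partitions = [[1, 2], [3, 4]]
kept = [x
        for i, part in enumerate(partitions)
        for x in only_partition(1)(i, iter(part))]
print(kept)  # [3, 4]
```

Only the selected partition's elements are then sent to the driver by `collect()`.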
You can achieve this by using a counter together with the foreachPartition() API.
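Below is a Java program that prints the contents of every partition:
import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

// conf is an existing SparkConf
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<Integer> myArray = context.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9));
JavaRDD<Integer> partitionedArray = myArray.repartition(2);
System.out.println("partitioned array size is " + partitionedArray.count());
// The function runs once per partition and prints every element in it.
partitionedArray.foreachPartition(new VoidFunction<Iterator<Integer>>() {
    public void call(Iterator<Integer> arg0) throws Exception {
        while (arg0.hasNext()) {
            System.out.println(arg0.next());
        }
    }
});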