如何在Spark数据框中仅针对特定字段使用“cube”？

Question

如何在Spark数据框中仅针对特定字段使用“cube”？

scalaapache-sparkdataframeapache-spark-sqlcube

7

我正在使用Spark 1.6.1，我有一个这样的数据框。

+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
|     scene_id|  action_id|       classifier|os_name|country|app_ver|   p0value|p1value|p2value|p3value|p4value|
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
|    test_home|scene_enter|        test_home|android|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test|
......

我希望能够通过使用立方操作获取以下数据框：

（按所有字段分组，但仅对“os_name”、“country”和“app_ver”字段进行立方）

+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
|     scene_id|  action_id|       classifier|os_name|country|app_ver|   p0value|p1value|p2value|p3value|p4value|cnt|
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
|    test_home|scene_enter|        test_home|android|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test|  9|
|    test_home|scene_enter|        test_home|   null|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test| 35|
|    test_home|scene_enter|        test_home|android|   null|  5.6.3|__OTHERS__|  false|   test|   test|   test| 98|
|    test_home|scene_enter|        test_home|android|     KR|   null|__OTHERS__|  false|   test|   test|   test|101|
|    test_home|scene_enter|        test_home|   null|   null|  5.6.3|__OTHERS__|  false|   test|   test|   test|301|
|    test_home|scene_enter|        test_home|   null|     KR|   null|__OTHERS__|  false|   test|   test|   test|225|
|    test_home|scene_enter|        test_home|android|   null|   null|__OTHERS__|  false|   test|   test|   test|312|
|    test_home|scene_enter|        test_home|   null|   null|   null|__OTHERS__|  false|   test|   test|   test|521|
......

我尝试过以下方法，但它似乎很慢且难看...

var cubed = df
  .cube($"scene_id", $"action_id", $"classifier", $"country", $"os_name", $"app_ver", $"p0value", $"p1value", $"p2value", $"p3value", $"p4value")
  .count
  .where("scene_id IS NOT NULL AND action_id IS NOT NULL AND classifier IS NOT NULL AND p0value IS NOT NULL AND p1value IS NOT NULL AND p2value IS NOT NULL AND p3value IS NOT NULL AND p4value IS NOT NULL")

有更好的解决方案吗？

- Yoon-seop Choe

谢谢，但是cube操作生成了NULL值 @CarlosVilchez... - Yoon-seop Choe

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- zero323 · Accepted Answer

我相信你无法完全避免问题，但有一个简单的方法可以减少问题的规模。这个想法是用一个占位符替换所有不应该被边缘化的列。

例如，如果你有一个DataFrame：

val df = Seq((1, 2, 3, 4, 5, 6)).toDF("a", "b", "c", "d", "e", "f")

如果您对由d和e边缘化并以a..c分组的立方体感兴趣，您可以将a..c的替代定义为：

import org.apache.spark.sql.functions.struct
import sparkSql.implicits._

// alias here may not work in Spark 1.6
val rest = struct(Seq($"a", $"b", $"c"): _*).alias("rest")

和 立方体：

val cubed =  Seq($"d", $"e")

// If there is a problem with aliasing rest it can done here.
val tmp = df.cube(rest.alias("rest") +: cubed: _*).count

快速筛选和选择应该处理其余部分：

tmp.where($"rest".isNotNull).select($"rest.*" +: cubed :+ $"count": _*)

带有以下结果：

+---+---+---+----+----+-----+
|  a|  b|  c|   d|   e|count|
+---+---+---+----+----+-----+
|  1|  2|  3|null|   5|    1|
|  1|  2|  3|null|null|    1|
|  1|  2|  3|   4|   5|    1|
|  1|  2|  3|   4|null|    1|
+---+---+---+----+----+-----+