Consider the following data:
Name | Flag
A | 0
A | 1
A | 0
B | 0
B | 1
B | 1
I would like to convert it to:
Name | Total | With Flag | Percentage
A | 3 | 1 | 33%
B | 3 | 2 | 66%
Preferably using Spark SQL.
For example:
import org.apache.spark.sql.functions._
import sqlContext.implicits._  // provides toDF and the $"..." column syntax outside spark-shell

val df = sc.parallelize(Seq(
  ("A", 0), ("A", 1), ("A", 0),
  ("B", 0), ("B", 1), ("B", 1)
)).toDF("Name", "Flag")

df.groupBy($"Name").agg(
  count("*").alias("total"),
  sum($"Flag").alias("with_flag"),
  // Do you really want to truncate here rather than, for example, round?
  mean($"Flag").multiply(100).cast("integer").alias("percentage"))
// +----+-----+---------+----------+
// |name|total|with_flag|percentage|
// +----+-----+---------+----------+
// |   A|    3|        1|        33|
// |   B|    3|        2|        66|
// +----+-----+---------+----------+
Or:
// Note: registerTempTable is deprecated since Spark 2.0 in favor of createOrReplaceTempView.
df.registerTempTable("df")
sqlContext.sql("""
  SELECT name, COUNT(*) total, SUM(flag) with_flag,
         CAST(AVG(flag) * 100 AS INT) percentage
  FROM df
  GROUP BY name""")
// +----+-----+---------+----------+
// |name|total|with_flag|percentage|
// +----+-----+---------+----------+
// |   A|    3|        1|        33|
// |   B|    3|        2|        66|
// +----+-----+---------+----------+
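Both versions above truncate the percentage, because the cast to an integer drops the fractional part. If rounding is closer to what you want, a minimal sketch could look like the following. It assumes Spark 2.x or later (so it uses `SparkSession` instead of `sqlContext`) and a local master; the session setup and the name `rounded` are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, round, sum}

// Assumption: Spark 2.x+ running locally; adjust master/appName for your setup.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("flag-percentage-rounded")
  .getOrCreate()
import spark.implicits._

// Same sample data as in the question.
val df = Seq(
  ("A", 0), ("A", 1), ("A", 0),
  ("B", 0), ("B", 1), ("B", 1)
).toDF("Name", "Flag")

val rounded = df.groupBy($"Name").agg(
  count("*").alias("total"),
  sum($"Flag").alias("with_flag"),
  // round(...) rounds to the nearest whole number first;
  // the cast then only changes the type from double to integer.
  round(avg($"Flag") * 100).cast("integer").alias("percentage"))

rounded.show()
```

With the sample data this should give 33 for A and 67 for B, whereas the truncating versions give 66 for B.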
df.groupBy($"Name").agg(avg(($"Flag" > 50).cast("int"))) - zero323
($"Flag" > 50).cast("int") creates a {0, 1} indicator variable. The rest is just an average. - zero323
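To make the indicator-variable idea from the comments concrete, here is a small sketch. The numeric `Flag` values, the threshold of 50 applied to them, and the names `scores` and `share` are made up for illustration; it also assumes Spark 2.x or later with a local `SparkSession`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Assumption: Spark 2.x+ running locally.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("flag-indicator-average")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: Flag holds a numeric value instead of 0/1.
val scores = Seq(
  ("A", 10), ("A", 60), ("A", 90), ("A", 20),
  ("B", 70), ("B", 30), ("B", 80), ("B", 55)
).toDF("Name", "Flag")

val share = scores.groupBy($"Name").agg(
  // ($"Flag" > 50) is a boolean column; casting it to int yields 1 or 0,
  // so the average is the fraction of rows above the threshold.
  avg(($"Flag" > 50).cast("int")).alias("share_above_50"))

share.show()
```

For this data the fraction is 0.5 for A (2 of 4 rows) and 0.75 for B (3 of 4 rows). Multiply by 100 and round or cast as needed to get a percentage.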