我有一个像这样简单的数据框:
rdd = sc.parallelize(
[
(0, "A", 223,"201603", "PORT"),
(0, "A", 22,"201602", "PORT"),
(0, "A", 422,"201601", "DOCK"),
(1,"B", 3213,"201602", "DOCK"),
(1,"B", 3213,"201601", "PORT"),
(2,"C", 2321,"201601", "DOCK")
]
)
df_data = sqlContext.createDataFrame(rdd, ["id","type", "cost", "date", "ship"])
df_data.show()
+---+----+----+------+----+
| id|type|cost| date|ship|
+---+----+----+------+----+
| 0| A| 223|201603|PORT|
| 0| A| 22|201602|PORT|
| 0| A| 422|201601|DOCK|
| 1| B|3213|201602|DOCK|
| 1| B|3213|201601|PORT|
| 2| C|2321|201601|DOCK|
+---+----+----+------+----+
我需要按日期进行数据透视:
df_data.groupby(df_data.id, df_data.type).pivot("date").avg("cost").show()
+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
| 2| C|2321.0| null| null|
| 0| A| 422.0| 22.0| 223.0|
| 1| B|3213.0|3213.0| null|
+---+----+------+------+------+
一切都按预期进行。但现在我需要对其进行透视并获取一个非数值列:
df_data.groupby(df_data.id, df_data.type).pivot("date").avg("ship").show()
当然,我会得到一个异常:
AnalysisException: u'"ship" is not a numeric column. Aggregation function can only be applied on a numeric column.;'
我想生成类似以下的内容:
+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
| 2| C|DOCK | null| null|
| 0| A| DOCK | PORT| DOCK|
| 1| B|DOCK |PORT | null|
+---+----+------+------+------+
使用 pivot
可以实现吗?
collect_set
来保留所有的ship
值。 - Jacek Laskowskimax(struct
替换为collect_set
,然后你就完成了。不过我正在寻找一个机会将其作为一个完整的答案使用。你知道有哪些问题需要这样的回答吗?;-) - Jacek Laskowski