I have a PySpark DataFrame that looks like this:
Id timestamp col1 col2
abc 789 0 1
def 456 1 0
abc 123 1 0
def 321 0 1
I want to group (or partition) by the Id column, then build lists of col1 and col2 ordered by timestamp:
Id timestamp col1 col2
abc [123,789] [1,0] [0,1]
def [321,456] [0,1] [1,0]
My approach:
from pyspark.sql import functions as F
from pyspark.sql import Window as W

window_spec = W.partitionBy("Id").orderBy("timestamp")
ranged_spec = window_spec.rowsBetween(W.unboundedPreceding, W.unboundedFollowing)

df1 = df.withColumn("col1", F.collect_list("col1").over(window_spec))\
        .withColumn("col2", F.collect_list("col2").over(window_spec))
df1.show()
But this does not return the lists of col1 and col2 that I want.