背景
假设你处理时间序列数据。你需要使用多个具有不同窗口规范的窗口函数来得到所需结果。结果可能类似于单个spark列表达式,例如区间标识符。
现状
通常情况下,我不使用df.withColumn
存储中间结果,而是链接/堆叠列表达式,并相信Spark会找到最有效的DAG(在处理DataFrame时)。
可重现示例
但是,在以下示例中(PySpark 2.4.4独立版),使用df.withColumn
存储中间结果可以减少DAG的复杂性。让我们考虑以下测试设置:
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
dfp = pd.DataFrame(
{
"col1": np.random.randint(0, 5, size=100),
"col2": np.random.randint(0, 5, size=100),
"col3": np.random.randint(0, 5, size=100),
"col4": np.random.randint(0, 5, size=100),
}
)
df = spark.createDataFrame(dfp)
df.show(5)
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 2| 4| 1|
| 0| 2| 3| 0|
| 2| 0| 1| 0|
| 4| 1| 1| 2|
| 1| 3| 0| 4|
+----+----+----+----+
only showing top 5 rows
计算过程是任意的。基本上,我们有两个窗口规范和三个计算步骤。这三个计算步骤相互依赖,并使用交替的窗口规范:
w1 = Window.partitionBy("col1").orderBy("col2")
w2 = Window.partitionBy("col3").orderBy("col4")
# first step, arbitrary window func over 1st window
step1 = F.lag("col3").over(w1)
# second step, arbitrary window func over 2nd window with step 1
step2 = F.lag(step1).over(w2)
# third step, arbitrary window func over 1st window with step 2
step3 = F.when(step2 > 1, F.max(step2).over(w1))
df_result = df.withColumn("result", step3)
通过df_result.explain()
检查物理计划,可以发现有4个交换和排序操作!但是,实际上只需要3次,因为我们只更改了窗口规范两次。
df_result.explain()
== Physical Plan ==
*(7) Project [col1#0L, col2#1L, col3#2L, col4#3L, CASE WHEN (_we0#25L > 1) THEN _we1#26L END AS result#22L]
+- Window [lag(_w0#23L, 1, null) windowspecdefinition(col3#2L, col4#3L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _we0#25L], [col3#2L], [col4#3L ASC NULLS FIRST]
+- *(6) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(col3#2L, 200)
+- *(5) Project [col1#0L, col2#1L, col3#2L, col4#3L, _w0#23L, _we1#26L]
+- Window [max(_w1#24L) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS _we1#26L], [col1#0L], [col2#1L ASC NULLS FIRST]
+- *(4) Sort [col1#0L ASC NULLS FIRST, col2#1L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(col1#0L, 200)
+- *(3) Project [col1#0L, col2#1L, col3#2L, col4#3L, _w0#23L, _w1#24L]
+- Window [lag(_w0#27L, 1, null) windowspecdefinition(col3#2L, col4#3L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _w1#24L], [col3#2L], [col4#3L ASC NULLS FIRST]
+- *(2) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(col3#2L, 200)
+- Window [lag(col3#2L, 1, null) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _w0#27L, lag(col3#2L, 1, null) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _w0#23L], [col1#0L], [col2#1L ASC NULLS FIRST]
+- *(1) Sort [col1#0L ASC NULLS FIRST, col2#1L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(col1#0L, 200)
+- Scan ExistingRDD[col1#0L,col2#1L,col3#2L,col4#3L]
改进
为了获得更好的DAG,我们略微修改了代码,使用withColumn
存储step2
的列表达式,并只传递该列的引用。新的逻辑计划确实只需要3个shuffle!
w1 = Window.partitionBy("col1").orderBy("col2")
w2 = Window.partitionBy("col3").orderBy("col4")
# first step, arbitrary window func
step1 = F.lag("col3").over(w1)
# second step, arbitrary window func over 2nd window with step 1
step2 = F.lag(step1).over(w2)
# save temporary
df = df.withColumn("tmp_variable", step2)
step2 = F.col("tmp_variable")
# third step, arbitrary window func over 1st window with step 2
step3 = F.when(step2 > 1, F.max(step2).over(w1))
df_result = df.withColumn("result", step3).drop("tmp_variable")
df_result.explain()
== Physical Plan ==
*(5) Project [col1#0L, col2#1L, col3#2L, col4#3L, CASE WHEN (tmp_variable#33L > 1) THEN _we0#42L END AS result#41L]
+- Window [max(tmp_variable#33L) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS _we0#42L], [col1#0L], [col2#1L ASC NULLS FIRST]
+- *(4) Sort [col1#0L ASC NULLS FIRST, col2#1L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(col1#0L, 200)
+- *(3) Project [col1#0L, col2#1L, col3#2L, col4#3L, tmp_variable#33L]
+- Window [lag(_w0#34L, 1, null) windowspecdefinition(col3#2L, col4#3L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS tmp_variable#33L], [col3#2L], [col4#3L ASC NULLS FIRST]
+- *(2) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(col3#2L, 200)
+- Window [lag(col3#2L, 1, null) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _w0#34L], [col1#0L], [col2#1L ASC NULLS FIRST]
+- *(1) Sort [col1#0L ASC NULLS FIRST, col2#1L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(col1#0L, 200)
+- Scan ExistingRDD[col1#0L,col2#1L,col3#2L,col4#3L]
相关性
我的原始示例更加复杂,导致DAG的差异更大(在实际数据上慢了10倍)。
问题
有人对这种奇怪的行为有答案吗?我认为堆叠/链接列表达式是最佳实践,因为它允许Spark最有效地优化中间步骤(与为中间结果创建引用相反)。
withColumn
等同于子查询 - 在某些情况下(非分析查询),没有区别。但是,分析查询是另一回事。在许多情况下,子查询是必需的。Spark能否优化这个?可能会...为什么结果不同?因为本地表达式和跨逻辑节点(这里是子查询)的优化不同。 - 10465355