链式的Spark列表达式与不同的窗口规范生成了低效的DAG。

Question

链式的Spark列表达式与不同的窗口规范生成了低效的DAG。

pythondataframeapache-sparkpysparkdirected-acyclic-graphs

12

背景

假设你处理时间序列数据。你需要使用多个具有不同窗口规范的窗口函数来得到所需结果。结果可能类似于单个spark列表达式，例如区间标识符。

现状

通常情况下，我不使用df.withColumn存储中间结果，而是链接/堆叠列表达式，并相信Spark会找到最有效的DAG（在处理DataFrame时）。

可重现示例

但是，在以下示例中（PySpark 2.4.4独立版），使用df.withColumn存储中间结果可以减少DAG的复杂性。让我们考虑以下测试设置：

import pandas as pd
import numpy as np

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

dfp = pd.DataFrame(
    {
        "col1": np.random.randint(0, 5, size=100),
        "col2": np.random.randint(0, 5, size=100),
        "col3": np.random.randint(0, 5, size=100),
        "col4": np.random.randint(0, 5, size=100),        
    }
)

df = spark.createDataFrame(dfp)
df.show(5)

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|   2|   4|   1|
|   0|   2|   3|   0|
|   2|   0|   1|   0|
|   4|   1|   1|   2|
|   1|   3|   0|   4|
+----+----+----+----+
only showing top 5 rows

计算过程是任意的。基本上，我们有两个窗口规范和三个计算步骤。这三个计算步骤相互依赖，并使用交替的窗口规范：

w1 = Window.partitionBy("col1").orderBy("col2")
w2 = Window.partitionBy("col3").orderBy("col4")

# first step, arbitrary window func over 1st window
step1 = F.lag("col3").over(w1)

# second step, arbitrary window func over 2nd window with step 1
step2 = F.lag(step1).over(w2)

# third step, arbitrary window func over 1st window with step 2
step3 = F.when(step2 > 1, F.max(step2).over(w1))

df_result = df.withColumn("result", step3)

通过df_result.explain()检查物理计划，可以发现有4个交换和排序操作！但是，实际上只需要3次，因为我们只更改了窗口规范两次。

df_result.explain()

== Physical Plan ==
*(7) Project [col1#0L, col2#1L, col3#2L, col4#3L, CASE WHEN (_we0#25L > 1) THEN _we1#26L END AS result#22L]
+- Window [lag(_w0#23L, 1, null) windowspecdefinition(col3#2L, col4#3L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _we0#25L], [col3#2L], [col4#3L ASC NULLS FIRST]
   +- *(6) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(col3#2L, 200)
         +- *(5) Project [col1#0L, col2#1L, col3#2L, col4#3L, _w0#23L, _we1#26L]
            +- Window [max(_w1#24L) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS _we1#26L], [col1#0L], [col2#1L ASC NULLS FIRST]
               +- *(4) Sort [col1#0L ASC NULLS FIRST, col2#1L ASC NULLS FIRST], false, 0
                  +- Exchange hashpartitioning(col1#0L, 200)
                     +- *(3) Project [col1#0L, col2#1L, col3#2L, col4#3L, _w0#23L, _w1#24L]
                        +- Window [lag(_w0#27L, 1, null) windowspecdefinition(col3#2L, col4#3L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _w1#24L], [col3#2L], [col4#3L ASC NULLS FIRST]
                           +- *(2) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC NULLS FIRST], false, 0
                              +- Exchange hashpartitioning(col3#2L, 200)
                                 +- Window [lag(col3#2L, 1, null) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _w0#27L, lag(col3#2L, 1, null) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _w0#23L], [col1#0L], [col2#1L ASC NULLS FIRST]
                                    +- *(1) Sort [col1#0L ASC NULLS FIRST, col2#1L ASC NULLS FIRST], false, 0
                                       +- Exchange hashpartitioning(col1#0L, 200)
                                          +- Scan ExistingRDD[col1#0L,col2#1L,col3#2L,col4#3L]

改进

为了获得更好的DAG，我们略微修改了代码，使用withColumn存储step2的列表达式，并只传递该列的引用。新的逻辑计划确实只需要3个shuffle！

w1 = Window.partitionBy("col1").orderBy("col2")
w2 = Window.partitionBy("col3").orderBy("col4")

# first step, arbitrary window func
step1 = F.lag("col3").over(w1)

# second step, arbitrary window func over 2nd window with step 1
step2 = F.lag(step1).over(w2)

# save temporary
df = df.withColumn("tmp_variable", step2)
step2 = F.col("tmp_variable")

# third step, arbitrary window func over 1st window with step 2
step3 = F.when(step2 > 1, F.max(step2).over(w1))

df_result = df.withColumn("result", step3).drop("tmp_variable")
df_result.explain()

== Physical Plan ==
*(5) Project [col1#0L, col2#1L, col3#2L, col4#3L, CASE WHEN (tmp_variable#33L > 1) THEN _we0#42L END AS result#41L]
+- Window [max(tmp_variable#33L) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS _we0#42L], [col1#0L], [col2#1L ASC NULLS FIRST]
   +- *(4) Sort [col1#0L ASC NULLS FIRST, col2#1L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(col1#0L, 200)
         +- *(3) Project [col1#0L, col2#1L, col3#2L, col4#3L, tmp_variable#33L]
            +- Window [lag(_w0#34L, 1, null) windowspecdefinition(col3#2L, col4#3L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS tmp_variable#33L], [col3#2L], [col4#3L ASC NULLS FIRST]
               +- *(2) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC NULLS FIRST], false, 0
                  +- Exchange hashpartitioning(col3#2L, 200)
                     +- Window [lag(col3#2L, 1, null) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _w0#34L], [col1#0L], [col2#1L ASC NULLS FIRST]
                        +- *(1) Sort [col1#0L ASC NULLS FIRST, col2#1L ASC NULLS FIRST], false, 0
                           +- Exchange hashpartitioning(col1#0L, 200)
                              +- Scan ExistingRDD[col1#0L,col2#1L,col3#2L,col4#3L]

问题

有人对这种奇怪的行为有答案吗？我认为堆叠/链接列表达式是最佳实践，因为它允许Spark最有效地优化中间步骤（与为中间结果创建引用相反）。

- pansen

1

“我曾经认为堆叠/链接列表达式是最佳实践，因为它可以让Spark最有效地优化中间步骤” - 这不是真的。withColumn等同于子查询 - 在某些情况下（非分析查询），没有区别。但是，分析查询是另一回事。在许多情况下，子查询是必需的。Spark能否优化这个？可能会...为什么结果不同？因为本地表达式和跨逻辑节点（这里是子查询）的优化不同。 - 10465355

@10465355说感谢您的回复。您有关于这个主题的进一步文档/解释的提示吗？我发现在网上很难找到适当和最新的资源。 - pansen

我认为实际上并没有记录，除了像SQL标准这样的东西（Spark试图遵循）。你最好检查源代码，因为它是内部组件，可能会随版本而改变。 - 10465355

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Ehsan · Answer 1

如果我们查看分析逻辑计划(by=df_result.explain(True))，我们可以看到，虽然我们没有tmp_variable，但由于在创建逻辑计划的过程中数据集/数据帧/表的**惰性评估**，分析器会假定该列存在（惰性）。由于这种假设，现在需要构建比前一种情况少2个辅助窗口才能实现相同的结果。实际上，通过遵循解析逻辑计划，我们可以看到，在创建tmp_variable时，分析器需要构建更少的未评估窗口(windowspecdefinition)，其中它大多数情况下执行简单的项目选择而不是在其推送方式上构建窗口。

链式的Spark列表达式与不同的窗口规范生成了低效的DAG。

背景

现状

可重现示例

改进

相关性

问题