Spark 2.4.0 raises "Detected implicit cartesian product" on a left join with an empty right DataFrame

10

It appears that somewhere between Spark 2.2.1 and Spark 2.4.0, the behavior of a left join with an empty right-side DataFrame changed from succeeding to throwing an "AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans".

For example:

import org.apache.spark.sql.functions.lit
import spark.implicits._  // for .toDF on local collections

val emptyDf = spark.emptyDataFrame
  .withColumn("id", lit(0L))
  .withColumn("brand", lit(""))
val nonemptyDf = ((1L, "a") :: Nil).toDF("id", "size")
val neje = nonemptyDf.join(emptyDf, Seq("id"), "left")
neje.show()

In 2.2.1, the result is:

+---+----+-----+
| id|size|brand|
+---+----+-----+
|  1|   a| null|
+---+----+-----+

In 2.4.0, however, I get the following exception:

org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;

Here is the full plan explanation for the latter:

> neje.explain(true)

== Parsed Logical Plan ==
'Join UsingJoin(LeftOuter,List(id))
:- Project [_1#275L AS id#278L, _2#276 AS size#279]
:  +- LocalRelation [_1#275L, _2#276]
+- Project [id#53L,  AS brand#55]
   +- Project [0 AS id#53L]
      +- LogicalRDD false

== Analyzed Logical Plan ==
id: bigint, size: string, brand: string
Project [id#278L, size#279, brand#55]
+- Join LeftOuter, (id#278L = id#53L)
   :- Project [_1#275L AS id#278L, _2#276 AS size#279]
   :  +- LocalRelation [_1#275L, _2#276]
   +- Project [id#53L,  AS brand#55]
      +- Project [0 AS id#53L]
         +- LogicalRDD false

== Optimized Logical Plan ==
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;

Other observations:

  • If only the left-side DataFrame is empty, the join succeeds.
  • A similar behavior change applies to right joins with an empty left-side DataFrame.
  • Interestingly, though, if both DataFrames are empty, an inner join fails with an AnalysisException in both versions.
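The first observation above can be sketched by swapping the two sides of the example (a sketch reusing the emptyDf/nonemptyDf definitions from the question; assumes an active SparkSession named spark):

```scala
// Sketch: put the EMPTY DataFrame on the LEFT side instead.
// Per the observation above, this left join succeeds even on 2.4.0.
val enj = emptyDf.join(nonemptyDf, Seq("id"), "left")
enj.show()  // zero rows; schema combines columns from both frames
```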

Is this a regression, or is it by design? I think the previous behavior was more correct. I could not find anything relevant in the Spark release notes, Spark JIRA issues, or StackOverflow questions.

3 Answers

15

I didn't have exactly the same problem as you, but I hit the same error, and I worked around it by explicitly allowing cross joins:

spark.conf.set("spark.sql.crossJoin.enabled", "true")
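Put in context, the workaround looks like this (a sketch based on the question's example; the config must be set before the join is planned):

```scala
// Allow implicit cartesian products for this session,
// then retry the original left join from the question.
spark.conf.set("spark.sql.crossJoin.enabled", "true")

val neje = nonemptyDf.join(emptyDf, Seq("id"), "left")
neje.show()  // no longer throws the AnalysisException
```

The same setting can also be supplied when the session is built, via SparkSession.builder.config("spark.sql.crossJoin.enabled", "true").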

1
I have run into this problem several times. Most recently it was because I was using a DataFrame in multiple operations, so it was being recomputed each time. Once I cached it at the source, the error went away.
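A sketch of that fix, assuming a source DataFrame that is reused by several downstream joins (the path and the otherDf1/otherDf2 names are illustrative, not from the original code):

```scala
// Cache the source DataFrame so each downstream action reuses the
// materialized data instead of re-running its lineage every time.
val df = spark.read.parquet("/path/to/source").cache()
df.count()  // optional: force eager materialization

val joined1 = df.join(otherDf1, Seq("id"), "left")
val joined2 = df.join(otherDf2, Seq("id"), "left")
```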

0

Change

val neje = nonemptyDf.join(emptyDf, Seq("id"), "left")

to

val neje = nonemptyDf.join(emptyDf, nonemptyDf("id") === emptyDf("id"), "left")
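Note that, unlike the Seq("id") form, an explicit column-equality condition keeps both id columns in the result; a sketch of cleaning that up (using the emptyDf/nonemptyDf from the question):

```scala
// Explicit condition keeps id from BOTH sides; drop the right-hand copy
// so the output schema matches the original USING-style join.
val neje = nonemptyDf
  .join(emptyDf, nonemptyDf("id") === emptyDf("id"), "left")
  .drop(emptyDf("id"))
neje.show()
```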
