Java: Testing Spark-SQL

I wrote tests for my application using spark-sql, but they are failing. Tests that do not use the spark-sql module (plain RDD code) all pass.

Library versions:

  • JUnit: 4.12
  • Spark Core: 2.2.1
  • Spark SQL: 2.2.1

The test is:

List<Claim> claims = FileResource.loadListObjOfFile("cg-32-claims-load.json", Claim[].class);  // load 1000 claims from a JSON test resource
assertTrue(claims.size() == 1000L);

Dataset<Claim> dataset = getSparkSession().createDataset(claims, Encoders.bean(Claim.class));  // typed Dataset from the bean list
assertTrue(dataset.count() == 1000L);

// group claims by member id and map each group to a ResultBean
Dataset<ResultBean> resDataSet = dataset
        .groupByKey((MapFunction<Claim, Integer>) Claim::getMbrId, Encoders.INT())
        .mapGroups((MapGroupsFunction<Integer, Claim, ResultBean>) (key, values) -> new ResultBean(), Encoders.bean(ResultBean.class));

assertTrue(resDataSet.count() == 42L);
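
The getSparkSession() helper is not shown in the question; for context, a minimal local-mode session for JUnit tests could look like the sketch below (the master("local[2]") setting and the TestBase class name are assumptions, not the asker's actual setup):

import org.apache.spark.sql.SparkSession;

// Minimal sketch of a shared local SparkSession for unit tests
// (assumed setup, not the asker's actual getSparkSession() implementation).
public abstract class TestBase {

    private static SparkSession sparkSession;

    protected static SparkSession getSparkSession() {
        if (sparkSession == null) {
            sparkSession = SparkSession.builder()
                    .appName("spark-sql-test")
                    .master("local[2]")                   // run locally with two threads
                    .config("spark.ui.enabled", "false")  // tests do not need the web UI
                    .getOrCreate();
        }
        return sparkSession;
    }
}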

The exception is thrown on the last assertion, resDataSet.count() == 42L. The application only throws it in the tests; the same code run from a simple main class works fine.
It looks like Spark SQL fails to initialize the Java bean for some reason.
Stack trace:
+- AppendColumns <function1>, initializejavabean(newInstance(class test.input.Claim), (setDiag1,diag1#28.toString), .... [input[0, java.lang.Integer, true].intValue AS value#84]
   +- LocalTableScan [birthDt#23, birthDtStr#24, clmFromDt#25, .... pcdCd#45, plcOfSvcCd#46, ... 2 more fields]

    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    ....
    Caused by: java.lang.AssertionError: index (23) should < 23
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:133)
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:352)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply2_7$(generated.java:52)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:600)
    at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:41)
    at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:41)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows$lzycompute(LocalTableScanExec.scala:41)
    at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows(LocalTableScanExec.scala:36)
    at org.apache.spark.sql.execution.LocalTableScanExec.rdd$lzycompute(LocalTableScanExec.scala:48)
    at org.apache.spark.sql.execution.LocalTableScanExec.rdd(LocalTableScanExec.scala:48)
    at org.apache.spark.sql.execution.LocalTableScanExec.doExecute(LocalTableScanExec.scala:52)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.AppendColumnsExec.doExecute(objects.scala:272)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:88)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:124)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:115)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
    ... 86 more

How many columns does your DataSet have? https://forums.databricks.com/questions/340/how-do-i-create-a-spark-sql-table-with-columns-gre.html Are you hitting the maximum number of columns for a Scala case class? - Sudev Ambadi
Hmm... this bean has 23 columns... - yazabara
1 Answer

This error shows up when something is wrong with the bean class. Checking that your bean class has a getter and a setter for every field can help resolve it.
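
For reference, Encoders.bean expects a public class with a public no-arg constructor and a matching getter/setter pair for every property; a minimal sketch of such a bean follows (the fields here are illustrative, not the asker's actual Claim class):

import java.io.Serializable;

// Illustrative bean shape for Encoders.bean (not the asker's real Claim class):
// a public no-arg constructor plus a getter and setter for every field.
public class Claim implements Serializable {
    private Integer mbrId;
    private String diag1;

    public Claim() { }                        // required no-arg constructor

    public Integer getMbrId() { return mbrId; }
    public void setMbrId(Integer mbrId) { this.mbrId = mbrId; }

    public String getDiag1() { return diag1; }
    public void setDiag1(String diag1) { this.diag1 = diag1; }
}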

Hope this helps anyone who runs into this issue!
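
A quick way to spot such a mismatch is to compare the schema Spark infers from the bean with the bean's actual fields, for example (a debugging sketch, not part of the original answer):

import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

// Debugging sketch: print the schema Spark SQL infers from the bean.
// If a property is missing a getter or setter, the inferred column list
// may not match the bean's fields, which can surface later as errors
// like "index (23) should < 23" in generated code.
Encoder<Claim> encoder = Encoders.bean(Claim.class);
System.out.println("Inferred columns: " + encoder.schema().fields().length);
encoder.schema().printTreeString();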

