Spark Java: Create a new Dataset with a given schema


I have this Scala code, and it works fine:

import org.apache.spark.sql.types._

val schema = StructType(Array(
    StructField("field1", StringType, true),
    StructField("field2", TimestampType, true),
    StructField("field3", DoubleType, true),
    StructField("field4", StringType, true),
    StructField("field5", StringType, true)
))

val df = spark.read
    // some options
    .schema(schema)
    .load(myEndpoint)

I would like to do something similar in Java, so I have the following code:

final StructType schema = new StructType(new StructField[] {
    new StructField("field1", new StringType(), true, new Metadata()),
    new StructField("field2", new TimestampType(), true, new Metadata()),
    new StructField("field3", new StringType(), true, new Metadata()),
    new StructField("field4", new StringType(), true, new Metadata()),
    new StructField("field5", new StringType(), true, new Metadata())
});

Dataset<Row> df = spark.read()
    // some options
    .schema(schema)
    .load(myEndpoint);

But this gives me the following error:

Exception in thread "main" scala.MatchError: org.apache.spark.sql.types.StringType@37c5b8e8 (of class org.apache.spark.sql.types.StringType)

The schemas look fine to me, so I'm not sure what the problem is here.

spark.read().load(myEndpoint).printSchema();
root
 |-- field5: string (nullable = true)
 |-- field2: timestamp (nullable = true)
 |-- field1: string (nullable = true)
 |-- field4: string (nullable = true)
 |-- field3: string (nullable = true)

schema.printTreeString();
root
 |-- field1: string (nullable = true)
 |-- field2: timestamp (nullable = true)
 |-- field3: string (nullable = true)
 |-- field4: string (nullable = true)
 |-- field5: string (nullable = true)

EDIT:

Here is a sample of the data:

spark.read().load(myEndpoint).show(false);
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
|field5                                                         |field2             |field1       |field4        |field3   |
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"}  |2018-01-20 16:54:50|SOME_VALUE   |SOME_VALUE    |0.0      |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"}  |2018-01-20 16:58:50|SOME_VALUE   |SOME_VALUE    |50.0     |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"}  |2018-01-20 17:00:50|SOME_VALUE   |SOME_VALUE    |20.0     |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"}  |2018-01-20 18:04:50|SOME_VALUE   |SOME_VALUE    |10.0     |
 ...
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
1 Answer


What worked for me, on Spark 2.3.1, was to use the static methods and fields of the DataTypes class instead of the constructors. (Types such as StringType are Scala singleton objects, and Spark internally pattern-matches on those singletons, which is why instances created with the Java constructors end up raising the MatchError.)

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("field1", DataTypes.StringType, true),
        DataTypes.createStructField("field2", DataTypes.TimestampType, true),
        DataTypes.createStructField("field3", DataTypes.StringType, true),
        DataTypes.createStructField("field4", DataTypes.StringType, true),
        DataTypes.createStructField("field5", DataTypes.StringType, true)
});
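As an aside (not part of the original answer), Spark's StructType also exposes a fluent add(...) API in Java, which builds the same kind of schema while still avoiding the DataType constructors:

// Alternative sketch: build the schema with StructType.add(name, type, nullable).
StructType schema = new StructType()
        .add("field1", DataTypes.StringType, true)
        .add("field2", DataTypes.TimestampType, true)
        .add("field3", DataTypes.StringType, true)
        .add("field4", DataTypes.StringType, true)
        .add("field5", DataTypes.StringType, true);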

Hi, thanks for your answer. You were right about the DoubleType, but that wasn't the problem. Using the constructors instead of the static DataTypes methods was the issue. Thanks a lot! - Nakeuh
In case anyone is wondering, you can apply the schema defined above to the DataFrame by simply chaining ".schema(schema)" into the read statement. - Reza Keshavarz
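To tie the answer and that comment together, here is a minimal end-to-end sketch. It is not from the original thread: the local SparkSession setup, the "parquet" format, and the "myEndpoint" path are placeholder assumptions standing in for whatever source the question actually reads from.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SchemaExample {
    public static void main(String[] args) {
        // Assumption: a local session purely for illustration.
        SparkSession spark = SparkSession.builder()
                .appName("schema-example")
                .master("local[*]")
                .getOrCreate();

        // Schema built with the static factory methods, as in the answer above.
        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("field1", DataTypes.StringType, true),
                DataTypes.createStructField("field2", DataTypes.TimestampType, true),
                DataTypes.createStructField("field3", DataTypes.DoubleType, true),
                DataTypes.createStructField("field4", DataTypes.StringType, true),
                DataTypes.createStructField("field5", DataTypes.StringType, true)
        });

        // Chain .schema(schema) into the read, as the comment suggests.
        Dataset<Row> df = spark.read()
                .format("parquet")   // placeholder format
                .schema(schema)
                .load("myEndpoint"); // placeholder path from the question

        df.printSchema();
        spark.stop();
    }
}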
