Spark Row encoder: empty metadata

Question

I am using Spark from Java and creating a Dataset of Row from an RDD of Row.

I create the schema as follows:

Metadata meta = new MetadataBuilder().putString("type", "categorical").build();
StructField s = new StructField(name, IntegerType, true, meta);
StructType t = new StructType(new StructField[]{s});  
Encoder<Row> encoder = RowEncoder.apply(t);

I then use the encoder on the Dataset like this:

ds.flatMap((FlatMapFunction<Row, Row>) this::customFlatMapRow, encoder);

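For some reason, after I write the table and inspect the schema, the field's metadata is empty, even though I created and set it as shown above. Somehow it gets lost.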

- alexlipa

1 Answer

- werner · Answer 1

If you inspect the Dataset's ExpressionEncoder, you can retrieve the metadata.

Code:

Metadata meta = new MetadataBuilder().putString("type", "categorical").build();
StructField s = new StructField("col", IntegerType, true, meta);
StructType t = new StructType(new StructField[]{s});
Encoder<Row> encoder = RowEncoder.apply(t);

Dataset<Row> df = spark.createDataset(Arrays.asList(1, 2, 3), Encoders.INT()).toDF("col");
Dataset<Row> df2 = df.flatMap((FlatMapFunction<Row, Row>) r -> Collections.singleton(r).iterator(), encoder);
System.out.println(df2.exprEnc().schema().fields()[0].metadata());

This prints:

{"type":"categorical"}
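
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the snippet above. It assumes a local SparkSession and a Spark version that still exposes RowEncoder.apply, as the answer does. The class name MetadataCheck and the method run are hypothetical names chosen for illustration. The last step, re-attaching the metadata with Column.as(alias, metadata), is not part of the answer; it is one possible way to make the metadata part of the Dataset's own schema if schema() reports it as empty in your setup.

```java
import java.util.Arrays;
import java.util.Collections;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.MetadataBuilder;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

// Hypothetical helper class, not part of the original answer.
public class MetadataCheck {

    // Returns {type read via the encoder schema, type read after re-attaching the metadata}.
    static String[] run(SparkSession spark) {
        Metadata meta = new MetadataBuilder().putString("type", "categorical").build();
        StructField s = new StructField("col", DataTypes.IntegerType, true, meta);
        StructType t = new StructType(new StructField[]{s});
        // Assumption: RowEncoder.apply is available in the Spark version in use.
        Encoder<Row> encoder = RowEncoder.apply(t);

        Dataset<Row> df = spark.createDataset(Arrays.asList(1, 2, 3), Encoders.INT()).toDF("col");
        Dataset<Row> df2 = df.flatMap(
                (FlatMapFunction<Row, Row>) r -> Collections.singleton(r).iterator(), encoder);

        // Same lookup as in the answer, but by field name and metadata key.
        String fromEncoder = df2.exprEnc().schema().apply("col").metadata().getString("type");

        // Re-attach the metadata explicitly so it is carried by the Dataset's own schema.
        Dataset<Row> fixed = df2.select(col("col").as("col", meta));
        String fromSchema = fixed.schema().apply("col").metadata().getString("type");

        return new String[]{fromEncoder, fromSchema};
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]").appName("metadata-check").getOrCreate();
        try {
            String[] types = run(spark);
            System.out.println("encoder schema: " + types[0]);
            System.out.println("re-attached:    " + types[1]);
        } finally {
            spark.stop();
        }
    }
}
```

The select with an aliased column is used here instead of relying on the encoder, because an alias with explicit metadata is what Spark records in the resulting schema. Whether that is needed depends on how the metadata is read downstream.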