I am working with a Spark DataFrame that may be loaded from data in any of three schema versions:
// Original
{ "A": {"B": 1 } }
// Added "C"
{ "A": {"B": 1 }, "C": 2 }
// Added "A.D"
{ "A": {"B": 1, "D": 3 }, "C": 2 }
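I can handle the additional "C" by checking whether the schema contains the field "C" and, if it does not, adding a new column to the DataFrame. However, I cannot work out how to create a field for the sub-object.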
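import java.util.Arrays;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public void evolvingSchema() {
    // One sample record per schema version.
    String versionOne = "{ \"A\": {\"B\": 1 } }";
    String versionTwo = "{ \"A\": {\"B\": 1 }, \"C\": 2 }";
    String versionThree = "{ \"A\": {\"B\": 1, \"D\": 3 }, \"C\": 2 }";
    process(spark.getContext(), "1", versionOne);
    process(spark.getContext(), "2", versionTwo);
    process(spark.getContext(), "3", versionThree);
}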
private static void process(JavaSparkContext sc, String version, String data) {
    SQLContext sqlContext = new SQLContext(sc);
    DataFrame df = sqlContext.read().json(sc.parallelize(Arrays.asList(data)));
    // Add a null-valued "C" column when the loaded schema lacks that top-level field.
    if (!Arrays.asList(df.schema().fieldNames()).contains("C")) {
        df = df.withColumn("C", org.apache.spark.sql.functions.lit(null));
    }
    // Not sure what to put here: fieldNames() does not contain "A.D".
    try {
        df.select("C").collect();
    } catch (Exception e) {
        System.out.println("Failed to select C for " + version);
    }
    try {
        df.select("A.D").collect();
    } catch (Exception e) {
        System.out.println("Failed to select A.D for " + version);
    }
}
StructType? - Michael Lloyd Lee mlk
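Since fieldNames() lists only the top-level fields, detecting "A.D" means walking the schema one level at a time. The following is a minimal sketch of that walk, not Spark code: it uses plain nested Maps as a stand-in for the schema so it runs without a cluster. The class NestedFieldCheck and the helpers hasField and struct are hypothetical names introduced only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: a schema is modelled as a Map from field name to
// either a type name (String, for a leaf) or another Map (for a nested struct).
final class NestedFieldCheck {

    /** Returns true if every segment of a dotted path such as "A.D" exists. */
    public static boolean hasField(Map<String, Object> schema, String path) {
        Object current = schema;
        for (String part : path.split("\\.")) {
            // A leaf type has no children, so the remaining path cannot exist.
            if (!(current instanceof Map)) {
                return false;
            }
            Map<?, ?> level = (Map<?, ?>) current;
            if (!level.containsKey(part)) {
                return false;
            }
            current = level.get(part);
        }
        return true;
    }

    /** Builds one struct level from alternating field-name / type arguments. */
    public static Map<String, Object> struct(Object... nameTypePairs) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (int i = 0; i < nameTypePairs.length; i += 2) {
            fields.put((String) nameTypePairs[i], nameTypePairs[i + 1]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Schema versions one and three from the question.
        Map<String, Object> v1 = struct("A", struct("B", "long"));
        Map<String, Object> v3 =
                struct("A", struct("B", "long", "D", "long"), "C", "long");

        System.out.println(hasField(v1, "C"));   // false
        System.out.println(hasField(v1, "A.D")); // false
        System.out.println(hasField(v3, "C"));   // true
        System.out.println(hasField(v3, "A.D")); // true
    }
}
```

In Spark itself the same walk would presumably start from df.schema() and use StructType.fields(), StructField.name() and StructField.dataType(), descending whenever the data type is itself a StructType. That mapping is an assumption and is not tested here.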