I have defined the schema for my DataFrame in a JSON file, as shown below:
{
"table1":{
"fields":[
{"metadata":{}, "name":"first_name", "type":"string", "nullable":false},
{"metadata":{}, "name":"last_name", "type":"string", "nullable":false},
{"metadata":{}, "name":"subjects", "type":"array","items":{"type":["string", "string"]}, "nullable":false},
{"metadata":{}, "name":"marks", "type":"array","items":{"type":["integer", "integer"]}, "nullable":false},
{"metadata":{}, "name":"dept", "type":"string", "nullable":false}
]
}
}
Sample JSON data:
{
"table1": [
{
"first_name":"john",
"last_name":"doe",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
},
{
"first_name":"dan",
"last_name":"steyn",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
},
{
"first_name":"rose",
"last_name":"wayne",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
},
{
"first_name":"nat",
"last_name":"lee",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
},
{
"first_name":"jim",
"last_name":"lim",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
}
]
}
I want to build the corresponding Spark schema from this JSON file. Here is my code (reference: Create spark dataframe schema from json schema representation):
import json
from pyspark.sql.types import StructType

with open(schemaFile) as s:
    schema = json.load(s)["table1"]
source_schema = StructType.fromJson(schema)
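The code above works fine when the schema has no array columns. As soon as the schema contains an array column, however, it throws the following error:
"Could not parse datatype: array" ("Could not parse datatype: %s" % json_value)
"items":{"type":["string", "string"]} is missing a comma after it. I think it would be best to post your actual data, or to try loading the JSON in Spark and then exporting the schema that Spark creates. - abiratsis
"items":{"type":["string", "string"]} is not a valid definition. What exactly are you trying to express? Can you post some actual JSON data? - abiratsis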
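For illustration, one possible way around the error is to rewrite the array fields into the layout that Spark itself emits when a schema is exported as JSON, where an array column's "type" is a nested object with "elementType" and "containsNull" rather than the plain string "array". The sketch below is only that: the helper names `to_spark_field` and `to_spark_schema` are hypothetical, the `contains_null` default is an assumption, and collapsing ["string", "string"] to a single element type is a guess at what the file intends. It uses only the standard library, so it runs without Spark.

```python
import json


def to_spark_field(field, contains_null=True):
    """Rewrite one field of the custom schema into Spark's JSON schema layout.

    Hypothetical helper; contains_null is an assumed default, not taken
    from the original schema file.
    """
    converted = dict(field)
    if field.get("type") == "array":
        item_type = field["items"]["type"]
        # The file lists the element type twice (["string", "string"]);
        # assume all entries must agree and collapse them to one name.
        if isinstance(item_type, list):
            distinct = set(item_type)
            if len(distinct) != 1:
                raise ValueError(
                    f"mixed element types in {field['name']}: {item_type}"
                )
            item_type = distinct.pop()
        converted.pop("items")
        # Spark describes an array as a nested type object.
        converted["type"] = {
            "type": "array",
            "elementType": item_type,
            "containsNull": contains_null,
        }
    return converted


def to_spark_schema(table):
    """Convert the whole table definition (hypothetical helper)."""
    return {
        "type": "struct",
        "fields": [to_spark_field(f) for f in table["fields"]],
    }


# A shortened copy of the schema file from the question.
raw = json.loads("""
{
  "table1": {
    "fields": [
      {"metadata": {}, "name": "first_name", "type": "string", "nullable": false},
      {"metadata": {}, "name": "subjects", "type": "array",
       "items": {"type": ["string", "string"]}, "nullable": false},
      {"metadata": {}, "name": "marks", "type": "array",
       "items": {"type": ["integer", "integer"]}, "nullable": false}
    ]
  }
}
""")

spark_schema = to_spark_schema(raw["table1"])
print(json.dumps(spark_schema, indent=2))
```

With PySpark available, the resulting `spark_schema` dictionary is meant to be passed to `StructType.fromJson(spark_schema)` in place of the raw file contents. Whether `containsNull` should be true or false depends on the data and is left as a parameter here.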