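I want to create a Hive table from some nested JSON data and run queries against it. Is this possible?
I have uploaded the JSON file to S3 and launched an EMR instance, but I don't know what to type in the Hive console to turn the JSON file into a Hive table.
Does anyone have some example commands to get me started? I can't find anything useful on Google...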
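Actually, a JSON SerDe is not required. There is a great blog post on this (I have no affiliation with the author):
http://pkghosh.wordpress.com/2012/05/06/hive-plays-well-with-json/
It describes a strategy that uses the built-in function json_tuple to parse the JSON at query time rather than at table-definition time:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-json_tuple
So your table schema simply loads each line as a single string column, and each query extracts the JSON fields it needs. For example, this query from the blog post: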
SELECT b.blogID, c.email
FROM comments a
LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b AS blogID, contact
LATERAL VIEW json_tuple(b.contact, 'email', 'website') c AS email, website
WHERE b.blogID = '64FY4D0B28';
In my experience this approach has proven more reliable (I ran into various hard-to-diagnose problems with the JSON SerDes, especially with nested objects).
create external table impressions (
requestBeginTime string, requestEndTime string, hostname string
)
partitioned by (
dt string
)
row format
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties (
'paths'='requestBeginTime, requestEndTime, hostname'
)
location 's3://my.bucket/' ;
To use HCatalog's JsonSerDe, add the hcatalog-core .jar to Hive's auxpath and create your Hive table:
$ hive --auxpath /path/to/hcatalog-core.jar
hive (default)>
create table my_table(...)
ROW FORMAT SERDE
'org.apache.hcatalog.data.JsonSerDe'
...
;
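... the --auxpath option, but the ADD JAR command achieves the same result. - wingedsubmariner
Starting with Hive 0.12, hcatalog-core includes a JsonSerDe that can serialize and deserialize your JSON data. So all you need to do is create an external table, as in the following example: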
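CREATE EXTERNAL TABLE json_table (
username string,
tweet string,
-- BIGINT is Hive's 64-bit integer type (LONG is not a Hive type name);
-- the backticks guard against timestamp being treated as a keyword.
`timestamp` bigint)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION
'hdfs://data/some-folder-in-hdfs';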
The corresponding JSON data file should look like the following example:
{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }
{"username":"BlizzardCS","tweet":"Works as intended. Terran is IMBA.","timestamp": 1366154481 }
If your .json file is large, writing the schema by hand can be tedious. In that case, you can use this handy tool to generate the schema automatically.
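To give a rough idea of what a schema generator of this kind does, here is a minimal Python sketch that maps one sample JSON object to Hive column definitions. It is not the tool mentioned above: the function names hive_type and hive_columns and the type mapping (integers to BIGINT, floats to DOUBLE, empty arrays and nulls to STRING) are assumptions made for illustration, and a single sample object may not represent every record in the file.

```python
import json


def hive_type(value):
    """Map a parsed JSON value to a Hive type string (illustrative mapping)."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, dict) and value:
        fields = ",".join(f"`{k}`:{hive_type(v)}" for k, v in value.items())
        return f"STRUCT<{fields}>"
    if isinstance(value, list):
        # An empty array carries no element type; STRING is only a guess.
        element = hive_type(value[0]) if value else "STRING"
        return f"ARRAY<{element}>"
    # Strings, nulls and empty objects all fall back to STRING.
    return "STRING"


def hive_columns(sample):
    """Build the column list of a CREATE TABLE statement from one JSON object."""
    return ",\n".join(f"  `{k}` {hive_type(v)}" for k, v in sample.items())


sample = json.loads(
    '{"username": "miguno", "timestamp": 1366150681, '
    '"severity": {"score": "0", "description": "Informational"}}'
)
print(hive_columns(sample))
```

The printed column list still has to be wrapped in a CREATE EXTERNAL TABLE statement with the SerDe and LOCATION clauses, and the guessed types should be reviewed by hand.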
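Hive now has JSON processing built in.
Hive 4.0.0 and later:
CREATE TABLE ... STORED AS JSONFILE
Each JSON object must be flattened onto a single line (newline characters are not supported). The objects are not part of a formal JSON array.
{"firstName":"John","lastName":"Smith","Age":21}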
{"firstName":"Jane","lastName":"Harding","Age":18}
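Because each object has to sit on its own line, a pretty-printed file needs to be compacted before Hive can read it. The following Python sketch shows one way to do that; the function name to_json_lines is hypothetical, and it assumes the input is small enough to parse in memory and is either a single object or a top-level array of objects.

```python
import json


def to_json_lines(text):
    """Convert pretty-printed JSON (one object or an array of objects)
    into newline-delimited JSON, one compact object per line."""
    parsed = json.loads(text)
    records = parsed if isinstance(parsed, list) else [parsed]
    # separators=(",", ":") drops the optional whitespace; json.dumps
    # escapes newlines inside string values, so each record stays on one line.
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)


pretty = """
[
  {"firstName": "John", "lastName": "Smith", "Age": 21},
  {"firstName": "Jane", "lastName": "Harding", "Age": 18}
]
"""
print(to_json_lines(pretty))
```

The result can be written to a file and uploaded to the table's location; the array brackets are dropped on purpose, since the rows must not be wrapped in a JSON array.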
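Tables over JSON files can use either org.openx.data.jsonserde.JsonSerDe or org.apache.hive.hcatalog.data.JsonSerDe as the SerDe.
org.openx.data.jsonserde.JsonSerDe
The OpenX JSON SerDe is similar to the native Apache JSON SerDe, but it offers several optional properties such as "ignore.malformed.json" and "case.insensitive". In my opinion, it usually works better than the native one when dealing with nested JSON files.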
Take this complex JSON file as an example:
{
"schemaVersion": "1.0",
"id": "07c1687a0fd34ebf8a42e8a8627321dc",
"accountId": "123456677",
"partition": "aws",
"region": "us-west-2",
"severity": {
"score": "0",
"description": "Informational"
},
"createdAt": "2021-02-27T18:57:07Z",
"resourcesAffected": {
"s3Bucket": {
"arn": "arn:aws:s3:::bucket-sample",
"name": "bucket-sample",
"createdAt": "2020-08-09T07:24:55Z",
"owner": {
"displayName": "account-name",
"id": "919a30c2f56c0b220c32e9234jnkj435n6jk4nk"
},
"tags": [],
"defaultServerSideEncryption": {
"encryptionType": "AES256"
},
"publicAccess": {
"permissionConfiguration": {
"bucketLevelPermissions": {
"accessControlList": {
"allowsPublicReadAccess": false,
"allowsPublicWriteAccess": false
},
"bucketPolicy": {
"allowsPublicReadAccess": true,
"allowsPublicWriteAccess": false
},
"blockPublicAccess": {
"ignorePublicAcls": false,
"restrictPublicBuckets": false,
"blockPublicAcls": false,
"blockPublicPolicy": false
}
},
"accountLevelPermissions": {
"blockPublicAccess": {
"ignorePublicAcls": false,
"restrictPublicBuckets": false,
"blockPublicAcls": false,
"blockPublicPolicy": false
}
}
},
"effectivePermission": "PUBLIC"
}
},
"s3Object": {
"bucketArn": "arn:aws:s3:::bucket-sample",
"key": "2021/01/17191133/Camping-Checklist-Google-Docs.pdf",
"path": "bucket-sample/2021/01/17191133/Camping-Checklist-Google-Docs.pdf",
"extension": "pdf",
"lastModified": "2021-01-17T22:11:34Z",
"eTag": "e8d990704042d2e1b7bb504fb5868095",
"versionId": "isqHLkSsQUMbbULNT2nMDneMG0zqitbD",
"serverSideEncryption": {
"encryptionType": "AES256"
},
"size": "150532",
"storageClass": "STANDARD",
"tags": [],
"publicAccess": true
}
},
"category": "CLASSIFICATION",
"classificationDetails": {
"jobArn": "arn:aws:macie2:us-west-2:123412341341:classification-job/d6cf41ccc7ea8daf3bd53ddcb86a2da5",
"result": {
"status": {
"code": "COMPLETE"
},
"sizeClassified": "150532",
"mimeType": "application/pdf",
"sensitiveData": []
},
"detailedResultsLocation": "s3://bucket-macie/AWSLogs/123412341341/Macie/us-west-2/d6cf41ccc7ea8daf3bd53ddcb86a2da5/123412341341/50de3137-9806-3e43-9b6e-a6158fdb0e3b.jsonl.gz",
"jobId": "d6cf41ccc7ea8daf3bd53ddcb86a2da5"
}
}
CREATE EXTERNAL TABLE IF NOT EXISTS `macie`.`macie_bucket` (
`schemaVersion` STRING,
`id` STRING,
`accountId` STRING,
`partition` STRING,
`region` STRING,
`severity` STRUCT<
`score`:STRING,
`description`:STRING>,
`createdAt` STRING,
`resourcesAffected` STRUCT<
`s3Bucket`:STRUCT<
`arn`:STRING,
`name`:STRING,
`createdAt`:STRING,
`owner`:STRUCT<
`displayName`:STRING,
`id`:STRING>,
`defaultServerSideEncryption`:STRUCT<
`encryptionType`:STRING>,
`publicAccess`:STRUCT<
`permissionConfiguration`:STRUCT<
`bucketLevelPermissions`:STRUCT<
`accessControlList`:STRUCT<
`allowsPublicReadAccess`:BOOLEAN,
`allowsPublicWriteAccess`:BOOLEAN>,
`bucketPolicy`:STRUCT<
`allowsPublicReadAccess`:BOOLEAN,
`allowsPublicWriteAccess`:BOOLEAN>,
`blockPublicAccess`:STRUCT<
`ignorePublicAcls`:BOOLEAN,
`restrictPublicBuckets`:BOOLEAN,
`blockPublicAcls`:BOOLEAN,
`blockPublicPolicy`:BOOLEAN>>,
`accountLevelPermissions`:STRUCT<
`blockPublicAccess`:STRUCT<
`ignorePublicAcls`:BOOLEAN,
`restrictPublicBuckets`:BOOLEAN,
`blockPublicAcls`:BOOLEAN,
`blockPublicPolicy`:BOOLEAN>>>,
`effectivePermission`:STRING>>,
`s3Object`:STRUCT<
`bucketArn`:STRING,
`key`:STRING,
`path`:STRING,
`extension`:STRING,
`lastModified`:STRING,
`eTag`:STRING,
`versionId`:STRING,
`serverSideEncryption`:STRUCT<
`encryptionType`:STRING>,
`size`:STRING,
`storageClass`:STRING,
`publicAccess`:BOOLEAN>>,
`category` STRING,
`classificationDetails` STRUCT<
`jobArn`:STRING,
`result`:STRUCT<
`status`:STRUCT<
`code`:STRING>,
`sizeClassified`:STRING,
`mimeType`:STRING>,
`detailedResultsLocation`:STRING,
`jobId`:STRING>)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
LOCATION
's3://awsexamplebucket1-logs/AWSLogs/';
LOAD DATA LOCAL INPATH 's3://my.bucket/data.json' OVERWRITE INTO TABLE Awards;
but that does not work either. - nickponline