How do I split a PySpark DataFrame into multiple JSON files based on the values of a specific column?

I have JSON in the following format:
{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
{"year":"2020", "id":"4", "fruit": "Apple","cost": "400" }
{"year":"2020", "id":"5", "fruit": "Mango", "cost": "500"}
{"year":"2020", "id":"6", "fruit": "Kiwi", "cost": "600"}

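Its type is pyspark.sql.dataframe.DataFrame. How can I use PySpark to split this JSON into multiple JSON files, one per fruit, and save them in a directory named after the year? For example:
Directory: path.../2020/<all split JSON files>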

Apple.json

{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"4", "fruit": "Apple","cost": "400" }

Kiwi.json

{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"6", "fruit": "Kiwi", "cost": "600"}

Mango.json

{"year":"2020", "id":"5", "fruit": "Mango", "cost": "500"}

Cherry.json

{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}

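If I encounter a different year, how can I write the files in the same way, e.g. path.../2021/<all split JSON files>?
At first, I tried finding all the unique fruits and building a list of them. I then tried creating one DataFrame per fruit, pushing the matching JSON records into it, and converting each DataFrame to JSON. That felt inefficient. I also looked at this link, but the problem is that it builds a dictionary of key-value pairs, which is slightly different.
I also read about PySpark's groupBy method. It seems to make sense, since I could groupBy() the fruit value and then split the JSON files, but I feel like I am missing something.
1 Answer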


Take the following JSON as an example:

{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
{"year":"2021", "id":"10", "fruit": "Pear","cost": "1000" }
{"year":"2021", "id":"11", "fruit": "Mango", "cost": "1100"}
{"year":"2021", "id":"12", "fruit": "Banana", "cost": "1200"}

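You can partition the data by year and fruit using partitionBy. Note that I created copies of the year and fruit columns, because the columns you partition by are removed from the records when the data is written to disk.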

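# Read the line-delimited JSON into a DataFrame (assumes an existing SparkSession named spark).
df = spark.read.json("./ex.json")
# Copy the partition columns so the values also stay inside the written records.
# Caveat: column names are case-insensitive by default, so "Year" may replace "year"; if so, use a name that differs by more than case.
df = df.withColumn("Year", df["year"])
df = df.withColumn("Fruit", df["fruit"])
# Write one sub-directory per Year/Fruit pair under "result".
df.write.partitionBy("Year", "Fruit").json("result")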

This produces a folder named result with the following structure.

|-- result
|   |-- Year=2020
|   |   |-- Fruit=Apple
|   |   |   |-- part0000-dcea0683...json
|   |   |-- Fruit=Cherry
|   |   |   |-- part0000-dcea0683...json
|   |   |-- Fruit=Kiwi
|   |   |   |-- part0000-dcea0683...json
|   |-- Year=2021
|   |   |-- Fruit=Banana
|   |   |   |-- part0000-dcea0683...json
|   |   |-- Fruit=Mango
|   |   |   |-- part0000-dcea0683...json
|   |   |-- Fruit=Pear
|   |   |   |-- part0000-dcea0683...json
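Spark picks the part file names itself, so the layout above is not quite the path.../2020/Apple.json layout the question asks for. If the output lives on a local filesystem, one option is a small post-processing step. The following is only a sketch under that assumption: flatten_partitions is a hypothetical helper name, it expects exactly two levels of key=value directories, and it concatenates all part files of a partition into a single file.

```python
import shutil
from pathlib import Path


def flatten_partitions(result_dir, out_dir):
    """Copy <result_dir>/<key>=<year>/<key>=<fruit>/part*.json
    to <out_dir>/<year>/<fruit>.json and return the files written."""
    result_dir, out_dir = Path(result_dir), Path(out_dir)
    written = []
    # Two levels of "key=value" directories, as produced by partitionBy.
    for fruit_dir in sorted(result_dir.glob("*=*/*=*")):
        if not fruit_dir.is_dir():
            continue
        # Keep only the value part of each "key=value" directory name.
        year = fruit_dir.parent.name.split("=", 1)[1]
        fruit = fruit_dir.name.split("=", 1)[1]
        target = out_dir / year / f"{fruit}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        # A partition may hold several part files; concatenate them in name order.
        with target.open("wb") as dst:
            for part in sorted(fruit_dir.glob("part*.json")):
                with part.open("rb") as src:
                    shutil.copyfileobj(src, dst)
        written.append(target)
    return written
```

For example, flatten_partitions("result", "out") would write out/2020/Apple.json, out/2021/Pear.json and so on, where "out" is a placeholder for the target directory. On HDFS or object storage the same idea applies, but the copy would have to go through that storage system's own API instead of pathlib.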

Thank you very much. - Developer_101
