How to read a large JSON file in pandas?

Question

14

My code is: data_review=pd.read_json('review.json') and my review data looks like this:

{
    // string, 22 character unique review id
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // integer, star rating
    "stars": 4,

    // string, date formatted YYYY-MM-DD
    "date": "2016-03-09",

    // string, the review itself
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",

    // integer, number of useful votes received
    "useful": 0,

    // integer, number of funny votes received
    "funny": 0,

    // integer, number of cool votes received
    "cool": 0
}

But I got the following error:

    333             fh, handles = _get_handle(filepath_or_buffer, 'r',
    334                                       encoding=encoding)
--> 335             json = fh.read()
    336             fh.close()
    337         else:

OSError: [Errno 22] Invalid argument

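My JSON file does not contain any comments, and it is 3.8 GB! I just downloaded the file from here to practice.

When I use the following code, it throws the same error: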

import json
with open('review.json') as json_file:
    data = json.load(json_file)

- ileadall42

1

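There is a problem with your path/file argument. Make sure the file exists in the folder you are running Python from. It may help to give more details about how, and from where, you call the script. - sascha

@sascha Yes, I checked that carefully, but it didn't work. - ileadall42

OK... we need more information! - sascha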

1

@LukasAnsteeg That is most likely pandas' read_json code. - sascha

@LukasAnsteeg Thanks a lot. My JSON file contains no comments, and the line that throws the error is line 355 of the read_json code. - ileadall42

5 Answers

11

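Perhaps the file you are reading contains multiple JSON objects rather than the single JSON object or array that json.load(json_file) and pd.read_json('review.json') expect. Both calls are meant for a file that holds exactly one JSON document.

From what I can see of the Yelp dataset, your file probably looks something like this: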

{"review_id":"xxxxx","user_id":"xxxxx","business_id":"xxxx","stars":5,"date":"xxx-xx-xx","text":"xyxyxyxyxx","useful":0,"funny":0,"cool":0}
{"review_id":"yyyy","user_id":"yyyyy","business_id":"yyyyy","stars":3,"date":"yyyy-yy-yy","text":"ababababab","useful":0,"funny":0,"cool":0}
....    
....

and so on.

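So the important thing to realize is that this is not a single JSON document but a file containing multiple JSON objects, one per line.

To read this data into a pandas data frame, the following solution should work: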

import json

import pandas as pd

with open('review.json') as json_file:
    data = json_file.readlines()
    # The line below may take at least 8-10 minutes for 4-5 million rows.
    # It converts every string in the list into a parsed JSON object (a dict).
    data = list(map(json.loads, data))

df = pd.DataFrame(data)

Given that the data is quite large, I think your machine will take a considerable amount of time to load it into a data frame.
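To make the distinction concrete, here is a minimal sketch using only the standard library. The two-record string is a made-up stand-in for the real file, and the variable names are illustrative. It shows json.load rejecting line-delimited content while parsing each line separately succeeds.

```python
import io
import json

# Hypothetical sample: two JSON objects, one per line, as in the layout above.
text = '{"review_id": "a", "stars": 5}\n{"review_id": "b", "stars": 3}\n'

# Treating the whole content as one JSON document fails, because the parser
# finds extra data after the first object.
try:
    json.load(io.StringIO(text))
    parsed_as_single_document = True
except json.JSONDecodeError:
    parsed_as_single_document = False

# Parsing line by line works, since every line is a complete JSON object.
records = [json.loads(line) for line in io.StringIO(text) if line.strip()]

print(parsed_as_single_document, len(records))
```

Iterating over the file object instead of calling readlines() also avoids holding every raw line in memory alongside the parsed objects.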

- Shaurya Mittal

2

Is there a way in pandas to process a large JSON file in which every line is a JSON object, without using a for loop? - devssh

1

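@devssh, see the answer below! Just pass lines=True and chunksize=<something> to pandas.read_json. You will still need to loop over the chunks yielded by the returned JsonReader, but you have to do something along these lines to avoid loading the whole file into memory. Some details: http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#line-delimited-json - Chris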

5

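Passing lines=True and chunksize=X creates a reader that yields a fixed number of lines at a time.

You then need a loop to go through each chunk.

Here is a piece of code to give you the idea: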

import pandas as pd
import json
chunks = pd.read_json('../input/data.json', lines=True, chunksize = 10000)
for chunk in chunks:
    print(chunk)
    break

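chunksize splits the file into multiple chunks according to the length of your JSON, measured in lines.

For example, if my JSON has 100,000 lines containing X objects and I use chunksize = 10000, I get 10 chunks.

In the code above I added a break so that only the first chunk is printed. If you remove it, you get the 10 chunks one after another.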
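As a rough sketch of doing real work per chunk instead of printing, the loop can accumulate a small summary so that only one chunk is held in memory at a time. The in-memory sample and the names counts and n_chunks are assumptions for illustration; with a real file you would pass its path instead of the StringIO object.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for a large line-delimited file: 10 tiny records.
sample = "\n".join('{"stars": %d, "useful": 0}' % (i % 5 + 1) for i in range(10))

counts = Counter()  # running tally of star ratings across all chunks
n_chunks = 0

# chunksize=4 yields DataFrames of at most 4 rows each.
reader = pd.read_json(io.StringIO(sample), lines=True, chunksize=4)
for chunk in reader:
    n_chunks += 1
    counts.update(chunk["stars"].tolist())

print(n_chunks, dict(counts))
```

Only the tally survives each iteration, so peak memory depends on chunksize rather than on the size of the file.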

- Max

4

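I am improving on Max's answer so that a large JSON file can be loaded into a data frame without running into memory errors:

You can use the following code without running into any problems.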

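import pandas as pd

# Read the line-delimited file lazily, 10,000 lines per chunk.
chunks = pd.read_json('/content/gdrive/My Drive/yelp/yelp_academic_dataset_review.json', lines=True, chunksize=10000)
# Concatenate once; calling pd.concat inside a loop re-copies all rows every iteration.
reviews = pd.concat(chunks)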
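Note that the concatenated frame still has to fit in memory. One possible way to shrink it, sketched here under the assumption that only some columns are needed, is to trim each chunk before concatenating. The sample data and the choice of columns are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for the review file: 6 records with a bulky text field.
sample = "\n".join(
    '{"review_id": "r%d", "stars": %d, "text": "long review text"}' % (i, i % 5 + 1)
    for i in range(6)
)

reader = pd.read_json(io.StringIO(sample), lines=True, chunksize=2)

# Keep only the columns of interest from each chunk, dropping the large text
# column before anything is accumulated.
kept = [chunk[["review_id", "stars"]] for chunk in reader]

# A single concat at the end builds the final frame.
reviews = pd.concat(kept, ignore_index=True)
print(reviews.shape)
```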

- Janani Sankarasubramanian

0

If your JSON file contains multiple objects rather than a single one, the following approach should work:

import json

data = []
# One JSON object per line; the with block closes the file when done.
with open('sample.json', 'r') as json_file:
    for line in json_file:
        data.append(json.loads(line))

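Note the difference between json.load and json.loads. json.loads() expects a (valid) JSON string, for example {"foo": "bar"}. So if your JSON file looks like the one @Mant1c0r3 mentioned, using json.loads on each line is appropriate.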
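A small illustration of that difference, using an in-memory file object in place of a real file (the variable names are only for demonstration):

```python
import io
import json

# json.load reads from a file-like object that holds one JSON document.
from_file = json.load(io.StringIO('{"foo": "bar"}'))

# json.loads parses a JSON document that is already in a string.
from_string = json.loads('{"foo": "bar"}')

print(from_file == from_string)
```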

- mOna

- Mant1c0r3 · Accepted Answer

If you don't want to use a for loop, the following code should do the trick:

import pandas as pd

df = pd.read_json("foo.json", lines=True)

This handles the case where your JSON file looks like the following:

{"foo": "bar"}
{"foo": "baz"}
{"foo": "qux"}

It converts the file into a DataFrame with a single column, foo, and three rows.

You can read more in the pandas documentation.
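For a quick check without creating foo.json on disk, the same call can be sketched against an in-memory buffer. The StringIO wrapper is an assumption made for this demonstration; the answer itself reads from a file path.

```python
import io

import pandas as pd

# The three-line sample from above, held in memory instead of in foo.json.
sample = '{"foo": "bar"}\n{"foo": "baz"}\n{"foo": "qux"}\n'

# lines=True tells pandas to parse one JSON object per line.
df = pd.read_json(io.StringIO(sample), lines=True)
print(df)
```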