Extracting JSON strings from a large text file with Python

Question


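I'm working on a project that needs a large dataset. I found one that is big enough (the editions dump at https://openlibrary.org/developers/dumps, about 5 GB), and it is already formatted like this: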

/type/edition   /books/OL10000135M  4   2010-04-24T17:54:01.503315  {"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4} 
/type/edition   /books/OL10000179M  4   2010-04-24T17:54:01.503315  {"publishers": ["Stationery Office"], "physical_format": "Hardcover", "subtitle": "26 January - 4 February 1998", "title": "Parliamentary Debates, House of Lords, 1997-98", "isbn_10": ["0107805855"], "identifiers": {"goodreads": ["2862283"]}, "isbn_13": ["9780107805852"], "edition_name": "5th edition", "languages": [{"key": "/languages/eng"}], "number_of_pages": 124, "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "latest_revision": 4, "key": "/books/OL10000179M", "authors": [{"key": "/authors/OL2645811A"}], "publish_date": "January 1999", "works": [{"key": "/works/OL7925994W"}], "type": {"key": "/type/edition"}, "subjects": ["Bibliographies, catalogues, discographies", "POLITICS & GOVERNMENT", "Reference works", "Bibliographies & Indexes", "Reference"], "revision": 4}
 etc...

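I want to extract the JSON part (the fifth field).

I tried using str.replace() on a 50-line subset of the large file, but it has been tricky. I thought this should work, but it did nothing (nothing was changed or replaced):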

 with fileinput.input(files=("testData.txt"), inplace=True, backup='.bak') as file:
    for line in file:
            print(line.replace(".*({.*})$", "\1"), end="")

I also tried parsing it column by column (with a regex that identifies each column), but I ran into a problem that confuses me. This code:

 with fileinput.input(files=("testData.txt"), inplace=True, backup='.bak') as file:
    for line in file:
            print(line.replace("/type/edition\t/books/", "WORK PLZ"), end="")

yields

 WORK PLZOL10000135M    4   2010-04-24T17:54:01.503315  {"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4}
 WORK PLZOL10000179M    4   2010-04-24T17:54:01.503315  {"publishers": ["Stationery Office"], "physical_format": "Hardcover", "subtitle": "26 January - 4 February 1998", "title": "Parliamentary Debates, House of Lords, 1997-98", "isbn_10": ["0107805855"], "identifiers": {"goodreads": ["2862283"]}, "isbn_13": ["9780107805852"], "edition_name": "5th edition", "languages": [{"key": "/languages/eng"}], "number_of_pages": 124, "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "latest_revision": 4, "key": "/books/OL10000179M", "authors": [{"key": "/authors/OL2645811A"}], "publish_date": "January 1999", "works": [{"key": "/works/OL7925994W"}], "type": {"key": "/type/edition"}, "subjects": ["Bibliographies, catalogues, discographies", "POLITICS & GOVERNMENT", "Reference works", "Bibliographies & Indexes", "Reference"], "revision": 4}

But

 with fileinput.input(files=("testData.txt"), inplace=True, backup='.bak') as file:
    for line in file:
            print(line.replace("/type/edition\t/books/\w+", "WORK PLZ"), end="")

does nothing. It seems that \w+ fails to match the alphanumeric string after /books/.

Is there a mistake in my regex? Is there a better way to do this?

- Tom


str methods do not understand regular expressions; you need to use the re module. - Alex Hall

Are the columns tab-delimited? - user3483203

1 Answer


- Jean-François Fabre · Accepted Answer

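(As noted in the comments) str.replace does not understand regular expressions. That explains why your code failed.

I would split the string at the first { (assuming no { character appears before the JSON string), then parse the result as JSON: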
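To make the difference concrete, here is a minimal sketch contrasting str.replace with re.sub on one shortened line. The sample line and the variable names are made up for illustration, and the pattern is a slightly adjusted version of the one in the question (the braces are escaped and trailing whitespace is allowed), not necessarily what the asker intended. Note also that "\1" in a normal string literal is the character with code 1, not a backreference, so a raw string is used for the replacement.

```python
import re

# Hypothetical, shortened line in the same tab-separated layout as the dump.
line = '/type/edition\t/books/OL10000135M\t4\t2010-04-24T17:54:01.503315\t{"title": "Example", "revision": 4}\n'

# str.replace searches for the literal text ".*({.*})$", which never occurs
# in the line, so the line comes back unchanged.
unchanged = line.replace(".*({.*})$", "\\1")

# re.sub treats the first argument as a regular expression. The raw strings
# keep the backslashes intact, so \1 refers to the captured group.
json_part = re.sub(r"^.*?(\{.*\})\s*$", r"\1", line)

print(unchanged == line)
print(json_part)
```

This works for a single line, but a regex is more machinery than the task needs, which is why the approaches that follow avoid it.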

import json
with open("test.txt") as f:
    for line in f:
        json_expr = "{"+line.partition("{")[2]
        the_dict = json.loads(json_expr)

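Alternatively, split on whitespace, but pass the maxsplit argument to limit the number of splits, and take the last element (the JSON data). This works because the JSON expression is the last field.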

json_expr = line.split(None,4)[-1]
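Putting the maxsplit idea together with json.loads, a rough sketch for streaming through a large file one line at a time might look like the following. The function name iter_json_records and the two sample lines are hypothetical, and the code assumes every non-blank line has exactly four whitespace-free fields before the JSON, as in the sample data from the question.

```python
import io
import json


def iter_json_records(lines):
    """Yield the parsed JSON object from the fifth field of each line."""
    for line in lines:
        # Split on whitespace at most 4 times, so spaces inside the JSON
        # text stay in the last piece.
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        yield json.loads(fields[4])


# Stand-in for the real file; with the dump, pass an open file object instead.
sample = io.StringIO(
    '/type/edition\t/books/OL1M\t4\t2010-04-24T17:54:01.503315\t{"title": "First Book", "revision": 4}\n'
    '\n'
    '/type/edition\t/books/OL2M\t4\t2010-04-24T17:54:01.503315\t{"title": "Second Book", "revision": 4}\n'
)

records = list(iter_json_records(sample))
for record in records:
    print(record["title"])
```

Because the function is a generator and reads one line at a time, it never holds the whole 5 GB file in memory; for the real dump you would iterate over the records rather than collect them into a list. If the columns are known to be tab-separated, line.split("\t", 4) is a stricter alternative.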