How do I remove duplicates from multiple JSON files?

3

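I have multiple JSON files containing countries and capitals. How do I remove the duplicate key-value pairs from all of the files?

Here is one of the JSON files: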

{
    "data": [
    {
        "Capital": "Berlin",
        "Country": "Germany"
    },
    {
        "Capital": "New Delhi",
        "Country": "India"
    },
    {
        "Capital": "Canberra",
        "Country": "Australia"
    },
    {
        "Capital": "Beijing.",
        "Country": "China"
    },
    {
        "Capital": "Tokyo",
        "Country": "Japan"
    },
    {
        "Capital": "Tokyo",
        "Country": "Japan"
    },
    {
        "Capital": "Berlin",
        "Country": "Germany"
    },
    {
        "Capital": "Moscow",
        "Country": "Russia"
    },
    {
        "Capital": "New Delhi",
        "Country": "India"
    },
    {
        "Capital": "Ottawa",
        "Country": "Canada"
    }
    ]

}

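There are many JSON files that contain duplicates. How can I remove the duplicates and keep only the first occurrence of each item? I tried the following, but it did not work.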

dupes = []
for f in json_files:
    with open(f) as json_data:
        nations = json.load(json_data)['data']
        #takes care of duplicates and stores it in dupes
        dupes.append(x for x in nations if x['Capital'] in seen or seen.add(x['Capital']))
        nations = [x for x in nations if x not in dupes] #want to keep the first occurance of the item present in dupes

    with open(f, 'w') as json_data:
        json.dump({'data': nations}, json_data)
3 Answers

2
You may not be able to use a cool list comprehension, but a plain loop should work.
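used_nations = set()  # capitals seen so far ({} would create a dict, which has no add())
unique_nations = []
for nation in nations:
    if nation['Capital'] not in used_nations:
        used_nations.add(nation['Capital'])
        # build a new list instead of calling nations.remove() mid-iteration
        unique_nations.append(nation)
nations = unique_nations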
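
For what it's worth, a list comprehension can be made to work here as well. The following is only a sketch with made-up sample data; it relies on `set.add()` returning `None`, which some readers find too clever, so the plain loop may still be the clearer choice.

```python
# Hypothetical sample data in the same shape as the question's 'data' list.
nations = [
    {'Capital': 'Tokyo', 'Country': 'Japan'},
    {'Capital': 'Berlin', 'Country': 'Germany'},
    {'Capital': 'Tokyo', 'Country': 'Japan'},
]

seen = set()
# set.add() returns None, so "not seen.add(...)" is always True; it is only
# evaluated (and the capital only recorded) when the capital is new.
unique = [
    n for n in nations
    if n['Capital'] not in seen and not seen.add(n['Capital'])
]
print(unique)
```

The original `nations` list is left untouched, and `unique` keeps the first entry for each capital in its original order.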

@nutmeg64 I'm sure someone will make python.js before long ;) - jpyams

1
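List comprehensions are great! But... once if statements are involved, they can make the code harder to read.
This is by no means a hard rule. On the contrary, I encourage you to use list comprehensions often. In this particular case, though, a more spread-out solution is easier to read.
Here is my suggestion: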
import json

seen = []
result = []

with open('data.json') as json_data:
    nations = json.load(json_data)['data']
    # keep only the first item seen for each capital
    for item in nations:
        if item['Capital'] not in seen:
            seen.append(item['Capital'])
            result.append(item)

with open('data.no_dup.json', 'w') as json_data:
    json.dump({'data': result}, json_data)

Tested and working on Python 3.5.2.

Note that I removed your outer loop for simplicity.
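
The outer loop that was dropped above can be put back along these lines. This is a minimal sketch, not a tested drop-in: the function name `dedupe_json_files` is made up for illustration, it assumes every file has the `{"data": [...]}` layout from the question, and it overwrites each file in place, so work on copies first.

```python
import json


def dedupe_json_files(paths):
    """Rewrite each file in place, keeping the first entry per capital.

    The seen set is shared across files, so a capital that already
    appeared in an earlier file is dropped from later files as well.
    """
    seen = set()
    for path in paths:
        with open(path) as fp:
            nations = json.load(fp)['data']
        unique = []
        for item in nations:
            if item['Capital'] not in seen:
                seen.add(item['Capital'])
                unique.append(item)
        with open(path, 'w') as fp:
            json.dump({'data': unique}, fp, indent=4)
```

Calling `dedupe_json_files(['a.json', 'b.json'])` (hypothetical file names) would process the files in the order given. If duplicates should only be removed within each file rather than across all of them, move `seen = set()` inside the `for path in paths` loop.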


Your code works great for what I wanted to achieve. Thanks! - Souvik Ray

0
Here is sample code showing how to do this with your JSON.
import json

files = ['countries.json']

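for f in files:
    with open(f, 'r') as fp:
        nations = json.load(fp)
    # Turn each dict into a hashable tuple so a set can drop exact
    # duplicates (the set does not preserve the original order)
    result = [dict(tupleized) for tupleized in set(tuple(item.items())
              for item in nations['data'])]
# print() with a single argument works on both Python 2 and Python 3
print(result)
print(len(result))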

Output:

[{u'Country': u'Russia', u'Capital': u'Moscow'}, {u'Country': u'Japan', u'Capital': u'Tokyo'}, {u'Country': u'Canada', u'Capital': u'Ottawa'}, {u'Country': u'India', u'Capital': u'New Delhi'}, {u'Country': u'Germany', u'Capital': u'Berlin'}, {u'Country': u'Australia', u'Capital': u'Canberra'}, {u'Country': u'China', u'Capital': u'Beijing.'}]
7

Note that this only filters out duplicate key-value pairs, so {'Country': 'Russia', 'Capital': 'Moscow'} and {'Country': 'Zaire', 'Capital': 'Moscow'} would both appear in result. - jpyams
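
To make the distinction in the comment above concrete, here is a small sketch of whole-record deduplication that also keeps the first occurrence and the original order. The helper name `dedupe_records` and the sample data are made up for illustration, and the approach assumes every value in a record is hashable (true for the strings in the question).

```python
def dedupe_records(records):
    """Drop records whose full set of key-value pairs was already seen."""
    seen = set()
    unique = []
    for record in records:
        # Sort the items so key order does not affect the comparison.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {'Country': 'Russia', 'Capital': 'Moscow'},
    {'Country': 'Zaire', 'Capital': 'Moscow'},
    {'Capital': 'Moscow', 'Country': 'Russia'},  # same pairs as the first
]
result = dedupe_records(records)
print(result)
```

Both Moscow records with different countries survive, because they differ as whole records; only the third entry, which repeats the first one's pairs, is dropped. Keying on `record['Capital']` alone, as the other answers do, would keep just the first of the three.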

Content provided by Stack Overflow.