A better way to filter a list of dictionaries in Python

3

I have a list of dictionaries structured like the following example:

log = [{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time1'},  
       {'user_id': 'id2', 'action': 'action2', 'timestamp': 'time2'},
       ...]

The list is sorted by timestamp.

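I want to remove consecutive identical actions performed by the same user, keeping only the first one. For example, given the following list: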

log = [{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time1'},
       {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time2'},
       {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time3'},
       {'user_id': 'id2', 'action': 'action2', 'timestamp': 'time4'},
       {'user_id': 'id3', 'action': 'action2', 'timestamp': 'time5'},
       {'user_id': 'id3', 'action': 'action2', 'timestamp': 'time6'},
       {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time7'},
       {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time8'}]

I would like to get the following list as the result:
log = [{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time1'},
       {'user_id': 'id2', 'action': 'action2', 'timestamp': 'time4'},
       {'user_id': 'id3', 'action': 'action2', 'timestamp': 'time5'},
       {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time7'}]

This is how I currently do it:

def merge_actions(log):
    merged_log = []
    merged_log.append(log[0])
    for i in range(1, len(log)):
        if log[i]['user_id'] == log[i-1]['user_id']:
            if log[i]['action'] == log[i-1]['action']:
                continue
        merged_log.append(log[i])
    return merged_log

Is there a better way?

4 Answers

6
If you use itertools.groupby and group by both 'user_id' and 'action', you can take the first element from each group.
>>> [next(group) for key, group in itertools.groupby(log, key = lambda i: (i['user_id'], i['action']))]
[{'timestamp': 'time1', 'action': 'action1', 'user_id': 'id1'},
 {'timestamp': 'time4', 'action': 'action2', 'user_id': 'id2'},
 {'timestamp': 'time5', 'action': 'action2', 'user_id': 'id3'},
 {'timestamp': 'time7', 'action': 'action1', 'user_id': 'id1'}]

2
key=itemgetter('user_id', 'action') does the same job without needing a lambda. - Padraic Cunningham
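
The itemgetter variant suggested in the comment above might look like the following minimal sketch. The function name dedupe_consecutive and the sample data are illustrative assumptions, not part of the original posts.

```python
from itertools import groupby
from operator import itemgetter


def dedupe_consecutive(log):
    """Keep the first entry of each run of consecutive (user_id, action) pairs."""
    # itemgetter with two names returns a (user_id, action) tuple for each dict.
    key = itemgetter('user_id', 'action')
    # groupby only groups adjacent items, so a later repeat starts a new group.
    return [next(group) for _, group in groupby(log, key=key)]


log = [
    {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time1'},
    {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time2'},
    {'user_id': 'id2', 'action': 'action2', 'timestamp': 'time4'},
    {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time7'},
]

print([d['timestamp'] for d in dedupe_consecutive(log)])
# ['time1', 'time4', 'time7']
```

Because groupby yields nothing for an empty iterable, this sketch returns an empty list for an empty log instead of raising.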

3
Use itertools.groupby to group consecutive identical actions by the same user, then take the first element of each group:
import itertools

def merge_actions(log):
    return [next(group) for key, group in itertools.groupby(log, lambda l: (l['user_id'], l['action']))]

2
If you use a loop, simply keep track of the last key you saw:
it = iter(log)
start = next(it)
od,prev = [start], start["user_id"]
for d in it:
    k = d["user_id"]
    if prev != k:
        od.append(d)
    prev = k

print(od)

[{'action': 'action1', 'timestamp': 'time1', 'user_id': 'id1'},
 {'action': 'action2', 'timestamp': 'time4', 'user_id': 'id2'},
 {'action': 'action2', 'timestamp': 'time5', 'user_id': 'id3'},
 {'action': 'action1', 'timestamp': 'time7', 'user_id': 'id1'}]

If the actions are not always grouped (that is, the same user may perform different actions in a row), check both keys:

it = iter(log)
start = next(it)
od, prev,act = [start], start["user_id"],start["action"]
for d in it:
    k1, k2 = d["user_id"], d["action"]
    if prev != k1 or k2 != act:
        od.append(d)
    prev, act = k1, k2
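
The two-key loop above can also be wrapped in a generator so that it works on any iterable and copes with an empty log. This is only a sketch under those assumptions; the name first_of_each_run and the sample data are hypothetical.

```python
def first_of_each_run(log):
    """Yield the first dict of every run of consecutive entries
    that share both user_id and action."""
    # A unique sentinel guarantees the first entry is always yielded.
    prev = object()
    for entry in log:
        key = (entry['user_id'], entry['action'])
        if key != prev:
            yield entry
        prev = key


log = [
    {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time1'},
    {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time2'},
    {'user_id': 'id1', 'action': 'action2', 'timestamp': 'time3'},
    {'user_id': 'id2', 'action': 'action2', 'timestamp': 'time4'},
]

print([d['timestamp'] for d in first_of_each_run(log)])
# ['time1', 'time3', 'time4']
```

Checking only user_id would drop the 'time3' entry here, since the same user switches to a different action without another user in between.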

1
Here is a verbose attempt using groupby:
from itertools import groupby
a = [{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time1'},
       {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time2'},
       {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time3'},
       {'user_id': 'id2', 'action': 'action2', 'timestamp': 'time4'},
       {'user_id': 'id3', 'action': 'action2', 'timestamp': 'time5'},
       {'user_id': 'id3', 'action': 'action2', 'timestamp': 'time6'},
       {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time7'},
       {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time8'}]
for u, grps in groupby(a, lambda d: d['user_id']):
    d_with_first_ts = sorted(grps, key = lambda user_dict: user_dict['timestamp'])[0]
    print('User: {}; Dict with first timestamp = {}'.format(u, d_with_first_ts))

You will get the following result:
User: id1; Dict with first timestamp = {'timestamp': 'time1', 'action': 'action1', 'user_id': 'id1'}
User: id2; Dict with first timestamp = {'timestamp': 'time4', 'action': 'action2', 'user_id': 'id2'}
User: id3; Dict with first timestamp = {'timestamp': 'time5', 'action': 'action2', 'user_id': 'id3'}
User: id1; Dict with first timestamp = {'timestamp': 'time7', 'action': 'action1', 'user_id': 'id1'}
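
Note that grouping on user_id alone gives the desired result for this sample only because each user repeats a single action. The sketch below, using hypothetical sample data, shows how the two groupings differ when one user performs different actions in a row. Since the log is stated to be sorted by timestamp, next() on each group is enough and no extra sorting is needed.

```python
from itertools import groupby

# Hypothetical log: the same user switches from action1 to action2.
sample = [
    {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time1'},
    {'user_id': 'id1', 'action': 'action2', 'timestamp': 'time2'},
    {'user_id': 'id1', 'action': 'action2', 'timestamp': 'time3'},
]

# Grouping by user only: the whole list is one group, so action2 is lost.
by_user = [next(g) for _, g in groupby(sample, key=lambda d: d['user_id'])]

# Grouping by user and action: the switch to action2 starts a new group.
by_user_and_action = [
    next(g) for _, g in groupby(sample, key=lambda d: (d['user_id'], d['action']))
]

print([d['timestamp'] for d in by_user])             # ['time1']
print([d['timestamp'] for d in by_user_and_action])  # ['time1', 'time2']
```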

1
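Your example groups all of a user's actions together, whereas I only want to merge consecutive actions, i.e. the result should contain {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time7'} - yana
In that case you can ignore the sorted_users line. I will update the code. - user142650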
