Overall, of the three solutions proposed for this problem, rdas's approach is probably the best one. With a few modifications it is about 53% faster than Jérôme's solution and 180 times faster than the original code. Сергей's solution was roughly 350 times slower on a trimmed-down subset of the results (1,000 entries); it also appears to scale badly, and I didn't have time to wait for a result on the full dataset.
The timings:

| Method                | Time               | Relative           |
|-----------------------|--------------------|--------------------|
| Original              | 102.83720700000413 | 180.4580695623077  |
| Jérôme                | 0.8722491000080481 | 1.5306171118096032 |
| Jérôme (updated)      | 0.9000219000154175 | 1.5793526426731528 |
| rdas                  | 0.6866130999987945 | 1.2048642527015463 |
| rdas (updated)        | 0.6580462999991141 | 1.1547354157572032 |
| rdas with defaultdict | 0.5698675999883562 | 1.0                |
In Jérôme's answer, the computation of the final result can be improved by replacing the final list-building loop with a list comprehension. Looking at the list-comprehension change in isolation, the updated code is roughly 38% faster. Note, though, that this is not the main contributor to the overall runtime.
[{'user': user, 'db': db, 'size': total_size} for user in unique_users for db in unique_dbs if (total_size := aggregate_dict.get((user, db)))]
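As a side note, the walrus-based filter also drops zero-size totals (0 is falsy), matching the `total_size > 0` check in the loop version. A minimal sketch with made-up data:

```python
# Made-up aggregate: (user, db) -> total size.
aggregate_dict = {(1, 'db1'): 500.0, (1, 'db2'): 0.0, (2, 'db1'): 120.0}
unique_users = {1, 2}
unique_dbs = {'db1', 'db2'}

# The walrus operator binds the lookup result to total_size and
# filters out both missing keys (None) and zero totals in one pass.
result = [{'user': user, 'db': db, 'size': total_size}
          for user in unique_users
          for db in unique_dbs
          if (total_size := aggregate_dict.get((user, db)))]

# Only the two entries with a non-zero total remain.
```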
A before/after comparison over a few runs (Method1 vs. Method2) gives:
Method1: 6.447344800049905
Method2: 4.851344000024255
Method1: 6.546811900043394
Method2: 4.816081999975722
Method1: 6.863985500007402
Method2: 4.755402299982961
Method1: 6.8024070000392385
Method2: 4.738170499971602
Method1: 6.524162200046703
Method2: 4.755566900013946
rdas's first update avoids repeated dictionary lookups by caching the result of the first lookup:
if not (user_dict := user_to_db_to_size.get(user)):
    user_to_db_to_size[user] = {db: size}
elif db not in user_dict:
    user_dict[db] = size
else:
    user_dict[db] += size
This was improved again by switching to collections.defaultdict, which removes the extra, comparatively expensive dictionary lookups entirely:
user_to_db_to_size = defaultdict(lambda: defaultdict(int))
for entry in big_list:
    user = entry['user']
    db = entry['db']
    size = int(entry['size'])
    user_to_db_to_size[user][db] += size
return user_to_db_to_size
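One caveat worth knowing (not part of the original answer): a nested `defaultdict` keeps inserting keys on mere lookup, so it can be worth freezing the result into plain dicts once aggregation is done. A sketch with made-up entries in the same shape as `big_list`:

```python
from collections import defaultdict

# Made-up input in the same shape as big_list.
entries = [{'user': 1, 'db': 'db1', 'size': '500'},
           {'user': 1, 'db': 'db1', 'size': '250'},
           {'user': 2, 'db': 'db2', 'size': '100'}]

user_to_db_to_size = defaultdict(lambda: defaultdict(int))
for entry in entries:
    # Missing keys are created on first access, so no membership
    # checks are needed; repeated (user, db) pairs just accumulate.
    user_to_db_to_size[entry['user']][entry['db']] += int(entry['size'])

# Freeze into ordinary dicts so later reads cannot silently
# insert new empty keys.
plain = {user: dict(dbs) for user, dbs in user_to_db_to_size.items()}
```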
This is the code used to run the benchmarks:
import random
import timeit
from collections import defaultdict
random.seed(1)
test_iterations = 100
big_list = [{'user': random.randint(0, 100), 'db': f'db{random.randint(1, 10)}', 'size': f'{random.randint(100, 90000)}' } for i in range(10000)]
unique_users = { i['user'] for i in big_list }
unique_dbs = { i['db'] for i in big_list }
def method0():
    result = []
    for user in unique_users:
        for db in unique_dbs:
            total_size = 0
            for i in big_list:
                if i['user'] == user and i['db'] == db:
                    total_size += float(i['size'])
            if total_size > 0:
                row = {}
                row['user'] = user
                row['db'] = db
                row['size'] = total_size
                result.append(row)
    return result
def method1():
    aggregate_dict = dict()
    for i in big_list:
        key = (i['user'], i['db'])
        value = float(i['size'])
        if key in aggregate_dict:
            aggregate_dict[key] += value
        else:
            aggregate_dict[key] = value
    result = []
    for user in unique_users:
        for db in unique_dbs:
            total_size = aggregate_dict.get((user, db))
            if total_size is not None and total_size > 0:
                result.append({'user': user, 'db': db, 'size': total_size})
    return result
def method2():
    aggregate_dict = dict()
    for i in big_list:
        key = (i['user'], i['db'])
        value = float(i['size'])
        if key in aggregate_dict:
            aggregate_dict[key] += value
        else:
            aggregate_dict[key] = value
    return [{'user': user, 'db': db, 'size': total_size} for user in unique_users for db in unique_dbs if (total_size := aggregate_dict.get((user, db)))]
def method3():
    user_to_db_to_size = {}
    for entry in big_list:
        user = entry['user']
        db = entry['db']
        size = int(entry['size'])
        if user not in user_to_db_to_size:
            user_to_db_to_size[user] = {}
        if db not in user_to_db_to_size[user]:
            user_to_db_to_size[user][db] = 0
        user_to_db_to_size[user][db] += size
    return user_to_db_to_size
def method4():
    user_to_db_to_size = {}
    for entry in big_list:
        user = entry['user']
        db = entry['db']
        size = int(entry['size'])
        if not (user_dict := user_to_db_to_size.get(user)):
            user_to_db_to_size[user] = {db: size}
        elif db not in user_dict:
            user_dict[db] = size
        else:
            user_dict[db] += size
    return user_to_db_to_size
def method5():
    user_to_db_to_size = defaultdict(lambda: defaultdict(int))
    for entry in big_list:
        user = entry['user']
        db = entry['db']
        size = int(entry['size'])
        user_to_db_to_size[user][db] += size
    return user_to_db_to_size
assert (m0 := method0()) == method1()
print('Method1 OK')
assert m0 == method2()
print('Method2 OK')
assert all(v['size'] == method3()[v['user']][v['db']] for v in m0)
print('Method3 OK')
assert all(v['size'] == method4()[v['user']][v['db']] for v in m0)
print('Method4 OK')
assert all(v['size'] == method5()[v['user']][v['db']] for v in m0)
print('Method5 OK')
t0 = timeit.timeit(method0, number=test_iterations)
t1 = timeit.timeit(method1, number=test_iterations)
t2 = timeit.timeit(method2, number=test_iterations)
t3 = timeit.timeit(method3, number=test_iterations)
t4 = timeit.timeit(method4, number=test_iterations)
t5 = timeit.timeit(method5, number=test_iterations)
tmin = min((t0, t1, t2, t3, t4, t5))
print('| Method           | Time                 | Relative             |')
print('|------------------|----------------------|----------------------|')
print(f'| Original | {t0} | {t0 / tmin} |')
print(f'| Jérôme | {t1} | {t1 / tmin} |')
print(f'| Jérôme (updated) | {t2} | {t2 / tmin} |')
print(f'| rdas | {t3} | {t3 / tmin} |')
print(f'| rdas (updated) | {t4} | {t4 / tmin} |')
print(f'| defaultdict | {t5} | {t5 / tmin} |')
The output is:
Method1 OK
Method2 OK
Method3 OK
Method4 OK
Method5 OK
| Method | Time | Relative |
|------------------|----------------------|----------------------|
| Original | 102.83720700000413 | 180.4580695623077 |
| Jérôme | 0.8722491000080481 | 1.5306171118096032 |
| Jérôme (updated) | 0.9000219000154175 | 1.5793526426731528 |
| rdas | 0.6866130999987945 | 1.2048642527015463 |
| rdas (updated) | 0.6580462999991141 | 1.1547354157572032 |
| defaultdict | 0.5698675999883562 | 1.0 |
Note: Сергей's approach is roughly equivalent to:
from collections import Counter

def c():
    tmp = Counter()
    for d in big_list:
        tmp = tmp + Counter(d)
    return tmp
so a new dictionary is created on every pass through the loop, and each dictionary it creates is larger than the one before. I suspect this scales as N ** 2 or worse.
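The quadratic behaviour comes from `tmp + Counter(d)` building a brand-new Counter each time, copying everything accumulated so far; `Counter.update` adds into the existing Counter without that copy. A sketch with simplified numeric entries (the real entries carry string sizes, which Counter addition would not aggregate meaningfully anyway):

```python
from collections import Counter

items = [{'size': 1} for _ in range(1000)]

def rebuild():
    # Roughly O(N**2): each + allocates a new Counter and copies tmp.
    tmp = Counter()
    for d in items:
        tmp = tmp + Counter(d)
    return tmp

def in_place():
    # O(N): update() adds the counts into tmp without copying it.
    tmp = Counter()
    for d in items:
        tmp.update(d)
    return tmp
```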