如何在Python中找到字典列表中项目的累积总和

3

我有一个列表,像这样

a=[{'time':3},{'time':4},{'time':5}]

我希望能够以相反的顺序获取值的累加和,就像这样
b=[{'exp':3,'cumsum':12},{'exp':4,'cumsum':9},{'exp':5,'cumsum':5}]

最高效的方式是什么?我已经阅读了其他答案,使用numpy可以解决问题。

a=[1,2,3]
b=numpy.cumsum(a)

但我还需要将cumsum插入字典中


1
最高效的方法是使用numpy,特别是对于非常大的列表。是否需要将cumsum转换为字典取决于您的应用程序,这可能只是来自不良设计。为什么不使用pandas呢?它似乎完美地适合您的示例。 - Imanol Luengo
@ImanolLuengo,谢谢。我的列表不是很大,但长度各不相同。这样的列表数量有一百万。 - NG_21
使用 timeitcProfile 来评估解决方案。请告诉我们结果。对于将列表转换为另一种对象类型的解决方案,请确保在计时中包含该转换。 - wwii
6个回答

8
a=[{'time':3},{'time':4},{'time':5}]
b = []
cumsum = 0
for e in a[::-1]:
    cumsum += e['time']
    b.insert(0, {'exp':e['time'], 'cumsum':cumsum})
print(b)

输出:

[{'exp': 3, 'cumsum': 12}, {'exp': 4, 'cumsum': 9}, {'exp': 5, 'cumsum': 5}]


原来在列表的开头插入元素是slow (O(n))。相反,尝试使用deque (O(1)):

from collections import deque


a=[{'time':3},{'time':4},{'time':5}]
b = deque()
cumsum = 0
for e in a[::-1]:
    cumsum += e['time']
    b.appendleft({'exp':e['time'], 'cumsum':cumsum})
print(b)
print(list(b))

输出:

deque([{'cumsum': 12, 'exp': 3}, {'cumsum': 9, 'exp': 4}, {'cumsum': 5, 'exp': 5}])
[{'cumsum': 12, 'exp': 3}, {'cumsum': 9, 'exp': 4}, {'cumsum': 5, 'exp': 5}]

这里是一个测试ITT方法速度的脚本,同时还有一个带有时间结果的图表:

enter image description here

from collections import deque
from copy import deepcopy
import numpy as np
import pandas as pd
from random import randint
from time import time


def Nehal_pandas(l):
    df = pd.DataFrame(l)
    df['cumsum'] = df.ix[::-1, 'time'].cumsum()[::-1]
    df.columns = ['exp', 'cumsum']
    return df.to_json(orient='records')


def Merlin_pandas(l):
    df           = pd.DataFrame(l).rename(columns={'time':'exp'})
    df["cumsum"] = df['exp'][::-1].cumsum()
    return df.to_dict(orient='records')


def RahulKP_numpy(l):
    cumsum_list = np.cumsum([i['time'] for i in l][::-1])[::-1]
    for i,j in zip(l,cumsum_list):
        i.update({'cumsum':j})


def Divakar_pandas(l):
    df = pd.DataFrame(l)
    df.columns = ['exp']
    df['cumsum'] = (df[::-1].cumsum())[::-1]
    return df.T.to_dict().values()


def cb_insert_0(l):
    b = []
    cumsum = 0
    for e in l[::-1]:
        cumsum += e['time']
        b.insert(0, {'exp':e['time'], 'cumsum':cumsum})
    return b


def cb_deque(l):
    b = deque()
    cumsum = 0
    for e in l[::-1]:
        cumsum += e['time']
        b.appendleft({'exp':e['time'], 'cumsum':cumsum})
    b = list(b)
    return b


def cb_deque_noconvert(l):
    b = deque()
    cumsum = 0
    for e in l[::-1]:
        cumsum += e['time']
        b.appendleft({'exp':e['time'], 'cumsum':cumsum})
    return b


def hpaulj_gen(l, var='value'):
    cum=0
    for i in l:
        j=i[var]
        cum += j
        yield {var:j, 'sum':cum}


def hpaulj_inplace(l, var='time'):
    cum = 0
    for i in l:
        cum += i[var]
        i['sum'] = cum


def test(number_of_lists, min_list_length, max_list_length):
    test_lists = []

    for _ in range(number_of_lists):
        test_list = []
        number_of_dicts = randint(min_list_length,max_list_length)
        for __ in range(number_of_dicts):
            random_value = randint(0,50)
            test_list.append({'time':random_value})
        test_lists.append(test_list)

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        res = list(hpaulj_gen(l[::-1], 'time'))[::-1]
    elapsed_time = time() - start_time
    print('hpaulj generator:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        hpaulj_inplace(l[::-1])
    elapsed_time = time() - start_time
    print('hpaulj in place:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        res = cb_insert_0(l)
    elapsed_time = time() - start_time
    print('craig insert list at 0:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        res = cb_deque(l)
    elapsed_time = time() - start_time
    print('craig deque:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        res = cb_deque_noconvert(l)
    elapsed_time = time() - start_time
    print('craig deque no convert:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        RahulKP_numpy(l) # l changed in place
    elapsed_time = time() - start_time
    print('Rahul K P numpy:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        res = Divakar_pandas(l)
    elapsed_time = time() - start_time
    print('Divakar pandas:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        res = Nehal_pandas(l)
    elapsed_time = time() - start_time
    print('Nehal pandas:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        res = Merlin_pandas(l)
    elapsed_time = time() - start_time
    print('Merlin pandas:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

1
谢谢@craig,但这是合理高效的方法吗?因为我有大约一百万个像这样的列表。 - NG_21
2
计时各种方法,然后告诉我们 :) - Craig Burgler

2
一个基于生成器的解决方案:
def foo(a, var='value'):
    cum=0
    for i in a:
        j=i[var]
        cum += j
        yield {var:j, 'sum':cum}

In [79]: a=[{'time':i} for i in range(5)]
In [80]: list(foo(a[::-1], var='time'))[::-1]
Out[80]: 
[{'sum': 10, 'time': 0},
 {'sum': 10, 'time': 1},
 {'sum': 9, 'time': 2},
 {'sum': 7, 'time': 3},
 {'sum': 4, 'time': 4}]

在快速的测试中,这个与cb_insert_0相竞争。

原地版本表现得更好:

def foo2(a, var='time'):
    cum = 0
    for i in a:
        cum += i[var]
        i['sum'] = cum
foo2(a[::-1])

1
原地 版本而言,加号。 - wwii
添加到时间 :) - Craig Burgler

1
尝试这个,
cumsum_list = np.cumsum([i['time'] for i in a][::-1])[::-1]
for i,j in zip(a,cumsum_list):
     i.update({'cumsum':j})

Result

[{'cumsum': 12, 'time': 3}, {'cumsum': 9, 'time': 4}, {'cumsum': 5, 'time': 5}]

效率
转换成一个函数,
In [49]: def convert_dict(a):
....:     cumsum_list = np.cumsum([i['time'] for i in a][::-1])[::-1]
....:     for i,j in zip(a,cumsum_list):
....:              i.update({'cumsum':j})
....:     return a

然后就是结果了,

In [51]: convert_dict(a)
Out[51]: [{'cumsum': 12, 'time': 3}, {'cumsum': 9, 'time': 4}, {'cumsum': 5, 'time': 5}]

最终效率,
In [52]: %timeit convert_dict(a)
The slowest run took 12.84 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 12.1 µs per loop

1
这是另一种使用 pandas 的方法 -
df = pd.DataFrame(a)
df.columns = ['exp']
df['cumsum'] = (df[::-1].cumsum())[::-1]
out = df.T.to_dict().values()

样例输入,输出 -

In [396]: a
Out[396]: [{'time': 3}, {'time': 4}, {'time': 5}]

In [397]: out
Out[397]: [{'cumsum': 12, 'exp': 3}, {'cumsum': 9, 'exp': 4}, {'cumsum': 5, 'exp': 5}

Pandas在将字典转换为数据框和再次转换的速度有多快?这肯定比cumsum花费更多的时间。 - hpaulj
@hpaulj 真的不知道!我猜看到一些时间会很好。 - Divakar

1

试试这个:

a            = [{'time':3},{'time':4},{'time':5}]
df           = pd.DataFrame(a).rename(columns={'time':'exp'})
df["cumsum"] = df['exp'][::-1].cumsum()
df.to_dict(orient='records')

字典是无序的。

 [{'cumsum': 12, 'exp': 3}, {'cumsum': 9, 'exp': 4}, {'cumsum': 5, 'exp': 5}]

0

使用 pandas

In [4]: df = pd.DataFrame([{'time':3},{'time':4},{'time':5}])

In [5]: df
Out[5]: 
   time
0     3
1     4
2     5

In [6]: df['cumsum'] = df.ix[::-1, 'time'].cumsum()[::-1]

In [7]: df
Out[7]: 
   time  cumsum
0     3      12
1     4       9
2     5       5

In [8]: df.columns = ['exp', 'cumsum']

In [9]: df
Out[9]: 
   exp  cumsum
0    3      12
1    4       9
2    5       5

In [10]: df.to_json(orient='records')
Out[10]: '[{"exp":3,"cumsum":12},{"exp":4,"cumsum":9},{"exp":5,"cumsum":5}]'

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接