How can I quickly chain multiple indexing operations on a list?

Question

I want to classify some data, and for that I need to chain indexing operations on a Python list. Simplified, I have a nested list:

lst = [[[1], [2]], [[3, 3], [4]], [[5], [6,6,6]]]

I want to iterate over the product of the first two indices while keeping the third index fixed at 0:

from itertools import product

for index1, index2 in product(range(3), range(2)):
    print(lst[index1][index2][0])

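However, I'd like this to be more general, without knowing in advance how many levels of substructure I need to descend (I want the number of ranges passed to itertools.product to be variable).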

I'm struggling with how to generalize [index1][index2][0] so that it accepts an arbitrary number of indices. The best I could come up with is functools.reduce:

from functools import reduce

for indices in product(range(3), range(2)):
    print(reduce(list.__getitem__, indices, lst)[0])

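This seems awfully complicated (and is much slower than manual indexing), so I'm wondering whether there is a better and faster way. I'm using both Python 2.x and 3.x, and external libraries are fine (but the solution shouldn't require NumPy or NumPy-based packages).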
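To make the goal concrete, here is a minimal sketch of the variable-depth version described above. It unpacks a list of ranges into product and uses operator.getitem, which works on any subscriptable object rather than only on list. The helper name first_items and its sizes parameter are hypothetical, introduced only for illustration.

```python
from functools import reduce  # available from functools on Python 2.6+ and 3.x
from itertools import product
from operator import getitem

lst = [[[1], [2]], [[3, 3], [4]], [[5], [6, 6, 6]]]

def first_items(nested, sizes):
    """Yield nested[i1][i2]...[in][0] for every index combination.

    `sizes` (hypothetical parameter) holds one length per level to iterate.
    """
    for indices in product(*[range(n) for n in sizes]):
        # reduce applies getitem once per index, descending one level each time
        yield reduce(getitem, indices, nested)[0]

print(list(first_items(lst, (3, 2))))  # [1, 2, 3, 4, 5, 6]
```

Passing a shorter or longer sizes tuple changes how deep the lookup descends without touching the loop body.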

- MSeifert

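By your statement "However, I'd like this to be more general, without knowing in advance how many levels of substructure I need to descend", do you mean that the depth of the list is unknown, i.e. it could be nested n levels deep? - Moinuddin Quadri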

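Sorry if I was unclear. The variable part is the number (n) of ranges given (and hence the length of indices). The depth of the list is "unknown", but there are at least n+1 levels of substructure. I'll update the question. - MSeifert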

3 Answers

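I would use reduce for this (a builtin in Python 2, functools.reduce in Python 3). It doesn't seem that complicated, and in my tests it wasn't much slower.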

from functools import reduce  # needed on Python 3; reduce is a builtin only on Python 2
from itertools import product

for x in product(range(3), range(2)):
    rg = reduce(lambda result, index: result[index], x, lst)
    value = rg[0]

If you are worried about the time penalty of reduce, you can use a for loop instead:

for x in product(range(3), range(2)):
    value = lst
    for index in x:
        value = value[index]
    value = value[0]
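The same loop can be wrapped in a small helper so that the number of ranges stays variable. This is only a sketch: the names get_nested and depth_ranges are hypothetical and not part of the answer above.

```python
from itertools import product

lst = [[[1], [2]], [[3, 3], [4]], [[5], [6, 6, 6]]]

def get_nested(nested, indices):
    """Return nested[i0][i1]... for the given sequence of indices."""
    value = nested
    for index in indices:
        value = value[index]  # descend one level per index
    return value

# One range per level to iterate; add or remove entries to change the depth.
depth_ranges = [range(3), range(2)]
values = [get_nested(lst, idx)[0] for idx in product(*depth_ranges)]
print(values)  # [1, 2, 3, 4, 5, 6]
```

The extra function call adds some overhead per lookup, so inlining the loop is likely preferable where speed matters most.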

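This will be slower than manual indexing, because a for loop needs extra operations to determine when to stop. The question is whether that speed difference is worth more to you than the flexibility of specifying an arbitrary depth.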

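As for why you would use reduce rather than for: the JavaScript community has a long-running style debate about whether to use reduce, map, and filter on arrays or to write the equivalent for loops, and you may want to look at that debate to decide which side you are on.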

Timing with the for loop:

In [22]: stmt = '''
    ...: from itertools import product
    ...: def go():
    ...:   lst = [[[1], [2]], [[3, 3], [4]], [[5], [6,6,6]]]
    ...:   for x in product(range(3), range(2)):
    ...:     # rg = reduce(lambda result, index: result[index], x, lst)
    ...:     value = lst
    ...:     for index in x:
    ...:         value = value[index]
    ...:     value = value[0]
    ...:     # value = lst[x[0]][x[1]][0]
    ...: '''

In [23]: timeit(setup=stmt, stmt='go()', number=1000000)
Out[23]: 4.003296852111816

Timing with `reduce`:

In [18]: stmt = '''
    ...: from itertools import product
    ...: def go():
    ...:   lst = [[[1], [2]], [[3, 3], [4]], [[5], [6,6,6]]]
    ...:   for x in product(range(3), range(2)):
    ...:     rg = reduce(lambda result, index: result[index], x, lst)
    ...:     value = rg[0]
    ...:     # value = lst[x[0]][x[1]][0]
    ...: '''

In [19]: timeit(setup=stmt, stmt='go()', number=1000000)
Out[19]: 6.164631128311157

Timing with manual indexing:

In [16]: stmt = '''
    ...: from itertools import product
    ...: def go():
    ...:   lst = [[[1], [2]], [[3, 3], [4]], [[5], [6,6,6]]]
    ...:   for x in product(range(3), range(2)):
    ...:     # rg = reduce(lambda result, index: result[index], x, lst)
    ...:     value = lst[x[0]][x[1]][0]
    ...: '''

In [17]: timeit(setup=stmt, stmt='go()', number=1000000)
Out[17]: 3.633723020553589
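To repeat the comparison outside IPython, a rough sketch like the following may help. It is an assumption about how one might set up the benchmark, not the exact session above: the function names are hypothetical, each variant returns its values so the results can be checked against each other, and the absolute numbers will differ from machine to machine.

```python
from functools import reduce
from itertools import product
from timeit import timeit

lst = [[[1], [2]], [[3, 3], [4]], [[5], [6, 6, 6]]]

def manual():
    # Hard-coded depth of two indices plus the fixed [0]
    return [lst[x[0]][x[1]][0] for x in product(range(3), range(2))]

def with_loop():
    out = []
    for x in product(range(3), range(2)):
        value = lst
        for index in x:
            value = value[index]
        out.append(value[0])
    return out

def with_reduce():
    return [reduce(lambda result, index: result[index], x, lst)[0]
            for x in product(range(3), range(2))]

# timeit accepts a callable directly, so no statement string is needed.
for func in (manual, with_loop, with_reduce):
    print(func.__name__, timeit(func, number=100000))
```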

- 2ps

The for loop seems to be the fastest and most straightforward approach. Thanks! - MSeifert

How about dynamically creating the hard-coded indexing?

lst = [[[1], [2]], [[3, 3], [4]], [[5], [6,6,6]]]

from itertools import product

for index1, index2 in product(range(3), range(2)):
    print(lst[index1][index2][0])


# need depth info from somewhere to create hard coded indexing

prod_gen = product(range(3), range(2))

first = next(prod_gen)

indx_depth = len(first) + 1

# build the source once so that the printed text is exactly what gets exec'd
src = ('def IndexThisList(lst, indx_itrbl):\n' +
       '        return lst' + ''.join('[indx_itrbl[' + str(i) + ']]'
                                      for i in range(indx_depth)))
exec(src)

# just to see what it exec'd:
print(src)
# the exec is only invoked again when changing the indexing depth
# for accessing the list with its currently instantiated depth of indexing
# just use the current instance of the generated function

print(IndexThisList(lst, first + (0,)))
for itpl in prod_gen: 
    print (IndexThisList(lst, itpl + (0,)))

1
2
3
4
5
6
def IndexThisList(lst, indx_itrbl):
        return lst[indx_itrbl[0]][indx_itrbl[1]][indx_itrbl[2]]
1
2
3
4
5
6

I'm only a beginner programmer. It looks like my exec should be wrapped in another function that takes indx_depth, but I can't work that out yet.
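One possible way to do that wrapping is sketched below. The factory name make_indexer and the generated function name index_this_list are hypothetical. The exec call runs in a private dictionary, so the generated function does not leak into the module namespace.

```python
from itertools import product

def make_indexer(depth):
    """Build a function equivalent to lst[i[0]][i[1]]...[i[depth - 1]]."""
    src = ('def index_this_list(lst, indices):\n'
           '    return lst' + ''.join('[indices[%d]]' % i for i in range(depth)))
    namespace = {}
    exec(src, namespace)  # define the function inside a private namespace
    return namespace['index_this_list']

lst = [[[1], [2]], [[3, 3], [4]], [[5], [6, 6, 6]]]

# exec runs once per depth; the returned function is then reused for every lookup.
index3 = make_indexer(3)
results = [index3(lst, idx + (0,)) for idx in product(range(3), range(2))]
print(results)  # [1, 2, 3, 4, 5, 6]
```

Calling make_indexer again is only needed when the indexing depth changes.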

- f5r5e5d

Why? When your own code is the only source of the code strings, there is no security problem. - f5r5e5d

Mainly because they are a hassle to set up correctly and, worse, very hard to maintain. - MSeifert

"Fastest" isn't always the most elegant; you really need timeit to compare. - f5r5e5d

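I did time it, and exec seems to be a bottleneck: running exec takes roughly 100 microseconds, which is already longer than the other functions need for moderately sized inputs. The generated function is also slower than the for loop (independent of the exec time), probably because of function call overhead. - MSeifert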

- B. M. · Accepted Answer

I suggest a recursive approach.

def theshape(lst):
    l = lst
    shape = []
    # follow the first element of each level until a non-list is reached
    while isinstance(l, list):
        shape.append(len(l))
        l = l[0]
    return shape

This function finds the shape of your structure, which is expected to be regular in every dimension except the last.
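Because theshape only follows the first element at each level, it reports the same kind of result for regular and ragged data. A small illustration (the variable names regular and ragged are hypothetical):

```python
def theshape(lst):
    l = lst
    shape = []
    while isinstance(l, list):
        shape.append(len(l))
        l = l[0]
    return shape

regular = [[[1], [2]], [[3], [4]]]
ragged = [[[1], [2]], [[3, 3], [4]], [[5], [6, 6, 6]]]  # the list from the question

print(theshape(regular))  # [2, 2, 1]
print(theshape(ragged))   # [3, 2, 1]; the last entry only reflects the first leaf list [1]
```

For the ragged list, the final dimension of 1 means that only the first item of each innermost list would be visited. An empty list in a first position would raise IndexError at l[0].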

def browse(lst):
    shape=theshape(lst)
    ndim=len(shape)
    def level(l,k):
        if k==ndim:
            print(l)
        else:
            for i in range(shape[k]):
                level(l[i],k+1)
    level(lst,0)

This function browses all levels recursively, which minimizes pointer changes.

A simple example:

from numpy import arange  # NumPy is used here only to build the test data
u=arange(2**6).reshape(4,2,1,2,2,1,1,2).tolist()
browse(u)
0
2
.
.
.
62

Some tests on large structures (printing is suppressed with print = lambda _ : None):

from functools import reduce  # needed on Python 3
from itertools import product

def go(lst):
    for x in product(*[range(k) for k in theshape(lst)]):
        print(reduce(lambda result, index: result[index], x, lst))

In [1]: u=arange(2**21).reshape([2]*21).tolist()

In [2]: %time go(u)
Wall time: 14.8 s

In [3]: %time browse(u)
Wall time: 3.5 s

In [5]: u=arange(2**21).reshape([1]*30+[2**21]+[1]).tolist()

In [6]: %time go(u)
Wall time: 18 s

In [7]: %time browse(u)
Wall time: 3.48 s

In [8]: u=arange(2**21).reshape([1]+[2**21]+[1]*30).tolist()

In [9]: %time go(u)
Wall time: 14 s

In [10]: %time browse(u)
Wall time: 58.1 s

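This shows that performance depends strongly on the shape of the data structure.

Edit:

The simplest way is also the fastest. theshape is not necessary.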

def browse2(lst):
    if isinstance(lst, list):
        for l in lst:
            browse2(l)
    else:
        print(lst)

It is often 30% faster than browse, and it works regardless of the structure of the list.
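If the values are needed rather than printed, the same recursion can be written as a generator. This is a sketch of a variation, not part of the original answer. The name iter_leaves is hypothetical, and the nested for loop is used instead of yield from so that it also runs on Python 2.

```python
def iter_leaves(nested):
    """Yield every non-list item of an arbitrarily nested list, depth first."""
    if isinstance(nested, list):
        for item in nested:
            for leaf in iter_leaves(item):
                yield leaf
    else:
        yield nested

lst = [[[1], [2]], [[3, 3], [4]], [[5], [6, 6, 6]]]
print(list(iter_leaves(lst)))  # [1, 2, 3, 3, 4, 5, 6, 6, 6]
```

The nested generators add overhead of their own, so timing this against browse2 on your own data is advisable before relying on it.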