How can I build an index relation between two large lists more quickly?

Question

5

Given the following two lists:

import numpy as np
import random as rnd

num = 100000

a = list(range(num))
b = [rnd.randint(0, num) for r in range(num)]

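Taking a as the reference list, I used a list comprehension to create a list (atob) that gives, for each element of a, the position of the same element in the other list (b).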

atob = [np.abs(np.subtract(b, i)).argmin() for i in a]
print(f"index a to b: {atob}")

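When the lists are small, this does not take long. However, I found that building atob becomes very time-consuming as the lists grow.

Is there a faster way to get atob, or is there currently no way?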

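(Edited after the answers. This revision is intended for future readers.) Thank you very much for the replies! I analyzed the code based on the answers.

Checking the output

The results were compared with num = 20.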

import numpy as np
import random as rnd
import time

# set lists
num = 20
a = list(range(num))
# b = [rnd.randint(0, num) for r in range(num)] # Duplicate numbers occur among the elements in the list
b = rnd.sample(range(0, num), num)
print(f"list a: {a}")
print(f"list b: {b}\n")

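# set arrays for the Numba method
arr_a = np.array(range(num))
# NOTE: this draws a second, independent random sample, so arr_b differs from b.
# Use arr_b = np.array(b) to make the Numba result comparable with atob.
arr_b = np.array(rnd.sample(range(0, num), num))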

# --------------------------------------------------------- #
# existing method
ck_time = time.time()
atob = [np.abs(np.subtract(b, i)).argmin() for i in a]
print(f"index a to b (existed): {atob}, type: {type(atob)}")
print(f"running time (existed): {time.time() - ck_time}\n")
ck_time = time.time()

# dankal444 method
dk = {val: idx for idx, val in enumerate(b)}
atob_dk = [dk.get(n) for n in a] # same as atob_dk = [dk.get(n) for n in range(num)]
print(f"index a to b (dankal): {atob_dk}, type: {type(atob_dk)}")
print(f"running time (dankal): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_dk)}\n")
ck_time = time.time()

# smp55 method
comb = np.array([a, b]).transpose()
atob_smp = comb[comb[:, 1].argsort()][:, 0]
print(f"index a to b (smp55): {atob_smp}, type: {type(atob_smp)}")
print(f"running time (smp55): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_smp)}\n")
ck_time = time.time()

# Roman method
from numba import jit
@jit(nopython=True)
def find_argmin(_a, _b):
    out = np.empty_like(_a)  # allocating result array
    for i in range(_a.shape[0]):
        out[i] = np.abs(np.subtract(_b, _a[i])).argmin()
    return out

atob_rom = find_argmin(arr_a, arr_b)
print(f"index a to b (Roman): {atob_rom}, type: {type(atob_rom)}")
print(f"running time (Roman): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_rom)}\n")
ck_time = time.time()

# Alain method
from bisect import bisect_left
ub   = {n:-i for i,n in enumerate(reversed(b),1-len(b))}  # unique first pos
sb   = sorted(ub.items())                                 # sorted to bisect
ib   = (bisect_left(sb,(n,0)) for n in a)                 # index of >= val
rb   = ((sb[i-1],sb[min(i,len(sb)-1)]) for i in ib)       # low and high pairs
atob_ala = [ i if (abs(lo-n),i)<(abs(hi-n),j) else j      # closest index
               for ((lo,i),(hi,j)),n in zip(rb,a) ]
print(f"index a to b (Alain): {atob_ala}, type: {type(atob_ala)}")
print(f"running time (Alain): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_ala)}\n")
ck_time = time.time()

# ken method
b_sorted, b_sort_indices = np.unique(b, return_index=True)
def find_nearest(value):
    """Finds the nearest value from b."""
    right_index = np.searchsorted(b_sorted[:-1], value)
    left_index = max(0, right_index - 1)
    right_delta = b_sorted[right_index] - value
    left_delta = value - b_sorted[left_index]
    if right_delta == left_delta:
        # This is only necessary to replicate the behavior of your original code.
        return min(b_sort_indices[left_index], b_sort_indices[right_index])
    elif left_delta < right_delta:
        return b_sort_indices[left_index]
    else:
        return b_sort_indices[right_index]

atob_ken = [find_nearest(ai) for ai in a]
print(f"index a to b (ken): {atob_ken}, type: {type(atob_ken)}")
print(f"running time (ken): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_ken)}\n")
ck_time = time.time()

The output of the code above is:

list a: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
list b: [9, 12, 0, 2, 3, 15, 4, 16, 13, 6, 7, 18, 14, 10, 1, 8, 5, 17, 11, 19]

index a to b (existed): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (existed): 0.00024008750915527344

index a to b (dankal): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (dankal): 1.5497207641601562e-05
equal to exist method: True

index a to b (smp55): [ 2 14  3  4  6 16  9 10 15  0 13 18  1  8 12  5  7 17 11 19], type: <class 'numpy.ndarray'>
running time (smp55): 0.00020551681518554688
equal to exist method: True

index a to b (Roman): [17 11  1  6 16 14  9  4  8  3  5 12  7  2 19 15 18 13  0 10], type: <class 'numpy.ndarray'>
running time (Roman): 0.5710980892181396
equal to exist method: False

index a to b (Alain): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (Alain): 3.552436828613281e-05
equal to exist method: True

index a to b (ken): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (ken): 0.00011754035949707031
equal to exist method: True

Running time as the list size increases

When I run the code with num = 1000000:

running time (dankal): 0.45094847679138184

running time (smp55): 0.36011743545532227

running time (Alain): 2.178112030029297

running time (ken): 2.663684368133545

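(With Roman's method, it was hard to measure the time as the size increased.)

Memory usage also needs to be checked, but as a first result, @smp55's method is the fastest of the answers at producing the list. (I believe there are other good methods as well.)

Thanks again for everyone's attention and replies!!!

(Further replies and comments are welcome. If anyone has a good idea, please share it!)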

- Swani
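
The timings above come from a single time.time() reading per method and include the print calls. As a hedged sketch of a steadier measurement, the following uses timeit and assumes b is a permutation of a, as in the revised benchmark. The helper names dict_lookup and sort_lookup are illustrative and do not come from the answers; sort_lookup is a pure-Python stand-in for the sort-based idea.

```python
import random as rnd
import timeit

num = 10_000
a = list(range(num))
b = rnd.sample(range(num), num)  # assumption: b is a permutation of a

def dict_lookup():
    # value -> index table, then one lookup per element of a
    index_of = {val: idx for idx, val in enumerate(b)}
    return [index_of.get(n) for n in a]

def sort_lookup():
    # positions of b ordered by their value; for a permutation this is
    # the index in b of 0, 1, 2, ...
    return sorted(range(num), key=b.__getitem__)

for fn in (dict_lookup, sort_lookup):
    # best of 3 repeats, 5 calls each, reported per call
    best = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f"{fn.__name__}: {best:.6f} s per call")
```

Taking the minimum over several repeats reduces the influence of one-off costs such as a warm-up or a busy machine.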

3

np.subtract(b, i).argmin() returns the same value every time. Is this the behavior you expect? - ken

1

Another idea: build a dictionary that keeps the indices of the values of b, b_value_to_idx = [value: idx for idx, value in enumerate(b) - dankal444

1

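I made a small mistake: I wrote a list comprehension, but it should be a dict comprehension: {value: idx for idx, value in enumerate(b)}. And yes, this logic works, and it is even faster than a simple for loop. - dankal444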

1

Are a and b the actual values you will use? At the very least, b should be a numpy array, which you can get by adding b = np.array(b) before computing atob. - ken

1

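@ken, this is part of an algorithm I am designing. The random part can be replaced with actual values. (I declared b as random numbers, but it should actually contain no duplicates so that a one-to-one mapping is possible.) I made this example up at random. Where should b = np.array(b) go? - Swani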
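
A minimal sketch of the dictionary idea from the comments, assuming only exact matches are needed (unlike the argmin in the question, it does not fall back to the nearest value). A plain dict comprehension keeps the last index of a duplicated value, whereas setdefault keeps the first, which is the index argmin returns on ties. The names first_index and atob_exact are illustrative.

```python
import random as rnd

num = 20
a = list(range(num))
b = [rnd.randint(0, num) for _ in range(num)]  # may contain duplicates

# Map each value in b to the index of its FIRST occurrence.
# setdefault stores a key only the first time it is seen.
first_index = {}
for idx, value in enumerate(b):
    first_index.setdefault(value, idx)

# Exact-match lookup: None marks values of a that never occur in b.
atob_exact = [first_index.get(n) for n in a]
print(atob_exact)
```

If b is guaranteed to have no duplicates, the dict comprehension and this loop build the same table.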

4 Answers

1

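As @dankal444 suggested in the comments, sorting is a good approach. The following code exactly replicates the results of your code, but takes about 0.25 seconds on my machine.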

import numpy as np
import random as rnd

num = 100000


a = list(range(num))
b = np.array([rnd.randint(0, num) for _ in range(num)])

# You can also create them with numpy like this.
# a = np.arange(num)
# b = np.random.randint(0, num, size=num)

# This will sort and remove duplicate items at the same time.
b_sorted, b_sort_indices = np.unique(b, return_index=True)


def find_nearest(value):
    """Finds the nearest value from b."""
    right_index = np.searchsorted(b_sorted[:-1], value)
    left_index = max(0, right_index - 1)
    right_delta = b_sorted[right_index] - value
    left_delta = value - b_sorted[left_index]
    if right_delta == left_delta:
        # This is only necessary to replicate the behavior of your original code.
        return min(b_sort_indices[left_index], b_sort_indices[right_index])
    elif left_delta < right_delta:
        return b_sort_indices[left_index]
    else:
        return b_sort_indices[right_index]


atob = [find_nearest(ai) for ai in a]

Maybe we can speed this up further by sorting a, but I don't know how fast you need it to be, so I'll leave this as my answer for now.

- ken
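
One possible further speed-up, sketched here under the assumption that a and b are integer NumPy arrays, is to call np.searchsorted once for the whole of a instead of once per element. This is not part of the answer above; the names right, left, take_left and atob_vec are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
num = 1000
a = np.arange(num)
b = rng.integers(0, num + 1, size=num)  # same value range as rnd.randint(0, num)

# Sorted distinct values of b and the index of each value's first occurrence.
b_sorted, first_idx = np.unique(b, return_index=True)

# For every element of a at once: position of the first sorted value >= a[i],
# clamped so that it stays a valid index.
right = np.clip(np.searchsorted(b_sorted, a), 0, len(b_sorted) - 1)
left = np.clip(right - 1, 0, None)

left_delta = np.abs(a - b_sorted[left])
right_delta = np.abs(b_sorted[right] - a)

# Pick the closer neighbour; on equal distance take the smaller original index,
# which is what argmin does in the question's code.
take_left = (left_delta < right_delta) | (
    (left_delta == right_delta) & (first_idx[left] <= first_idx[right])
)
atob_vec = np.where(take_left, first_idx[left], first_idx[right])
```

The result is a NumPy array rather than a list; atob_vec.tolist() converts it if a list is required.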

1

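You can use a dictionary to build a list of the values of b, each associated with the index of its first occurrence. Then sort them so that a binary search (with bisect) can be used.

Using binary search on the sorted data, find the position of each value of a. This gives the next value in b that is greater than or equal to it.

The previous item in the sorted list gives the value and position of the next lower value.

Finally, pick the index of the previous or the next value, whichever is closer to each number in a: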

import random as rnd
from bisect import bisect_left
    
num = 10 # 0000

a = list(range(num))
b = [rnd.randint(0, num) for r in range(num)]


ub   = {n:-i for i,n in enumerate(reversed(b),1-len(b)) } # unique first pos
sb   = sorted(ub.items())                                 # allow bisect
ib   = ( bisect_left(sb,(n,0)) for n in a )               # index of >= val
lhb  = ( (sb[i-1],sb[min(i,len(sb)-1)]) for i in ib )     # low and high pairs
atob = [ i if (abs(lo-n),i)<(abs(hi-n),j) else j          # closest index
           for ((lo,i),(hi,j)),n in zip(lhb,a) ]

print(a)
print(b)
print(atob)

Output:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[6, 1, 7, 9, 4, 5, 7, 4, 6, 10]
[1, 1, 1, 4, 4, 5, 0, 2, 2, 3]

With num=100000 this takes 0.19 seconds.

The time complexity is O(NlogN).

- Alain T.
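
The steps described in this answer can also be spelled out as a small function, which may be easier to follow than the generator pipeline. This is a sketch of the same idea rather than the answer's own code; first_pos, values and nearest_index are illustrative names.

```python
from bisect import bisect_left
import random as rnd

num = 200
a = list(range(num))
b = [rnd.randint(0, num) for _ in range(num)]

# First-occurrence index of every distinct value in b.
first_pos = {}
for idx, value in enumerate(b):
    first_pos.setdefault(value, idx)

values = sorted(first_pos)  # distinct values of b, ascending

def nearest_index(n):
    """Index in b of the value closest to n (smallest index on ties)."""
    # first value >= n, clamped to the last element when n exceeds them all
    hi = min(bisect_left(values, n), len(values) - 1)
    lo = max(hi - 1, 0)  # value just below it, clamped to the first element
    candidates = [(abs(values[k] - n), first_pos[values[k]]) for k in (lo, hi)]
    return min(candidates)[1]  # smallest distance, then smallest index

atob_bisect = [nearest_index(n) for n in a]
```

Each lookup costs one binary search, so the total work stays at O(NlogN), dominated by the initial sort.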

0

In your case, I would suggest using Numba (a just-in-time compiler for Python) to speed up the numpy computation.

import numpy as np
import numba
from numba import jit
import random

num = 100000
a = np.array(range(num))
b = np.array([random.randint(0, num) for r in range(num)])

@jit(nopython=True)
def find_argmins(a, b):
    out = np.empty_like(a)  # allocating result array
    for i in range(a.shape[0]):
        out[i] = np.abs(np.subtract(b, a[i])).argmin()
    return out

In action:

find_argmins(a, b)
array([69772, 69772, 32964, ...,  7648, 92904,  4006])

Timing (on a virtual machine with only 2 GB of RAM):

%timeit find_argmins(a, b)
12 s ± 739 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

- RomanPerekhrest

a = np.arange(num); b = np.random.randint(0, num, num) - Guimoute

- smp55 · Accepted Answer

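Given that your first list is just an index, I can give a very fast, specific answer. If you combine the two lists into a 2D array and then sort by the second list, this puts the first list (the indices of the second list) in the order of the result you want: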

import numpy as np
import random as rnd

num = 100000

a = list(range(num))
b = [rnd.randint(0, num) for r in range(num)]

comb = np.array([a, b]).transpose()
atob = comb[comb[:, 1].argsort()][:,0]

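This takes about 0.08 seconds. Now the first item of atob is the index in b at which the first item of a appears, the second item of atob is the index in b of the second item of a, and so on.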
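
The sort-based result matches the question's output when b is a permutation of a (every value occurs exactly once), as the asker clarified in the comments; with duplicates in b the sorted order no longer lines up with the values of a. Under that permutation assumption, the following sketch shows that the sort yields the inverse permutation, and that the same result can be written without sorting. The names atob_sort and atob_scatter are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
num = 20
a = np.arange(num)
b = rng.permutation(num)  # assumption: every value of a occurs exactly once in b

# Sorting-based version: for a permutation, argsort is its inverse.
atob_sort = np.argsort(b)

# Scatter-based version: write each position i into slot b[i]; no sort needed.
atob_scatter = np.empty(num, dtype=np.intp)
atob_scatter[b] = np.arange(num)

print(np.array_equal(atob_sort, atob_scatter))
```

The scatter version does a single pass over b, so it avoids the O(NlogN) cost of sorting.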