How to efficiently extract elements from a dictionary of lists?

Question

4

Here is my starting dictionary:

dic = {'key1': [2,3],
       'key2': [5,1],
       'key3': [6,8]}

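Note: in the example below I use simple numbers such as 2 and 3 for illustration (in my actual data, the lists contain DataFrames).

For each key, I want to extract the first element and get the following result: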

dic2 = {'key1': 2,
        'key2': 5,
        'key3': 6}

Is it possible to do this without a slow for loop? The dictionary is quite large...

Thanks a lot for your help.

- plonfat

3

Why do you think a for loop would be slow? What alternative do you have other than looping over the dictionary in some way?! - juanpa.arrivillaga

1

"The dictionary is quite large": how large? - juanpa.arrivillaga

1

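If you only expect to access a few keys of the dictionary after this conversion, you could write a function that returns just the first element and use it in place of calling dic2.get(...). - Selcuk

Or, indeed, a wrapper - Jiří Baum

@juanpa.arrivillaga About 12k keys, holding 12k DataFrames of statistics. - plonfat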

1

@plonfat That is small. Looping over 12k keys and manually creating a copy takes on the order of a millisecond. - juanpa.arrivillaga

5 Answers

4

A good approach is a dictionary comprehension:

{k: v[0] for k, v in dic.items()}

Or use operator.itemgetter:

>>> from operator import itemgetter
>>> dict(zip(dic, map(itemgetter(0), dic.values())))
{'key1': 2, 'key2': 5, 'key3': 6}
>>>
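
For comparison, here is a sketch of the explicit loop that both one-liners stand in for. The names firsts_loop, firsts_comp and firsts_map are made up for this illustration; the point is only that all three produce the same dictionary.

```python
from operator import itemgetter

dic = {'key1': [2, 3], 'key2': [5, 1], 'key3': [6, 8]}

# Plain loop: one subscript lookup and one store per key
firsts_loop = {}
for k, v in dic.items():
    firsts_loop[k] = v[0]

# The two one-liners from this answer
firsts_comp = {k: v[0] for k, v in dic.items()}
firsts_map = dict(zip(dic, map(itemgetter(0), dic.values())))

# All three approaches agree
print(firsts_loop == firsts_comp == firsts_map)
```

Whichever form is used, every key still has to be visited once, so the choice is mostly about readability.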

- U13-Forward

Note that this still uses a for loop. - juanpa.arrivillaga

3

Right, this still essentially uses a for loop, just one level down, inside the interpreter. It won't be much faster than a regular loop. - juanpa.arrivillaga

2

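@plonfat They all take a similar amount of time, comparable to the equivalent for loop. List comprehensions and map are not meant as performance optimizations. - juanpa.arrivillaga

@juanpa.arrivillaga In my benchmark answer, the for loop even wins :-) - no comment

@don'ttalkjustcode No, for the n-element array test I used your for k, (dic2[k], *_), which is the most interesting approach; the one you mention is exactly the same as U12's first answer, so it performs identically. - Marco D.G.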

2

Personally, I would suggest a Cython-based library for this: cytoolz

pip3 install cytoolz

from cytoolz import valmap, first

dic = {'key1': [2,3],
       'key2': [5,1],
       'key3': [6,8]}


dic2 = valmap(first, dic)
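
If installing cytoolz is not an option, a pure-Python sketch of what valmap(first, dic) computes could look roughly like this. The helper names first_item and valmap_py are hypothetical stand-ins, not part of cytoolz, and this version gives up the compiled speed that motivates the library.

```python
def first_item(seq):
    """Return the first element of any iterable (raises StopIteration if it is empty)."""
    return next(iter(seq))

def valmap_py(func, d):
    """Apply func to every value of d, keeping the keys unchanged."""
    return {k: func(v) for k, v in d.items()}

dic = {'key1': [2, 3], 'key2': [5, 1], 'key3': [6, 8]}
dic2 = valmap_py(first_item, dic)
print(dic2)
```

Using next(iter(...)) instead of v[0] means the values only need to be iterable, not indexable.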

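Which solution is the best?

I am extending @don'ttalkjustcode's benchmark with my function; I am still trying to figure out how to test @Jiří Baum's code.

General case (arrays of n elements): @U12-Forward's solution 1

Using @don'ttalkjustcode's generalized code for k, (dic2[k], *_)

3-element dictionary: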

    457 ns      457 ns      467 ns  U12_Forward_1
    775 ns      775 ns      776 ns  U12_Forward_2
   1021 ns     1021 ns     1036 ns  user1740577
    430 ns      430 ns      432 ns  Marco_DG
    679 ns      679 ns      683 ns  dont_talk_just_code

12k-element dictionary:

 992967 ns   997872 ns   998554 ns  U12_Forward_1
1251728 ns  1254163 ns  1254897 ns  U12_Forward_2
1434998 ns  1436245 ns  1440789 ns  user1740577
1219357 ns  1219453 ns  1225301 ns  Marco_DG
2208451 ns  2213086 ns  2214531 ns  dont_talk_just_code

Special case (2-element arrays): @donttalkjustcode's solution

3-element dictionary:

    422 ns      422 ns      422 ns  marco_dg
    462 ns      462 ns      462 ns  U12_Forward_1
    765 ns      766 ns      769 ns  U12_Forward_2
   1076 ns     1081 ns     1088 ns  user1740577
    341 ns      341 ns      341 ns  dont_talk_just_code

12k-element dictionary:

1206537 ns  1208705 ns  1211105 ns  marco_dg
1009374 ns  1011324 ns  1011989 ns  U12_Forward_1
1232356 ns  1232728 ns  1251990 ns  U12_Forward_2
1380953 ns  1382381 ns  1390140 ns  user1740577
 848863 ns   850010 ns   850450 ns  dont_talk_just_code

- Marco D.G.

2

Let's see how "slow" a for loop really is here. My solution:

dic2 = {}
for k, (dic2[k], _) in dic.items():
    pass

Benchmark with your toy dictionary:

    600 ns      602 ns      603 ns  U12_Forward_1
   1019 ns     1025 ns     1027 ns  U12_Forward_2
   1347 ns     1350 ns     1355 ns  user1740577
    441 ns      442 ns      443 ns  dont_talk_just_code

Benchmark with the "large" dictionary of 12k items mentioned in your comment:

1412624 ns  1414927 ns  1418089 ns  U12_Forward_1
1687464 ns  1690134 ns  1696759 ns  U12_Forward_2
1961205 ns  1986729 ns  2005884 ns  user1740577
1248901 ns  1260306 ns  1261295 ns  dont_talk_just_code

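The above was run on tio.run, which I use for its speed and stability. Unfortunately it does not offer cytoolz, which Marco's answer needs, so I could not include it. Then @user1740577 accused me of lying because I had not included it, so here are results from replit.com, where I can run it (note that if you run it there without a paid account, the timings will be slower).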

    475 ns      476 ns      484 ns  U12_Forward_1
    804 ns      805 ns      807 ns  U12_Forward_2
   1075 ns     1079 ns     1082 ns  user1740577
    442 ns      444 ns      448 ns  Marco_DG
    360 ns      360 ns      360 ns  dont_talk_just_code

1060461 ns  1061449 ns  1071588 ns  U12_Forward_1
1294079 ns  1330157 ns  1706065 ns  U12_Forward_2
1593082 ns  1594114 ns  1596703 ns  user1740577
1268663 ns  1274264 ns  1286715 ns  Marco_DG
 964445 ns   965971 ns   966333 ns  dont_talk_just_code

Full benchmark code (also available on replit):

from timeit import repeat
from functools import partial
from operator import itemgetter
from cytoolz import valmap , first

def U12_Forward_1(dic):
    return {k: v[0] for k, v in dic.items()}

def U12_Forward_2(dic):
    return dict(zip(dic, map(itemgetter(0), dic.values())))

def user1740577(dic):
    return dict(zip(dic.keys(),list(list(zip(*dic.values()))[0])))

def Marco_DG(dic):
    return valmap(first, dic)

def dont_talk_just_code(dic):
    dic2 = {}
    for k, (dic2[k], _) in dic.items():
        pass
    return dic2

funcs = U12_Forward_1, U12_Forward_2, user1740577, Marco_DG, dont_talk_just_code

def bench(dic, number):
    expect = funcs[0](dic)
    for func in funcs:
        result = func(dic)
        print(result == expect, func.__name__)
    print()

    for _ in range(3):
        for func in funcs:
            ts = sorted(repeat(partial(func, dic), number=number))[:3]
            print(*('%7d ns ' % (t / number * 1e9) for t in ts), func.__name__)
        print()

bench({'key1': [2,3], 'key2': [5,1], 'key3': [6,8]}, 100000)
bench({f'key{i}': [i,42] for i in range(12000)}, 30)

- no comment

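Comments are not for extended discussion; this conversation has been moved to chat. - user229044

How does that loop work? Why does dic2[k] in that loop statement assign the first element to that key? Normally we would have to write dic2[k] = foo. - Karl Wilhelm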

1

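@KarlWilhelm That is simply how assignment works in Python. dic2[k] is one of the targets in the for loop's target list, so it gets assigned. From Python's point of view this is nothing unusual. It only looks unusual because most people take a detour through a variable like your foo. - no comment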
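
To make the explanation above concrete, here is a sketch of what the loop does on each iteration, written out with ordinary assignment statements. The starred variant handles lists of any length of at least one element. The names compact, expanded, general and longer are chosen for this illustration only.

```python
dic = {'key1': [2, 3], 'key2': [5, 1], 'key3': [6, 8]}

# Compact form: compact[k] is itself an assignment target of the for statement
compact = {}
for k, (compact[k], _) in dic.items():
    pass

# The same thing spelled out: each (key, value) pair is unpacked left to right,
# so k is bound before expanded[k] is assigned
expanded = {}
for item in dic.items():
    k, (expanded[k], _) = item

# Starred target: accepts lists of any length >= 1, not only length 2
longer = {'a': [1], 'b': [2, 3, 4]}
general = {}
for k, (general[k], *_) in longer.items():
    pass

print(compact, expanded, general)
```

The two-target form (dic2[k], _) raises ValueError if a list does not have exactly two elements, which is why the starred form is the more general one.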

1

You can try this:

dict(zip(dic.keys(),list(list(zip(*dic.values()))[0])))

Output:

{'key1': 2, 'key2': 5, 'key3': 6}
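
A step-by-step sketch of why this one-liner works: zip(*dic.values()) transposes the lists, so its first tuple holds the first element of every list. The intermediate names columns and first_column are illustrative. Note that this approach builds every column even though only the first is used, that zip stops at the shortest list, and that an empty dictionary would make columns[0] raise IndexError.

```python
dic = {'key1': [2, 3], 'key2': [5, 1], 'key3': [6, 8]}

# Transpose: rows [2, 3], [5, 1], [6, 8] become columns (2, 5, 6) and (3, 1, 8)
columns = list(zip(*dic.values()))
first_column = columns[0]

# Pair each key with the matching entry of the first column
dic2 = dict(zip(dic.keys(), first_column))

print(columns)
print(dic2)
```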

- I'mahdi

1

Great idea! Would this solution be faster than the ones proposed above? - plonfat

Content provided by Stack Overflow.

- Jiří Baum · Accepted Answer

If you expect to access only some of the dictionary's keys after this conversion, you could write a wrapper, something like:

class ViewFirst:
  def __init__(self, original):
    self.original = original
  def __getitem__(self, key):
    return self.original[key][0]
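
A short usage sketch, repeating the class so the snippet runs on its own. It assumes the toy dictionary from the question; the variable name view is illustrative.

```python
class ViewFirst:
    """Read-only view that returns the first element of each list on lookup."""
    def __init__(self, original):
        self.original = original
    def __getitem__(self, key):
        return self.original[key][0]

dic = {'key1': [2, 3], 'key2': [5, 1], 'key3': [6, 8]}
view = ViewFirst(dic)

# Looked up lazily: nothing is copied when the view is created
print(view['key1'])

# The view reflects later changes to the original dictionary
dic['key1'] = [10, 11]
print(view['key1'])
```

Because the wrapper only defines __getitem__, it does not support len(), iteration over keys, or .get(); those would have to be added if needed.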

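Another option is an implementation based on defaultdict; this lets you assign new values into the dictionary (for new or existing keys) while still retrieving the other values from the original dictionary:

import collections

class DictFromFirsts(collections.defaultdict):
  def __init__(self, original):
    super().__init__()  # no default_factory; __missing__ below handles lookups
    self.original = original
  def __missing__(self, key):
    # Fall back to the first element stored in the original dictionary
    return self.original[key][0]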
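
A usage sketch for this second class, repeated here with its import so the snippet is self-contained. The super().__init__() call and the variable name d are additions made for this illustration.

```python
import collections

class DictFromFirsts(collections.defaultdict):
    def __init__(self, original):
        super().__init__()
        self.original = original
    def __missing__(self, key):
        # Called by d[key] only when key is not stored in d itself
        return self.original[key][0]

dic = {'key1': [2, 3], 'key2': [5, 1], 'key3': [6, 8]}
d = DictFromFirsts(dic)

print(d['key1'])   # not stored in d, so __missing__ reads dic['key1'][0]

d['key1'] = 100    # override a key that exists in the original
d['extra'] = 7     # add a key the original never had
print(d['key1'], d['key2'], d['extra'])

# Caveat: only d[key] consults __missing__; membership tests and .get() do not
print('key2' in d, d.get('key2'))
```

Since __missing__ here returns the value without storing it, keys that were never assigned stay absent from d itself, which is what the last line demonstrates.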

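Edit: following the discussion in the comments, this is a special-purpose approach suited to specific situations. For general use, prefer the approaches in the other answers, such as U12-Forward's dictionary comprehension {k: v[0] for k, v in dic.items()}; it is clearer and simpler, which usually matters more.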