Why is inserting keys into a Python dict in order faster than inserting them out of order?

Question

I have been creating huge dicts (millions of entries), and I noticed that creating them with the keys in order is faster.

I suspect this has to do with collisions in the hash function, but can someone explain why this happens and whether it is consistent across Python versions?

Here is an artificial example:

import timeit
import random

def get_test_data(num, size):
    olist, ulist = [], []
    for _ in range(num):
        otest = [str(i) for i in range(size)]
        utest = list(otest)
        random.shuffle(utest)
        olist.append(otest)
        ulist.append(utest)
    return olist, ulist

NUM_TESTS = 20
# Precalculate the test data so we only measure dict creation time
ordered, unordered = get_test_data(NUM_TESTS, 1000000)

def test_ordered():
    dict((k, k) for k in ordered.pop())

def test_unordered():
    dict((k, k) for k in unordered.pop())

print "unordered: ",
print timeit.timeit("test_unordered()",
                    setup="from __main__ import test_unordered, test_ordered",
                    number=NUM_TESTS)
print "ordered: ",
print timeit.timeit("test_ordered()",
                    setup="from __main__ import test_unordered, test_ordered",
                    number=NUM_TESTS)

On my machine the output is consistently:

(X)$ python /tmp/test.py 
unordered:  8.60760807991
ordered:  5.1214389801

I am using Python 2.7.3 on Ubuntu Precise x86_64.

- barracel

Possibly related: Why is processing a sorted array faster than processing an unsorted array? - Ashwini Chaudhary

It may be related, but we should look at the C implementation of dict. - barracel

2 Answers

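Looking at the source code of Python's dict, you can see that consecutive strings or ints produce fewer collisions. That, combined with @skishore's comment about better cache locality, could be the answer. The relevant comment in the source reads: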

Major subtleties ahead: Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't: its most important hash functions (for strings and ints) are very regular in common cases:
>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
>>>
This isn't necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are "consecutive" strings. So this gives better-than-random behavior in common cases, and that's very desirable.
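
As a rough illustration of the quoted comment, the sketch below models only the first probe of a hash table: it takes the low-order bits of `hash(key)` as the initial slot and counts how many keys land on a slot that is already taken. The helper names (`initial_slots`, `count_initial_collisions`) and the table size are made up for this example, and the model ignores the rest of CPython's probing and resizing logic, so treat it as a simplification and not the real dict implementation. It is written for Python 3 but should also run on Python 2.7.

```python
import random

def initial_slots(keys, table_bits):
    """Map each key to its initial slot: the low-order bits of its hash."""
    mask = (1 << table_bits) - 1
    return [hash(k) & mask for k in keys]

def count_initial_collisions(keys, table_bits):
    """Count keys whose initial slot was already claimed by an earlier key."""
    seen = set()
    collisions = 0
    for slot in initial_slots(keys, table_bits):
        if slot in seen:
            collisions += 1
        else:
            seen.add(slot)
    return collisions

TABLE_BITS = 10                      # hypothetical table of 2**10 = 1024 slots
contiguous = list(range(600))        # hash(i) == i for small non-negative ints
rng = random.Random(0)               # fixed seed so the run is repeatable
scattered = [rng.randrange(10**9) for _ in range(600)]

contiguous_collisions = count_initial_collisions(contiguous, TABLE_BITS)
scattered_collisions = count_initial_collisions(scattered, TABLE_BITS)

print("contiguous ints:", contiguous_collisions)
print("scattered ints: ", scattered_collisions)
```

With these inputs the contiguous ints should produce no initial collisions, while the scattered ints should produce some. Note that string hashes are randomized per process by default in recent Python 3 versions, so the "consecutive strings" part of the quoted comment describes Python 2 behavior.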

- barracel

- disatisfieddinosaur · Accepted Answer

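I'm almost sure this is what is going on: when you first create `otest`, the strings are stored in order in memory. When you create `utest`, its entries point to the same string objects, except that those memory locations are now visited out of order, which hurts cache performance in the unordered test case.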

Here's the evidence. I replaced your `get_test_data` function with the following version:

def get_test_data(num, size):
    olist, ulist = [], []
    for _ in range(num):
        nums = range(size)
        random.shuffle(nums)
        utest = [str(i) for i in nums]
        otest = list(utest)
        otest.sort(key=lambda x: int(x))
        olist.append(otest)
        ulist.append(utest)
    return olist, ulist

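The idea is that I now construct the strings of `ulist` consecutively in memory, and then build `olist` by sorting those same strings with the appropriate key. On my machine, this reverses the running times of the two tests.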
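
To try the same experiment on Python 3, something like the following sketch could be used. It is an adaptation, not the code that was run for this answer: `build_ordered_first`, `time_dict_build`, and the reduced `SIZE` are assumptions made here, and `range()` has to be turned into a list before it can be shuffled. The sketch also checks that both key lists hold the very same string objects, which is the premise of the cache-locality explanation. No timings are claimed, since they depend on the machine and the interpreter version.

```python
import random
import timeit

def get_test_data(size, build_ordered_first, rng):
    """Return (ordered, unordered) key lists that share the same string objects."""
    if build_ordered_first:
        # Layout from the question: strings are created in key order.
        ordered = [str(i) for i in range(size)]
        unordered = list(ordered)      # shallow copy: same str objects
        rng.shuffle(unordered)
    else:
        # Layout from this answer: strings are created in shuffled order.
        nums = list(range(size))       # materialize before shuffling (Python 3)
        rng.shuffle(nums)
        unordered = [str(i) for i in nums]
        ordered = sorted(unordered, key=int)
    return ordered, unordered

def time_dict_build(keys):
    """Time one dict construction from the given keys, in seconds."""
    return timeit.timeit(lambda: dict((k, k) for k in keys), number=1)

SIZE = 100000  # assumption: smaller than the question's 1000000 to keep the run short
rng = random.Random(0)
results = {}
for build_ordered_first in (True, False):
    ordered, unordered = get_test_data(SIZE, build_ordered_first, rng)
    # Both lists reference the same objects; only the traversal order differs.
    shares_objects = {id(s) for s in ordered} == {id(s) for s in unordered}
    results[build_ordered_first] = {
        "shares_objects": shares_objects,
        "ordered_seconds": time_dict_build(ordered),
        "unordered_seconds": time_dict_build(unordered),
    }

for build_ordered_first, row in sorted(results.items(), reverse=True):
    print(build_ordered_first, row)
```

If the cache-locality explanation holds on a given setup, the faster list should flip between the two layouts; a single run per layout is noisy, so repeating the measurement is advisable before drawing conclusions.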