小的集合在内存中是如何存储的?

13

如果我们观察包含不到50k个元素的集合的调整大小行为:

>>> import sys
>>> s = set()
>>> seen = {}
>>> for i in range(50_000):
...     size = sys.getsizeof(s)
...     if size not in seen:
...         seen[size] = len(s)
...         print(f"{size=} {len(s)=}")
...     s.add(i)
... 
size=216 len(s)=0
size=728 len(s)=5
size=2264 len(s)=19
size=8408 len(s)=77
size=32984 len(s)=307
size=131288 len(s)=1229
size=524504 len(s)=4915
size=2097368 len(s)=19661

这种模式与当集合占据3/5时背景存储大小增加四倍的quadrupling of the backing storage size once the set is 3/5ths full相一致,再加上PySetObject的一些固定开销:

>>> for i in range(9, 22, 2):
...     print(2**i + 216)
... 
728
2264
8408
32984
131288
524504
2097368

即使对于更大的集合,类似的模式仍然存在,但是调整大小因子从四倍变为两倍。

小型集合的报告大小是异常值。sys.getsizeof 报告的大小为216字节,而不是344字节,即16 * 8 + 216(新创建的空集合的存储数组有8个可用插槽,直到第一次调整大小为32个插槽)。

我错过了什么?这些小集合如何存储,以便它们只使用216字节而不是344字节?

1个回答

7

Python中的set对象由以下C结构表示。

typedef struct {
    PyObject_HEAD

    Py_ssize_t fill;            /* Number of active and dummy entries*/
    Py_ssize_t used;            /* Number of active entries */

    /* The table contains mask + 1 slots, and that's a power of 2.
     * We store the mask instead of the size because the mask is more
     * frequently needed.
     */
    Py_ssize_t mask;

    /* The table points to a fixed-size smalltable for small tables
     * or to additional malloc'ed memory for bigger tables.
     * The table pointer is never NULL which saves us from repeated
     * runtime null-tests.
     */
    setentry *table;
    Py_hash_t hash;             /* Only used by frozenset objects */
    Py_ssize_t finger;          /* Search finger for pop() */

    setentry smalltable[PySet_MINSIZE];
    PyObject *weakreflist;      /* List of weak references */
} PySetObject;

现在记住,getsizeof() 调用对象的__sizeof__方法,并在对象由垃圾收集器管理时增加额外的垃圾收集器开销

好的,set实现了__sizeof__

static PyObject *
set_sizeof(PySetObject *so, PyObject *Py_UNUSED(ignored))
{
    Py_ssize_t res;

    res = _PyObject_SIZE(Py_TYPE(so));
    if (so->table != so->smalltable)
        res = res + (so->mask + 1) * sizeof(setentry);
    return PyLong_FromSsize_t(res);
}

现在让我们来检查一下这行代码。
res = _PyObject_SIZE(Py_TYPE(so));
_PyObject_SIZE只是一个宏,它会展开为(typeobj)->tp_basicsize
#define _PyObject_SIZE(typeobj) ( (typeobj)->tp_basicsize )

这段代码本质上是试图访问tp_basicsize槽位,以获取该类型实例的字节大小,而在set情况下,它只是sizeof(PySetObject)
PyTypeObject PySet_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "set",                              /* tp_name */
    sizeof(PySetObject),                /* tp_basicsize */
    0,                                  /* tp_itemsize */
    # Skipped rest of the code for brevity.

我已经对set_sizeof C函数进行了以下改动。
static PyObject *
set_sizeof(PySetObject *so, PyObject *Py_UNUSED(ignored))
{
    Py_ssize_t res;

    unsigned long py_object_head_size = sizeof(so->ob_base); // Because PyObject_HEAD expands to PyObject ob_base;
    unsigned long fill_size = sizeof(so->fill);
    unsigned long used_size = sizeof(so->used);
    unsigned long mask_size = sizeof(so->mask);
    unsigned long table_size = sizeof(so->table);
    unsigned long hash_size = sizeof(so->hash);
    unsigned long finger_size = sizeof(so->finger);
    unsigned long smalltable_size = sizeof(so->smalltable);
    unsigned long weakreflist_size = sizeof(so->weakreflist);
    int is_using_fixed_size_smalltables = so->table == so->smalltable;

    printf("| PySetObject Fields   | Size(bytes) |\n");
    printf("|------------------------------------|\n");
    printf("|    PyObject_HEAD     |     '%zu'    |\n", py_object_head_size);
    printf("|      fill            |      '%zu'    |\n", fill_size);
    printf("|      used            |      '%zu'    |\n", used_size);
    printf("|      mask            |      '%zu'    |\n", mask_size);
    printf("|      table           |      '%zu'    |\n", table_size);
    printf("|      hash            |      '%zu'    |\n", hash_size);
    printf("|      finger          |      '%zu'    |\n", finger_size);
    printf("|    smalltable        |    '%zu'    |\n", smalltable_size);
    printf("|    weakreflist       |      '%zu'    |\n", weakreflist_size);
    printf("-------------------------------------|\n");
    printf("|       Total          |    '%zu'    |\n", py_object_head_size+fill_size+used_size+mask_size+table_size+hash_size+finger_size+smalltable_size+weakreflist_size);
    printf("\n");
    printf("Total size of PySetObject '%zu' bytes\n", sizeof(PySetObject));
    printf("Has set resized: '%s'\n", is_using_fixed_size_smalltables ? "No": "Yes");
    if(!is_using_fixed_size_smalltables) {
        printf("Size of malloc'ed table: '%zu' bytes\n", (so->mask + 1) * sizeof(setentry));
    }

    res = _PyObject_SIZE(Py_TYPE(so));
    if (so->table != so->smalltable)
        res = res + (so->mask + 1) * sizeof(setentry);
    return PyLong_FromSsize_t(res);
}

编译和运行这些更改后给我

>>> import sys
>>>
>>> set_ = set()
>>> sys.getsizeof(set_)
| PySetObject Fields   | Size(bytes) |
|------------------------------------|
|    PyObject_HEAD     |     '16'    |
|      fill            |      '8'    |
|      used            |      '8'    |
|      mask            |      '8'    |
|      table           |      '8'    |
|      hash            |      '8'    |
|      finger          |      '8'    |
|    smalltable        |    '128'    |
|    weakreflist       |      '8'    |
-------------------------------------|
|       Total          |    '200'    |

Total size of PySetObject '200' bytes
Has set resized: 'No'
216
>>> set_.add(1)
>>> set_.add(2)
>>> set_.add(3)
>>> set_.add(4)
>>> set_.add(5)
>>> sys.getsizeof(set_)
| PySetObject Fields   | Size(bytes) |
|------------------------------------|
|    PyObject_HEAD     |     '16'    |
|      fill            |      '8'    |
|      used            |      '8'    |
|      mask            |      '8'    |
|      table           |      '8'    |
|      hash            |      '8'    |
|      finger          |      '8'    |
|    smalltable        |    '128'    |
|    weakreflist       |      '8'    |
-------------------------------------|
|       Total          |    '200'    |

Total size of PySetObject '200' bytes
Has set resized: 'Yes'
Size of malloc'ed table: '512' bytes
728

返回值是216/728字节,因为sys.getsize增加了16个字节的GC开销

但这里需要注意的重要事情是这一行。

|    smalltable        |    '128'    |

因为对于小表(在第一次调整大小之前),so->table只是一个引用固定大小(8)的so->smalltable(没有分配内存),所以sizeof(PySetObject)足够获取大小,因为它还包括存储大小(128(16(setentry的大小) * 8))。
当调整大小发生时会发生什么?它会构建一个全新的表(malloc'ed)并使用该表而不是 so->smalltables。这意味着已调整大小的集合还会携带128字节的死载荷(固定大小小表的大小),以及malloc'ed so->table的大小。
else {
        newtable = PyMem_NEW(setentry, newsize);
        if (newtable == NULL) {
            PyErr_NoMemory();
            return -1;
        }
    }

    /* Make the set empty, using the new table. */
    assert(newtable != oldtable);
    memset(newtable, 0, sizeof(setentry) * newsize);
    so->mask = newsize - 1;
    so->table = newtable;

有趣的是,如果smalltable的大小已经包含在_PyObject_SIZE(Py_TYPE(so))中,这是否意味着任何已调整大小的集合都会将小表计算两次,并且set_sizeof会超过PySet_MINSIZE * 16 = 128字节的大小?还是说已调整大小的集合只是携带着128字节的死重量,对于它们不再使用的小表而言? - wim
如果我正确理解set_table_resize中的代码(https://github.com/python/cpython/blob/v3.11.0/Objects/setobject.c#L285),那么它应该是后者。 - wim
@wim 我已经编辑了我的答案并添加了我的发现。 - Abdul Niyas P M
是的,我得出了同样的结论,但在这里看到PySetObject的大小分解真的很好。非常好的答案。 - wim

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接