Django queryset iterator for ordered querysets

Question

3

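I want to use a queryset iterator to iterate over a large data set. Django provides iterator() for this, but it hits the database on every iteration. I found the following code for iterating in chunks: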

    import gc

    def queryset_iterator(queryset, chunksize=1000):
        '''
        Iterate over a Django Queryset ordered by the primary key.

        This method loads a maximum of chunksize (default: 1000) rows in its
        memory at the same time, while Django normally would load all rows in
        its memory. Using the iterator() method only causes it to not preload
        all the classes.

        Note that the implementation of the iterator
        does not support ordered query sets.
        '''
        pk = 0
        last_pk = queryset.order_by('-pk').values_list('pk', flat=True).first()
        if last_pk is not None:
            queryset = queryset.order_by('pk')
            while pk < last_pk:
                for row in queryset.filter(pk__gt=pk)[:chunksize]:
                    pk = row.pk
                    yield row
                gc.collect()

This works for unordered querysets. Is there a solution or workaround for doing the same on an ordered queryset?

- Ashish Gupta

1

I think you should accept Igor's answer. - spinus

1 Answer

- Igor Kremin · Accepted Answer

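Here is my code, with ordering support.

By the way, the iterator you are using can loop forever when items in the queryset are modified while it runs (deleted or added, even a single item). For example, if the row holding last_pk is deleted mid-iteration, pk never reaches last_pk and the while loop keeps issuing queries that return nothing.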
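The hang can be reproduced without a database. The following is a minimal sketch, not the original code: `simulate_pk_loop` is a hypothetical helper that replays the question's `while pk < last_pk` loop over a plain list of primary keys, and the `max_rounds` guard is added only so the demonstration terminates.

```python
def simulate_pk_loop(pks, chunksize=2, delete_last=False, max_rounds=10):
    """Replay the question's pk-based loop over a plain list of pks.

    Returns (yielded_pks, finished). `finished` is False when the loop was
    still spinning after `max_rounds` passes.
    """
    table = sorted(pks)                      # stand-in for the database table
    yielded = []
    pk = 0
    last_pk = table[-1] if table else None   # computed once, up front
    if last_pk is None:
        return yielded, True
    rounds = 0
    while pk < last_pk:
        rounds += 1
        if rounds > max_rounds:
            return yielded, False            # would never exit without this guard
        # Equivalent of queryset.filter(pk__gt=pk)[:chunksize]
        chunk = [p for p in table if p > pk][:chunksize]
        for p in chunk:
            pk = p
            yielded.append(p)
        if delete_last and last_pk in table:
            table.remove(last_pk)            # another process deletes the last row
    return yielded, True


print(simulate_pk_loop([1, 2, 3, 4, 5]))
# ([1, 2, 3, 4, 5], True)
print(simulate_pk_loop([1, 2, 3, 4, 5], delete_last=True))
# ([1, 2, 3, 4], False)
```

In the second call the row with pk 5 disappears after the first chunk, so `pk` stops at 4 while `last_pk` stays 5, and every later pass fetches an empty chunk.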

The iterator below does not make the unnecessary query for last_pk.

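    import gc

    def queryset_iterator(queryset, chunksize=10000, key=None):
        # Accept a single field name or a list of names; default to the primary key.
        key = [key] if isinstance(key, str) else (key or ['pk'])
        counter = 0        # rows yielded so far
        count = chunksize  # rows in the last chunk; forces the first pass
        while count == chunksize:
            # counter is a multiple of chunksize here, so this equals counter.
            offset = counter - counter % chunksize
            count = 0
            # One LIMIT/OFFSET query per chunk, in the requested order.
            for item in queryset.all().order_by(*key)[offset:offset + chunksize]:
                count += 1
                yield item
            counter += count
            gc.collect()  # release the previous chunk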
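
To see which slices the iterator requests, it can be run against a stand-in object instead of a real database. This is only a sketch under stated assumptions: `FakeQuerySet` is a hypothetical list-backed class written for this illustration (it is not part of Django) and supports just the three operations the iterator uses: `all()`, `order_by()` and slicing.

```python
import gc
from operator import attrgetter
from types import SimpleNamespace


class FakeQuerySet:
    """Hypothetical list-backed stand-in for a Django QuerySet (not Django API)."""

    def __init__(self, rows, log=None):
        self.rows = list(rows)
        self.slices = log if log is not None else []  # shared by all clones

    def all(self):
        return FakeQuerySet(self.rows, self.slices)

    def order_by(self, *keys):
        rows = self.rows
        # Sort by the last key first; sorted() is stable, so the result is a
        # multi-key ordering. A leading "-" means descending, as in Django.
        for key in reversed(keys):
            rows = sorted(rows, key=attrgetter(key.lstrip("-")),
                          reverse=key.startswith("-"))
        return FakeQuerySet(rows, self.slices)

    def __getitem__(self, item):
        self.slices.append((item.start, item.stop))  # one "query" per slice
        return self.rows[item]


def queryset_iterator(queryset, chunksize=10000, key=None):
    # Same logic as the answer's iterator above.
    key = [key] if isinstance(key, str) else (key or ['pk'])
    counter = 0
    count = chunksize
    while count == chunksize:
        offset = counter - counter % chunksize
        count = 0
        for item in queryset.all().order_by(*key)[offset:offset + chunksize]:
            count += 1
            yield item
        counter += count
        gc.collect()


rows = [SimpleNamespace(pk=i, name=n) for i, n in enumerate("dbeac", start=1)]
qs = FakeQuerySet(rows)
names = [r.name for r in queryset_iterator(qs, chunksize=2, key="name")]
print(names)      # ['a', 'b', 'c', 'd', 'e']
print(qs.slices)  # [(0, 2), (2, 4), (4, 6)]
```

The loop stops as soon as a chunk comes back shorter than `chunksize`. Because each chunk is a fresh offset-based slice, rows inserted or deleted between chunks can shift the offsets, so a row may be skipped or seen twice; ordering by a non-unique field can have the same effect unless a unique field such as `pk` is added as a tie-breaker, for example `key=['name', 'pk']`.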