How do I call the roundrobin recipe on the groups returned by itertools.groupby()?

I'm looking for a more efficient and Pythonic way to use itertools' roundrobin recipe on the groups formed by itertools.groupby(). Specifically, I have a list of URLs (unsorted) and want to re-order them so that the result places the maximum "distance" (or diversification, perhaps) between each unique netloc (host), as given by the corresponding attribute of urllib.parse. A reproducible example is below. I'm currently using itertools.groupby() together with its roundrobin recipe, but because of the nature of groupby(), this seems to require forming an intermediate list from each group.

Sample data:
import itertools as it
import urllib.parse
from pprint import pprint  # pprint is used below but was not imported

bases = ('https://www.google.com', 'https://www.youtube.com',
         'https://docs.scipy.org', 'https://www.group.me')
urls = []
counts = (1, 5, 10, 15)
for c, b in zip(counts, bases):
    for i in range(c):
        urls.append(f'{b}/{i}')

pprint(urls)
# ['https://www.google.com/0',
#  'https://www.youtube.com/0',
#  'https://www.youtube.com/1',
#  'https://www.youtube.com/2',
#  'https://www.youtube.com/3',
#  'https://www.youtube.com/4',
#  'https://docs.scipy.org/0',
#  'https://docs.scipy.org/1',
#  'https://docs.scipy.org/2',
#  'https://docs.scipy.org/3',
#  'https://docs.scipy.org/4',
#  'https://docs.scipy.org/5',
#  'https://docs.scipy.org/6',
#  'https://docs.scipy.org/7',
#  'https://docs.scipy.org/8',
#  'https://docs.scipy.org/9',
#  'https://www.group.me/0',
#  'https://www.group.me/1',
#  'https://www.group.me/2',
#  'https://www.group.me/3',
#  'https://www.group.me/4',
#  'https://www.group.me/5',
#  'https://www.group.me/6',
#  'https://www.group.me/7',
#  'https://www.group.me/8',
#  'https://www.group.me/9',
#  'https://www.group.me/10',
#  'https://www.group.me/11',
#  'https://www.group.me/12',
#  'https://www.group.me/13',
#  'https://www.group.me/14']

Current solution (take 1 from each group, skipping a group once it is exhausted, until all groups raise StopIteration):

grp = it.groupby(sorted(urls), key=lambda u: urllib.parse.urlsplit(u).netloc)
shuffled = list(roundrobin(*(list(g) for _, g in grp)))
#                            ^^ Each group is otherwise lost because
#                               groupby() itself is an iterator

The intended output for the sample data is below:
['https://docs.scipy.org/0',
 'https://www.google.com/0',
 'https://www.group.me/0',
 'https://www.youtube.com/0',
 'https://docs.scipy.org/1',
 'https://www.group.me/1',
 'https://www.youtube.com/1',
 'https://docs.scipy.org/2',
 'https://www.group.me/10',
 'https://www.youtube.com/2',
 'https://docs.scipy.org/3',
 'https://www.group.me/11',
 'https://www.youtube.com/3',
 'https://docs.scipy.org/4',
 'https://www.group.me/12',
 'https://www.youtube.com/4',
 'https://docs.scipy.org/5',
 'https://www.group.me/13',
 'https://docs.scipy.org/6',
 'https://www.group.me/14',
 'https://docs.scipy.org/7',
 'https://www.group.me/2',
 'https://docs.scipy.org/8',
 'https://www.group.me/3',
 'https://docs.scipy.org/9',
 'https://www.group.me/4',
 'https://www.group.me/5',
 'https://www.group.me/6',
 'https://www.group.me/7',
 'https://www.group.me/8',
 'https://www.group.me/9']

What would be a more efficient way of handling this?

Perhaps a flexible alternative to groupby would help here, such as a defaultdict - pylang
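One reading of the defaultdict suggestion (a sketch; the name `interleave_by_host` is illustrative, not from the thread): group the URLs by netloc into lists first, which sidesteps groupby()'s single-pass restriction, then interleave the groups.

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlsplit

def interleave_by_host(urls):
    """Group URLs by netloc, then round-robin across the groups."""
    groups = defaultdict(list)
    for url in sorted(urls):
        groups[urlsplit(url).netloc].append(url)
    # zip_longest pads shorter groups with None; filter the padding out.
    return [url
            for row in zip_longest(*groups.values())
            for url in row
            if url is not None]
```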
1 Answer


Though the improvement is marginal, you can achieve the same effect with itertools.zip_longest and a little tweaking:

grp = it.groupby(sorted(urls), key=lambda u: urllib.parse.urlsplit(u).netloc)
#     ^^ groupby() returns a single-use iterator, so it must be re-created
#        here if it was already consumed above
shuffled = list(x for i in it.zip_longest(*(list(g) for _, g in grp)) for x in i if x)
# flattening the sublists and only returning the non-None values

The advantage is that you don't have to define the roundrobin recipe. The time savings are negligible, however (timed with n = 10000):
# 3.7466756048055094 # zip_longest
# 4.077965201903506  # roundrobin
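The exact benchmark isn't shown in the answer; a sketch of how the two approaches might be compared (with a smaller n than the quoted timings, and helper names of my own choosing):

```python
import itertools as it
import timeit
import urllib.parse
from itertools import cycle, islice

def roundrobin(*iterables):
    # itertools-docs recipe
    num_active = len(iterables)
    nexts = cycle(iter(i).__next__ for i in iterables)
    while num_active:
        try:
            for nxt in nexts:
                yield nxt()
        except StopIteration:
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))

def make_urls(n):
    bases = ('https://www.google.com', 'https://www.youtube.com')
    return [f'{b}/{i}' for b in bases for i in range(n)]

def via_roundrobin(urls):
    grp = it.groupby(sorted(urls), key=lambda u: urllib.parse.urlsplit(u).netloc)
    return list(roundrobin(*(list(g) for _, g in grp)))

def via_zip_longest(urls):
    grp = it.groupby(sorted(urls), key=lambda u: urllib.parse.urlsplit(u).netloc)
    return [x for row in it.zip_longest(*(list(g) for _, g in grp)) for x in row if x]

urls = make_urls(1000)
# The two variants should produce identical orderings
assert via_roundrobin(urls) == via_zip_longest(urls)
print(timeit.timeit(lambda: via_zip_longest(urls), number=10))
print(timeit.timeit(lambda: via_roundrobin(urls), number=10))
```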

I have a feeling there is another solution using collections.Counter or using sort(key=...) on the sorted list, but I haven't worked it out yet, and intuitively its time complexity might be worse than your implementation, since it would likely rely on more Python-level code rather than compiled modules. It's an interesting problem, though; I may revisit it later.
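One way the Counter/sort-key idea could be made concrete (my sketch, not a worked-out solution from the thread): tag each URL with its running index within its host group, then sort on (index, host). Sorting by occurrence number first reproduces the round-robin order.

```python
from collections import Counter
from urllib.parse import urlsplit

def spread_by_sort_key(urls):
    """Interleave hosts by sorting on (index-within-host-group, host)."""
    seen = Counter()
    keyed = []
    for url in sorted(urls):
        host = urlsplit(url).netloc
        # seen[host] is this URL's 0-based position within its host group
        keyed.append((seen[host], host, url))
        seen[host] += 1
    return [url for _, _, url in sorted(keyed)]
```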

From a time-complexity standpoint, this one is tough; it seems impossible to do in less than N^2 time. Likewise, I wonder whether some magic formula exists given the counts. - Brad Solomon

Content provided by Stack Overflow.