使用有限数量的颜色来高效地组合N个不同颜色元素

Question

使用有限数量的颜色来高效地组合N个不同颜色元素

pythonalgorithmcombinationscombinatorics

7

给定由C种颜色着色的N个元素集，如何找到长度为L的每个可能组合，其中最多包含M种颜色？

我尝试了使用itertools.combinations算法生成所有可能的组合，然后筛选出不满足最大颜色条件的组合。

from itertools import combinations as cb

def allowed_combinations(elements, combination_size=4, max_colors=3):

    colors = set([c for k, c in elements.items()])
    combinations = cb(elements, combination_size)
    for combination in combinations:
        colors = set([elements[element] for element in combination])
        if len(colors) > max_colors:
            continue
        yield combination


elements = dict()
elements['A'] = 'red'
elements['B'] = 'red'
elements['C'] = 'blue'
elements['D'] = 'blue'
elements['E'] = 'green'
elements['F'] = 'green'
elements['G'] = 'green'
elements['H'] = 'yellow'
elements['I'] = 'white'
elements['J'] = 'white'
elements['K'] = 'black'

combinations = allowed_combinations(elements)

for c in combinations:
    for element in c:
        print("%s-%s" % (element, elements[element]))
    print "\n"

输出结果如下：

A-red
C-blue
B-red
E-green


A-red
C-blue
B-red
D-blue


A-red
C-blue
B-red
G-green


A-red
C-blue
B-red
F-green

...

问题在于生成所有可能的组合可能会非常耗费计算资源。例如，在我的情况下，L通常为6，元素数量N约为50，因此它给出了Bin（50,6）= 15890700个可能的组合。如果允许组合中的颜色最大数量很小，则大多数组合都是“无用”的，因此它们在过滤步骤中被丢弃。我的直觉是应该将过滤步骤放在组合步骤内部/之前，以避免组合的爆炸式增长，但我不知道如何实现。

- alberto

3个回答

3

组合问题以易于陈述但可能难以解决而闻名。对于这个问题，我不会使用itertools，而是编写一个自定义生成器。例如，

def combs(elt2color, combination_size=4, max_colors=3):

    def inner(needed, index):
        if needed == 0:
            yield result
            return
        if n - index < needed:
            # not enough elements remain to reach
            # combination_size
            return
        # first all results that don't contain elts[index]
        for _ in inner(needed, index + 1):
            yield result
        # and then all results that do contain elts[index]
        needed -= 1
        elt = elts[index]
        color = elt2color[elt]
        color_added = color not in colors_seen
        colors_seen.add(color)
        if len(colors_seen) <= max_colors:
            result[needed] = elt
            for _ in inner(needed, index + 1):
                yield result
        if color_added:
            colors_seen.remove(color)

    elts = tuple(elt2color)
    n = len(elts)
    colors_seen = set()
    result = [None] * combination_size
    for _ in inner(combination_size, 0):
        yield tuple(result)

然后：

elt2color = dict([('A', 'red'), ('B', 'red'), ('C', 'blue'),
                  ('D', 'blue'), ('E', 'green'), ('F', 'green'),
                  ('G', 'green'), ('H', 'yellow'), ('I', 'white'),
                  ('J', 'white'), ('K', 'black')])
for c in combs(elt2color):
    for element in c:
        print("%s-%s" % (element, elements[element]))
    print "\n"

使用这种方法会产生与您的后处理代码相同的188组合，但是在内部会在组合超过 max_colors 颜色时放弃部分组合。由于无法更改 itertools 功能内部的操作方式，因此当您需要控制它时，需要编写自己的代码。

使用itertools

以下是另一种方法，首先生成所有恰好有1种颜色、然后是所有恰好有2种颜色等解决方案。 itertools 可以直接用于其中大部分操作，但在最低级别上仍需要自定义生成器。我发现这比完全自定义的生成器难以理解，但这可能对您更清晰：

def combs2(elt2color, combination_size=4, max_colors=3):
    from collections import defaultdict
    from itertools import combinations
    color2elts = defaultdict(list)
    for elt, color in elt2color.items():
        color2elts[color].append(elt)

    def at_least_one_from_each(iterables, n):
        if n < len(iterables):
            return # impossible
        if not n or not iterables:
            if not n and not iterables:
                yield ()
            return
        # Must have n - num_from_first >= len(iterables) - 1,
        # so num_from_first <= n - len(iterables) + 1
        for num_from_first in range(1, min(len(iterables[0]) + 1,
                                           n - len(iterables) + 2)):
            for from_first in combinations(iterables[0],
                                           num_from_first):
                for rest in at_least_one_from_each(iterables[1:],
                                             n - num_from_first):
                    yield from_first + rest

    for numcolors in range(1, max_colors + 1):
        for colors in combinations(color2elts, numcolors):
            # Now this gets tricky.  We need to pick
            # combination_size elements across all the colors, but
            # must pick at least one from each color.
            for elements in at_least_one_from_each(
                    [color2elts[color] for color in colors],
                    combination_size):
                yield elements

我没有计时，因为我不在意；-) 完全自定义生成器的单个 result 列表被重用来构建每个输出，这样可以减少动态内存周转率。第二种方法通过将多个级别的 from_first 和 rest 元组粘合在一起创建了大量的内存翻转-这在很大程度上是不可避免的，因为它使用 itertools 在每个级别生成 from_first 元组。

在内部，itertools 函数几乎总是以更类似于第一个代码示例的方式工作，出于同样的原因，尽可能地重用内部缓冲区。

还有一个

这更多的是为了说明一些微妙之处。如果我要在 C 中实现这个功能作为 itertools 函数，我会怎么做呢？所有的 itertools 函数都是首先在 Python 中进行原型设计的，但在一定程度上，它们被简化为使用小整数向量（不使用集合、字典、序列切片或粘合部分结果序列的“内部循环” - 在初始化后尽可能地坚持对脏简单本机 C 类型的 O(1) 最坏情况时间操作）。

在更高的层面上，这个 itertools 函数将接受任何可迭代对象作为其主要参数，并几乎一定保证按字典索引顺序返回组合。因此，下面是可以做到所有这些的代码。除了 iterable 参数之外，它还需要一个 elt2ec 映射，该映射将可迭代对象中的每个元素映射到其等效类（对于您来说，这些是命名颜色的字符串，但任何可用作字典键的对象都可以用作等效类）：

def combs3(iterable, elt2ec, k, maxec):
    # Generate all k-combinations from `iterable` spanning no
    # more than `maxec` equivalence classes.
    elts = tuple(iterable)
    n = len(elts)
    ec = [None] * n  # ec[i] is equiv class ordinal of elts[i]
    ec2j = {} # map equiv class to its ordinal
    for i, elt in enumerate(elts):
        thisec = elt2ec[elt]
        j = ec2j.get(thisec)
        if j is None:
            j = len(ec2j)
            ec2j[thisec] = j
        ec[i] = j
    countec = [0] * len(ec2j)
    del ec2j

    def inner(i, j, totalec):
        if i == k:
            yield result
            return
        for j in range(j, jbound[i]):
            thisec = ec[j]
            thiscount = countec[thisec]
            newtotalec = totalec + (thiscount == 0)
            if newtotalec <= maxec:
                countec[thisec] = thiscount + 1
                result[i] = j
                yield from inner(i+1, j+1, newtotalec)
                countec[thisec] = thiscount

    jbound = list(range(n-k+1, n+1))
    result = [None] * k
    for _ in inner(0, 0, 0):
         yield (elts[i] for i in result)

请注意，这是Python 3的代码。正如所宣传的那样，inner()中没有比用一个小整数索引向量更高级的东西。唯一剩下的事情是消除递归生成。虽然这很繁琐，但由于它在此处不会说明任何特别有趣的东西，因此我将忽略它。

无论如何，有趣的事情是计时。如评论中所述，计时结果受您使用的测试用例强烈影响。combs3()在某些情况下最快，但并不经常！它几乎总是比我的原始combs()快，但通常比我的combs2()或@GarethRees的可爱的constrained_combinations()慢。

那么，当combs3()已经优化到“几乎所有操作都是无意识的C级操作”时，为什么会发生这种情况呢？很容易理解！它仍然是用Python编写的。combs2()和constrained_combinations()使用C编码的itertools.combinations()来完成大部分工作，这使得它们之间存在明显差异。如果combs3()是用C编写的，那么它将超越它们。

当然，任何一个都可以比原始帖子中的allowed_combinations()运行得更快 - 但它也可能是最快的（例如，选择几乎所有输入，其中max_colors非常大以至于没有排除任何组合 - 然后allowed_combinations()浪费的努力很少，而所有这些其他方法都会增加很多额外的开销来“优化”永远不会发生的修剪）。

- Tim Peters

谢谢@Tim！我计时了一下：combs = 791纳秒；combs2 = 632纳秒；另一方面，@Gareth的解决方案需要大约14微秒。所以为了速度，我会牺牲清晰度 :) - alberto

@AlbertoLumbreras：你可以说明一下这些时间是从哪里来的吗？在我的时间测试中，Tim Peters的'combs2'和我的'constrained_combinations'花费了类似的时间。（请参见我的答案中的“性能”部分。） - Gareth Rees

我在ipython笔记本上使用了神奇的%%timeit，并在该单元格中执行了constrained_combinations(nodes, 4, 3)。但是，如果我运行你的测试（即使仍然在笔记本中），那么你是正确的，combs和constrained_combinations非常相似。 - alberto

1

时间掌握很棘手，因为不同算法的不同部分会根据输入而支配。例如，更大的“combination_size”会使“粘合”结果元组变得更加昂贵：总体而言，这需要基本上是“quadratic”的时间来完成“combination_size”。对于微小的3、4、5、6，这可能并不重要。“combs”不会受到这种影响（它重用一个长度为“combination_size”的单个“result”列表），但“combs”不像其他算法一样努力尽早切断注定失败的部分元组。因此，使用所有3个算法，并选择当前输入最快的算法 - LOL ;-) - Tim Peters

0

大致概述。

您总共有C种不同的颜色。对于每个k，1 <= k <= M，以Bin(C，k)种方式选择k种颜色（我在这里使用您的符号假设Bin意味着二项式系数）。

对于上述每个选择，收集所有具有所选颜色的元素。假设它给出P个不同的元素。然后从这些P个元素中以Bin(P，L)种不同的方式选择L个元素。

所有上述内容都要遵守明显的检查，例如M <= C，L <= P等。

这种方法的优点是它只会生成有效的组合，并且每个有效的组合将恰好生成一次。（编辑：正如评论中指出的那样，这不是真正的重复，可以生成组合）。

附：以下是上述算法的实现，修复了重复组合的问题：

from itertools import combinations


elts  = { 'A' : 'red', 'B' : 'red', 'C' : 'blue', 'D' : 'blue',
          'E': 'green', 'F' : 'green', 'G' : 'green', 'H' : 'yellow',
          'I' : 'white', 'J' : 'white', 'K' : 'black' }

def combs (elts, size = 4, max_colors = 3):
    # Count different colors
    colors = {}
    for e in elts.values():
        colors [e] = 1
    ncolors = len(colors)

    # for each different number of colors between 1 and 'max_colors' 
    for k in range (1, max_colors + 1):
        # Choose 'k' different colors
        for selected_colors in combinations (colors, k):
            # Select ell the elements with these colors
            selected_elts = []
            for e, c in elts.items():
                if c in selected_colors:
                    selected_elts.append (e)
            # Choose 'size' of these elements
            for chosen_elts in combinations (selected_elts, size):
                # Check the chosen elements are of exactly 'k' different colors
                t = {}
                for e in chosen_elts:
                    t[elts[e]] = 1
                if len(t) == k:
                    yield chosen_elts


#for e in combs (elts):
#    print (e)

print (len (list (combs (elts))))

附注：我还用程序here计时了Tim的comb2、我的comb和Gareth的constrained_combinations，结果如下：

combs2 =  5.214529
constr combs = 5.290079
combs = 4.952063
combs2 = 5165700
constr combs = 5165700
combs = 5165700

- chill

3

唉，这比想象的更难。例如，当k=2时，在OP的例子中，{'white'，'black'}是一组包含2种颜色的集合。它对应于元素集合{'I'，'J'，'K'}。如果L==2，则该集合中只有3个2-组合中的2个实际上跨越两种不同的颜色。另一个组合-('I'，'J')-具有两个元素都具有颜色'white'。因此，它会重复为k=1提供的结果之一。 - Tim Peters

@TimPeters，确实如此。这可以通过接受具有恰好k种不同颜色的组合来进行纠正，这是不幸的，我的目标是避免“过滤”。 - chill

是的，我的第一次尝试编写代码也遇到了同样的问题;-) “自定义全部”似乎是解决这个问题的最佳方法。 - Tim Peters

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Gareth Rees · Accepted Answer

这里有一个比其他答案更简单的实现。基本方法如下：

选择一个到目前为止还没有被选过的值（在你的术语中是“颜色”）；
循环遍历i，与该值相关联的键（“元素”）的数量，这些键将包含在输出中；
循环遍历c，这些键的组合长度为i；
递归选择下一个值。

from collections import defaultdict, deque
from itertools import combinations

def constrained_combinations(elements, r, s):
    """Generate distinct combinations of 'r' keys from the dictionary
    'elements' using at most 's' different values. The values must be
    hashable.

        >>> from collections import OrderedDict
        >>> elements = OrderedDict(enumerate('aabbc'))
        >>> cc = constrained_combinations
        >>> list(cc(elements, 2, 1))
        [(0, 1), (2, 3)]
        >>> list(cc(elements, 3, 2))
        [(0, 1, 2), (0, 1, 3), (0, 1, 4), (0, 2, 3), (1, 2, 3), (2, 3, 4)]
        >>> list(cc(elements, 3, 3)) == list(combinations(range(5), 3))
        True
        >>> sum(1 for _ in cc(OrderedDict(enumerate('aabbcccdeef')), 4, 3))
        188

    """
    # 'value_keys' is a map from value to a list of keys associated
    # with that value; 'values' is a list of values in reverse order of
    # first appearance.
    value_keys = defaultdict(list)
    values = deque()
    for k, v in elements.items():
        if v not in value_keys:
            values.appendleft(v)
        value_keys[v].append(k)

    def helper(current, r, s):
        if r == 0:
            yield current
            return
        if s == 0 or not values:
            return
        value = values.pop()
        keys = value_keys[value]
        for i in range(min(r, len(keys)), -1, -1):
            for c in combinations(keys, i):
                for result in helper(current + c, r - i, s - min(i, 1)):
                    yield result
        values.append(value)

    return helper((), r, s)

注意事项

In Python 3.3 or later, you could use the yield from statement to simplify the recursive call:
```
yield from helper(current + c, r - i, s - min(i, 1))
```
If you're wondering why the doctests use collections.OrderedDict, it's so that the combinations can be returned in a predictable order, which is necessary for the tests to work.
The code reverses the list values, and iterates downwards over i so that if the caller passes in an OrderedDict, the combinations are returned in a sensible order (with values that appear early in the input also appearing early in the output).
Given the slight awkwardness in getting predictable output from this function, it would, I think, be worth considering changing the interface so that instead of taking a dictionary mapping keys to values, it would take an iterable of (key, value) pairs.

性能

这在速度上与 Tim Peter 的 combs2 大致相似：

>>> from timeit import timeit
>>> elements = dict(enumerate('abcde' * 10))
>>> test = lambda f:timeit(lambda:sum(1 for _ in f(elements, 6, 3)), number=1)
>>> test(combs2)
11.403807007009163
>>> test(constrained_combinations)
11.38378801709041