生成所有大小为n的多重集合划分的算法

Question

生成所有大小为n的多重集合划分的算法

6

我一直在尝试找到一种方法来生成一个多重集合中所有不同大小为n的分区，但到目前为止还没有任何结果。首先让我展示一下我想要实现的内容。

假设我们有一个输入向量，其中包含uint32_t类型的元素：

std::vector<uint32_t> input = {1, 1, 2, 2}

假设我们想创建所有不同的2个大小的分区。只有两个这样的分区，即：

[[1, 1], [2, 2]], [[1, 2], [1, 2]]

请注意顺序并不重要，也就是说以下所有的解决方案都是重复且不正确的。

Duplicate because order within a permutation group does not matter:
```
[[2, 1], [1, 2]]
```
Duplicate because order of groups does not matter:
```
[[2, 2], [1, 1]]
```

顺便说一句，这不是某种作业。我在工作中编写代码时遇到了这个问题，但现在只是出于个人兴趣，我想知道如何处理这个问题。与工作相关的问题的参数足够小，以至于生成几千个重复的解决方案并没有真正关系。

当前解决方案（生成重复项）

为了说明我不是没有尝试过提出解决方案而发问，让我尝试解释一下我的当前算法（当与multisets一起使用时会生成重复的解决方案）。

它的工作原理如下：状态具有一个位集，对于每个分区块都设置了n位为1。位集的长度为size(input) - n * index_block()，例如，如果输入向量有8个元素而n = 2，则第一个分区块使用一个8位位集，并且其中有2个位设置为1，下一个分区块使用一个6位位集，并且其中有2个位设置为1等等。

通过按顺序迭代每个位集并提取与当前位集中的1位位置相等的输入向量的元素来从这些位集创建分区。

为了生成下一个分区，我倒序迭代位集。计算下一个位集排列（使用Gosper的反向方法）。如果当前位集中第一个位没有设置（即未选择向量索引0），则该位集将被重置为其起始状态。强制要求第一个位始终设置可以在创建大小n的集合分区时防止生成重复项（如上面显示的第二类重复项）。如果当前位集等于其起始值，则针对以前（更长）的位集重复执行此步骤。

这对于集合非常有效（并且非常快速）。但是，当与multisets一起使用时，它会生成重复解决方案，因为它不知道两个元素在输入向量中出现多次。以下是一些示例输出：

std::vector<uint32_t> input = {1, 2, 3, 4};
printAllSolutions(myCurrentAlgo(input, 2));
=> [[2, 1], [4, 3]], [[3, 1], [4, 2]], [[4, 1], [3, 2]]

std::vector<uint32_t> input = {1, 1, 2, 2};
printAllSolutions(myCurrentAlgo(input, 2));
=> [[1, 1], [2, 2]], [[2, 1], [2, 1]], [[2, 1], [2, 1]]

最后那个（重复的）解决方案是由于算法不知道输入中有重复项而生成的，它在两个示例中生成完全相同的内部状态（即选择哪些索引）。

期望的解决方案

我想现在已经很清楚我想要的是什么了。为了完整起见，它应该如下所示：

std::vector<uint32_t> multiset = {1, 1, 2, 2};
MagicClass myGenerator(multiset, 2);
do {
  std::vector<std::vector<uint32_t> > nextSolution = myGenerator.getCurrent();
  std::cout << nextSolution << std::endl;
} while (myGenerator.calcNext());
=> [[1, 1], [2, 2]]
   [[1, 2], [1, 2]]

即代码将有点像std::next_permutation，它会告知已生成所有解并回到“第一个”解（对于您想使用的任何第一个定义，可能是按字典顺序，但不需要这样）。

我找到的最相关算法是Knuth的《计算机程序设计艺术》第4卷第1部分第7.2.1.5节（第430页）中的M算法。然而，该算法生成所有可能的多重集合划分。书中还有一个练习（7.2.1.5.69，p.778上的解决方案），说明如何修改Alg.M以仅生成最多具有r个分区的解。然而，这仍然允许不同大小的分区（例如，对于r = 2，[[1,2,2]，[1]]将是有效输出）。

有关如何处理此问题的任何想法/技巧/现有算法？请注意，解决方案应高效，即跟踪先前生成的所有解，确定当前生成的解是否为排列，如果是则跳过它，由于随着更多重复项的较长输入的解空间爆炸的速度，这是不可行的。

- AVH

你能否仅删除输入中的重复项？ - Jacob

当然，创建您需要的任何状态（最好是与输入大小线性相关的状态）。 - AVH

我相信我已经找到了一个高效的解决方案。我可能会有时间在本周晚些时候实现和测试它。请继续关注 :)。 - AVH

3个回答

1

这是我的铅笔和纸算法：

Describe the multiset in item quantities, e.g., {(1,2),(2,2)}

f(multiset,result):
  if the multiset is empty:
    return result
  otherwise:
    call f again with each unique distribution of one element added to result and 
    removed from the multiset state


Example:
{(1,2),(2,2),(3,2)} n = 2

11       -> 11 22    -> 11 22 33
            11 2  2  -> 11 23 23
1  1     -> 12 12    -> 12 12 33
            12 1  2  -> 12 13 23


Example:
{(1,2),(2,2),(3,2)} n = 3

11      -> 112 2   -> 112 233
           11  22  -> 113 223
1   1   -> 122 1   -> 122 133
           12  12  -> 123 123

让我们解决m69在下面评论的处理潜在重复分配的问题：

{A,B,B,C,C,D,D,D,D}

We've reached {A, , }{B, , }{B, , }, have 2 C's to distribute
and we'd like to avoid `ac  bc  b` generated along with `ac  b   bc`.

Because our generation in the level just above is ordered, the series of identical 
counts will be continuous. When a series of identical counts is encountered, make 
the assignment for the whole block of identical counts (rather than each one), 
and partition that contribution in descending parts; for example,

      | identical |
ac     b      b
ac     bc     b     // descending parts [1,0]

Example of longer block:

      |    identical block     |  descending parts
ac     bcccc  b      b      b    // [4,0,0,0] 
ac     bccc   bc     b      b    // [3,1,0,0]
ac     bcc    bcc    b      b    // [2,2,0,0]
...

- גלעד ברקן

我开始写了类似的东西，但发现了一些额外的困难。如果您有输入{A,B,B,C,C,D,D,D,D}，并且您已经开始创建部分解决方案，例如{A, , }{B, , }{B, , }，并且您想要为第三个元素C插入分配{1,1}，那么您必须区分相同的解决方案，例如{A,C, }{B,C, }{B, , }和{A,C, }{B, , }{B,C, }以及不同的解决方案，例如{A, , }{B,C, }{B,C, }，这证明并不那么简单。 - m69 ''snarky and unwelcoming''

@m69 说得好。假设你在分发过程中通过跟踪你正在分发的内容来控制它，你能举个例子，即使有这种控制（仅限于分发过程），也不足够吗？ - גלעד ברקן

@m69 它不能一次处理多个相同的块吗？ - גלעד ברקן

是的，我猜可以。您是否会在创建相同块时跟踪它们并将其状态存储在某个地方，还是每次向其添加内容时都检查块是否与上一个块相同？ - m69 ''snarky and unwelcoming''

@m69 没有想法，没有尝试编码（我的第一个想法是一种“takeWhile”，在分发轮回期间会测量块的长度；因此不是在生成重复项时）。 - גלעד ברקן

显示剩余3条评论

1

这里有一个可行的解决方案，利用了Hervé Brönnimann在N2639中提供的next_combination函数。注释应该让它变得非常容易理解。"herve/combinatorics.hpp"文件包含了N2639中列出的代码，位于herve命名空间内。它是C++11/14版本的，转换为旧版本应该很容易。请注意，我只是快速测试了解决方案。此外，我刚刚从一个基于类的实现中提取了它，所以可能会有一些额外的错误。一个快速的初始测试似乎证实它可以工作，但可能存在一些特殊情况它不能处理。

#include <cstdint>
#include <iterator>

#include "herve/combinatorics.hpp"

template <typename BidirIter>
bool next_combination_partition (BidirIter const & startIt,
  BidirIter const & endIt, uint32_t const groupSize) {
  // Typedefs
  using tDiff = typename std::iterator_traits<BidirIter>::difference_type;

  // Skip the last partition, because is consists of the remaining elements.
  // Thus if there's 2 groups or less, the start should be at position 0.
  tDiff const totalLength = std::distance(startIt, endIt);
  uint32_t const numTotalGroups = std::max(static_cast<uint32_t>((totalLength - 1) / groupSize + 1), 2u);
  uint32_t curBegin = (numTotalGroups - 2) * groupSize;
  uint32_t const lastGroupBegin = curBegin - 1;
  uint32_t curMid = curBegin + groupSize;
  bool atStart = (totalLength != 0);

  // Iterate over combinations from back of list to front. If a combination ends
  // up at its starting value, update the previous one as well.
  for (; (curMid != 0) && (atStart);
    curMid = curBegin, curBegin -= groupSize) {
    // To prevent duplicates, first element of each combination partition needs
    // to be fixed. So move start iterator to the next element. This is not true
    // for the starting (2nd to last) group though.
    uint32_t const startIndex = std::min(curBegin + 1, lastGroupBegin + 1);
    auto const iterStart = std::next(startIt, startIndex);
    auto const iterMid = std::next(startIt, curMid);
    atStart = !herve::next_combination(iterStart, iterMid, endIt);
  }

  return !atStart;
}

编辑下面是我快速拼凑的测试代码（“combopart.hpp”显然是包含上述函数的文件）。

#include "combopart.hpp"

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <vector>

int main (int argc, char* argv[]) {
  uint32_t const groupSize = 2;

  std::vector<uint32_t> v;
  v = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
  v = {0, 0, 0, 1, 1, 1, 2, 2, 2, 3};
  v = {1, 1, 2, 2};

  // Make sure contents are sorted
  std::sort(v.begin(), v.end());

  uint64_t count = 0;
  do {
    ++count;

    std::cout << "[ ";
    uint32_t elemCount = 0;
    for (auto it = v.begin(); it != v.end(); ++it) {
      std::cout << *it << " ";
      elemCount++;
      if ((elemCount % groupSize == 0) && (it != std::prev(v.end()))) {
        std::cout << "| ";
      }
    }
    std::cout << "]" << std::endl;
  } while (next_combination_partition(v.begin(), v.end(), groupSize));

  std::cout << std::endl << "# elements: " << v.size() << " - group size: " <<
    groupSize << " - # combination partitions: " << count << std::endl;

  return 0;
}

编辑2 改进算法。用条件移动（使用std::max）和将atStart布尔值设为false替换了早期退出分支。尚未经过测试，请注意。

编辑3 需要额外的修改，以免“修复”倒数第二个分区中的第一个元素。附加代码应该编译为条件移动，因此不应与之相关的分支成本。

P.S.：我知道由@Howard Hinnant生成组合的代码（可在https://howardhinnant.github.io/combinations.html上找到）比Hervé Brönnimann的代码快得多。但是，该代码无法处理输入中的重复项（因为就我所见，它甚至从未引用迭代器），而我的问题明确要求。另一方面，如果您确定输入不包含重复项，则绝对是您想要使用上述函数的代码。

- AVH

1

如果你只是需要将相同的元素分组在一起而对输入进行排序，你可能会对我关于这个问题提出的这个问题感兴趣：https://dev59.com/UFoU5IYBdhLWcg3wTV3E#37579211 - m69 ''snarky and unwelcoming''

在这种情况下，内容确实需要排序才能正确检测是否已输出了所有可能的组合/分区。不过问题很好！ - AVH

1

我已经编译并测试了这个解决方案。结果是错误的。该解决方案会产生具有不同对顺序的重复分区。例如，输入 {1, 1, 2, 2} 会产生三个分区，而不是正确的两个：[ 1 1 | 2 2 ]; [ 1 2 | 1 2 ]; [ 2 2 | 1 1 ]。 - vedg

1

该答案的第七个修订版本处理了一些测试用例，但在groupSize=2和 v={1,1,2,3,4,5}的情况下工作不正确。它返回了15个组合分区，而实际正确的是9个。当v={1,2,2,3,4,5}时返回12个分区；对于v={1,2,3,3,4,5} - 返回10个分区；对于重复的4和5，输出是正确的：9个分区。 - vedg

1

@m69的答案中的算法似乎工作正常，并且可以针对2（成对）的组大小进行优化。配对是我感兴趣的。我已经实现了一个类似于你的算法，但更简单和优化，因为它只处理成对而不是任意组大小。基于@m69答案的尚未完全优化的算法版本仅在您的算法正确工作的情况下（即当没有重复元素时）慢16倍。 - vedg

显示剩余2条评论

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- m69 ''snarky and unwelcoming'' · Accepted Answer

一个递归算法来逐个分配元素可以基于以下几个简单的规则：

首先，对不同的元素进行排序或计数；它们不必按任何特定顺序排列，只需将相同的元素组合在一起。 （这一步将简化后续步骤，但可以跳过。）

   {A,B,D,C,C,D,B,A,C} -> {A,A,B,B,D,D,C,C,C}

从空解决方案开始，逐个插入元素，使用以下规则：

（注意：本段内容中无需翻译）

   { , , } { , , } { , , }

在插入元素之前，先查找重复的块，例如：

   {A, , } { , , } { , , }  
                    ^dup^

   {A, , } {A, , } {A, , }  
            ^dup^   ^dup^

将该元素插入到每个有可用空间的非重复块中：

   partial solution: {A, , } {A, , } { , , }  
                              ^dup^

   insert element B: {A,B, } {A, , } { , , }  
                     {A, , } {A, , } {B, , }

如果已经存在相同的元素，则不要将新元素放在其前面：

   partial solution:  {A, , } {B, , } { , , }  
   insert another B:  {A,B, } {B, , } { , , }  <- ILLEGAL  
                      {A, , } {B,B, } { , , }  <- OK
                      {A, , } {B, , } {B, , }  <- OK

在插入一个已有N个相同元素的元素时，请确保在当前元素后留下N个空位：

   partial solution:  {A, , } {A, , } {B,B, }  
   insert first D:    {A,D, } {A, , } {B,B, }  <- OK  
                      {A, , } {A, , } {B,B,D}  <- ILLEGAL (NO SPACE FOR 2ND D)

最后一组相同的元素可以一次性插入：

   partial solution:  {A,A, } {B,B,D} {D, , }  
   insert C,C,C:      {A,A,C} {B,B,D} {D,C,C}

所以算法大致如下：

// PREPARATION  
Sort or group input.              // {A,B,D,C,C,D,B,A,C} -> {A,A,B,B,D,D,C,C,C}  
Create empty partial solution.    // { , , } { , , } { , , }  
Start recursion with empty partial solution and index at start of input.  

// RECURSION  
Receive partial solution, index, group size and last-used block.  
If group size is zero:  
    Find group size of identical elements in input, starting at index.  
    Set last-used block to first block.  
Find empty places in partial solution, starting at last-used block.  
If index is at last group in input:  
    Fill empty spaces with elements of last group.
    Store complete solution.
    Return from recursion.
Mark duplicate blocks in partial solution.  
For each block in partial solution, starting at last-used block:  
    If current block is not a duplicate, and has empty places,  
    and the places left in current and later blocks is not less than the group size:
        Insert element into copy of partial solution.
        Recurse with copy, index + 1, group size - 1, current block.

我测试了一个简单的JavaScript实现此算法，它给出了正确的输出。