按相似度排序

Question

按相似度排序

10

我有一系列的订单。

[a, b]
[a, b, c]
[a, b, c, d]
[a, b, c, d]
[b, c]
[c, d]

假设a、b、c和d是SKU，有很多装满它们的大箱子。同时有成千上万的订单和几百种可能的SKU。

现在想象一下，当打包这些订单时，如果一个订单缺少前一个订单的商品，您必须把该SKU的盒子放好（并类似地拿出没有的盒子）。

如何排序以使箱子更换次数最少？或者更加程序化的说：如何最小化累积汉明距离/最大化集合中相邻项之间的交集？

我真的不知道从哪里开始。这方面已经有算法了吗？是否有一个合理的近似值？

返回:

假设a、b、c和d是SKU，有很多装满它们的大箱子。同时有成千上万的订单和几百种可能的SKU。

现在想象一下，当打包这些订单时，如果一个订单缺少前一个订单的商品，您必须把该SKU的盒子放好（并类似地拿出没有的盒子）。

如何排序以使箱子更换次数最少？或者更加程序化的说：如何最小化累积汉明距离/最大化集合中相邻项之间的交集？

我真的不知道从哪里开始。这方面已经有算法了吗？是否有一个合理的近似值？

- Cory Nelson

我是否误解了你的问题，可以通过考虑一个MxN矩阵来描述，其中M是订单数量，N是SKU数量，ij处的项目描述了订单是否包含SKU？因此，您想重新排序矩阵行以最小化累积汉明距离。对我来说，这似乎是一个NP完全/ NP难问题（取决于您如何完全制定它），非常类似于McEliece和van Tilborg在论文“关于寻找好代码的固有难度”的描述中所描述的问题。是这种情况吗？ - mmgp

抱歉，这看起来更像是一个最小距离哈密顿路径，而不是TSP问题。 - irrelephant

没错，@mmgp，这就是我的问题所在。实际上我不需要完美的结果，所以看到一些近似值也能让我很开心。 - Cory Nelson

2个回答

1

我非常喜欢这个问题，以至于我忍不住编写了上面建议的算法。代码有点长，所以我将其放在单独的响应中。

它在示例中生成了这个序列。

Step 1: c d
Step 2: b c
Step 3: a b c
Step 4: a b c d
Step 5: a b c d
Step 6: a b

请注意，此算法忽略初始设置和最终拆除成本。它仅考虑设置之间的距离。这里的汉明距离为2 + 1 + 1 + 0 + 2 = 6。这与问题中给出的顺序的总距离相同。

#include <stdio.h>
#include <stdlib.h>

// With these data types we can have up to 64k items and 64k sets of items,
// But then the table of pairs is about 20Gb!
typedef unsigned short ITEM, INDEX;

// A sku set in the problem.
struct set {
  INDEX n_elts;
  ITEM *elts;
};

// A pair of sku sets and associated info.
struct pair {
  INDEX i, j;         // Indices of sets.
  ITEM dist;          // Hamming distance between sets.
  INDEX rank, parent; // Disjoint set union/find fields.
};

// For a given set, the adjacent ones along the path under construction.
struct adjacent {
  unsigned char n;  // 0, 1, or 2.
  INDEX elts[2];    // Indices of n adjacent sets.
};

// Some tracing functions for fun.
void print_pair(struct pair *pairs, int i)
{
  struct pair *p = pairs + i;
  printf("%d:(%d,%d@%d)[%d->%d]\n", i, p->i, p->j, p->dist, p->rank, p->parent);
}

void print_adjacent(struct adjacent *adjs, int i)
{
  struct adjacent *a = adjs + i;
  switch (a->n) {
    case 0: printf("%d:o", i); break;
    case 1: printf("%d:o->%d\n", i, a->elts[0]); break;
    default: printf("%d:%d<-o->%d\n", i, a->elts[0], a->elts[1]); break;
  }
}

// Compute the Hamming distance between two sets. Assumes elements are sorted.
// Works a bit like merging.
ITEM hamming_distance(struct set *a, struct set *b)
{
  int ia = 0, ib = 0;
  ITEM d = 0;   
  while (ia < a->n_elts && ib < b->n_elts) {
    if (a->elts[ia] < b->elts[ib]) {
      ++d;
      ++ia;
    }
    else if (a->elts[ia] > b->elts[ib]) {
      ++d;
      ++ib;
    }
    else {
      ++ia;
      ++ib;
    }
  }
  return d + (a->n_elts - ia) + (b->n_elts - ib);
}

// Classic disjoint set find operation.
INDEX find(struct pair *pairs, INDEX x)
{
  if (pairs[x].parent != x)
    pairs[x].parent = find(pairs, pairs[x].parent);
  return pairs[x].parent;
}

// Classic disjoint set union. Assumes x and y are canonical.
void do_union(struct pair *pairs, INDEX x, INDEX y)
{
  if (x == y) return;
  if (pairs[x].rank < pairs[y].rank)
    pairs[x].parent = y;
  else if (pairs[x].rank > pairs[y].rank)
    pairs[y].parent = x;
  else {
    pairs[y].parent = x;
    pairs[x].rank++;
  }
}

// Sort predicate to sort pairs by Hamming distance.
int by_dist(const void *va, const void *vb)
{
  const struct pair *a = va, *b = vb;
  return a->dist < b->dist ? -1 : a->dist > b->dist ? +1 : 0;
}

// Return a plan with greedily found least Hamming distance sum.
// Just an array of indices into the given table of sets.
// TODO: Deal with calloc/malloc failure!
INDEX *make_plan(struct set *sets, INDEX n_sets) 
{
  // Allocate enough space for all the pairs taking care for overflow. 
  // This grows as the square of n_sets!
  size_t n_pairs = (n_sets & 1) ? n_sets / 2 * n_sets : n_sets / 2 * (n_sets - 1);
  struct pair *pairs = calloc(n_pairs, sizeof(struct pair));

  // Initialize the pairs.
  int ip = 0;
  for (int j = 1; j < n_sets; j++) {
    for (int i = 0; i < j; i++) {
      struct pair *p = pairs + ip++;
      p->i = i;
      p->j = j;
      p->dist = hamming_distance(sets + i, sets + j);
    }
  }

  // Sort by Hamming distance.
  qsort(pairs, n_pairs, sizeof pairs[0], by_dist);

  // Initialize the disjoint sets.
  for (int i = 0; i < n_pairs; i++) {
    struct pair *p = pairs + i;
    p->rank = 0;
    p->parent = i;
  }

  // Greedily add pairs to the Hamiltonian path so long as they don't cause a non-path!
  ip = 0;
  struct adjacent *adjs = calloc(n_sets, sizeof(struct adjacent));
  for (int i = 0; i < n_pairs; i++) {
    struct pair *p = pairs + i;
    struct adjacent *ai = adjs + p->i, *aj = adjs + p->j; 

    // Continue if we'd get a vertex with degree 3 by adding this edge.
    if (ai->n == 2 || aj->n == 2) continue;

    // Find (possibly) disjoint sets of pair's elements.
    INDEX i_set = find(pairs, p->i);
    INDEX j_set = find(pairs, p->j);

    // Continue if we'd form a cycle by adding this edge.
    if (i_set == j_set) continue;

    // Otherwise add this edge.
    do_union(pairs, i_set, j_set);
    ai->elts[ai->n++] = p->j;
    aj->elts[aj->n++] = p->i;

    // Done after we've added enough pairs to touch all sets in a path.
    if (++ip == n_sets - 1) break;
  }

  // Find a set with only one adjacency, the path start.
  int p = -1;
  for (int i = 0; i < n_sets; ++i) 
    if (adjs[i].n == 1) {
      p = i;
      break;
    }

  // A plan will be an ordering of sets.
  INDEX *plan = malloc(n_sets * sizeof(INDEX));

  // Walk along the path to get the ordering.
  for (int i = 0; i < n_sets; i++) {
    plan[i] = p;
    struct adjacent *a = adjs + p;
    // This logic figures out which adjacency takes us forward.
    p = a->elts[ a->n > 1 && a->elts[1] != plan[i-1] ];
  }

  // Done with intermediate data structures.
  free(pairs);
  free(adjs);

  return plan;
}

// A tiny test case.  Much more testing needed!

#define ARRAY_SIZE(A) (sizeof A / sizeof A[0])
#define SET(Elts) { ARRAY_SIZE(Elts), Elts }

// Items must be in ascending order for Hamming distance calculation.
ITEM a1[] = { 'a', 'b' };
ITEM a2[] = { 'a', 'b', 'c' };
ITEM a3[] = { 'a', 'b', 'c', 'd' };
ITEM a4[] = { 'a', 'b', 'c', 'd' };
ITEM a5[] = { 'b', 'c' };
ITEM a6[] = { 'c', 'd' };

// Out of order to see how we do.
struct set sets[] = { SET(a3), SET(a6), SET(a1), SET(a4), SET(a5), SET(a2) };

int main(void)
{
  int n_sets = ARRAY_SIZE(sets);
  INDEX *plan = make_plan(sets, n_sets);

  for (int i = 0; i < n_sets; i++) {
    struct set *s = sets + plan[i];
    printf("Step %d: ", i+1);
    for (int j = 0; j < s->n_elts; j++) printf("%c ", (char)s->elts[j]);
    printf("\n");
  }
  return 0;
}

- Gene

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Gene · Accepted Answer

确实，@irrelephant是正确的。这是一个无向哈密顿路径问题。将其建模为完全无向图，其中节点是sku集合，每条边的权重是相应集合之间的汉明距离。然后找到一个装箱顺序等同于找到一个恰好经过每个节点一次的路径。这是一个哈密顿路径（HP）。你想要最小权重的HP。

坏消息是，寻找最小权重的HP是NP完全问题，这意味着通常情况下，找到最优解需要指数时间。

好消息是，有合理的近似算法。显然的贪心算法给出的答案不会比最优HP差两倍。它是：

create the graph of Hamming distances
sort the edges by weight in increasing order: e0, e1, ...
set C = emptyset
for e in sequence e0, e1, ...
   if C union {e} does not cause a cycle nor a vertex with degree more than 2 in C
      set C = C union {e}
return C

请注意，使用经典的不相交集合并查找算法和顶点中的关联边计数器，可以在几乎恒定的时间内实现if语句测试。因此，假设汉明距离的计算是恒定时间，这里的运行时间可以为O(n^2 log n)，其中n为sku集的数量。

如果您不熟悉图形，可以想象一个三角形表格，其中每对sku集都有一个条目。表格中的条目是汉明距离。您需要对表格条目进行排序，然后按排序顺序逐个添加sku集对到您的计划中，跳过会导致“分叉”或“循环”的对。分叉将是一组如(a,b),(b,c),(b,d)的对。循环将是(a,b),(b,c),(c,a)。

还有一些更复杂的多项式时间算法，可以获得3/2的近似值。