How to assign many subsets to their largest superset?

Question

6

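My dataset contains a very large number of sets (millions). Each set ranges in size from a few members to several tens of thousands of integers. Many of these sets are subsets of larger sets (and there are many such supersets). I am trying to assign each subset to its largest superset.

Can anyone recommend an algorithm for this kind of task? There are many algorithms for generating all possible subsets of a set, but given the size of my data that approach is prohibitively time-consuming (e.g. this paper or this SO question).

An example of my dataset: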

A {1, 2, 3}
B {1, 3}
C {2, 4}
D {2, 4, 9}
E {3, 5}
F {1, 2, 3, 7}

Expected answer: B and A are subsets of F (it does not matter that B is also a subset of A); C is a subset of D; E remains unassigned.
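To pin down the task, here is a brute-force baseline that reproduces the expected answer on the six example sets. It is only a sketch for checking faster approaches on small inputs: it compares every pair of sets, so it is far too slow for millions of sets. The function name `largest_supersets_bruteforce` and the dict-of-sets input format are assumptions made for illustration.

```python
def largest_supersets_bruteforce(named_sets):
    """Map each set name to the name of its largest strict superset, or None."""
    sets = {name: frozenset(members) for name, members in named_sets.items()}
    result = {}
    for name, members in sets.items():
        best = None
        for other, other_members in sets.items():
            # `<` on frozensets tests for a strict (proper) subset.
            if members < other_members:
                if best is None or len(other_members) > len(sets[best]):
                    best = other
        result[name] = best
    return result


example = {
    "A": {1, 2, 3},
    "B": {1, 3},
    "C": {2, 4},
    "D": {2, 4, 9},
    "E": {3, 5},
    "F": {1, 2, 3, 7},
}
print(largest_supersets_bruteforce(example))
# {'A': 'F', 'B': 'F', 'C': 'D', 'D': None, 'E': None, 'F': None}
```

A value of `None` marks a set that stays unassigned, such as E.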

- Dan

4 Answers

0

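So you have millions of sets, each with thousands of elements. Simply representing that dataset takes billions of integers. Comparing them will quickly reach trillions of operations without even breaking a sweat.

I will therefore assume that you need a solution that can be distributed across many machines. That means I will think in terms of MapReduce (https://en.wikipedia.org/wiki/MapReduce), or rather a series of MapReduce passes.

1. Read the sets in and map them to k:v pairs i: s, where i is an element of the set s.
2. Receive an integer key along with a collection of sets. Map them to pairs (s1, s2): i, where s1 <= s2 are both sets that contain i. Do not forget to pair each set with itself!
3. For each pair (s1, s2), count the size k of the intersection and send off the pairs s1: k and s2: k. (Send the second only if s1 and s2 are different.)
4. For each set s, receive the collection of its supersets. If s is maximal, send s: s. Otherwise, send t: s for every t that is a strict superset of s.
5. For each set s, receive the collection of its subsets, with s itself in the list only if s is maximal. If s is maximal, send t: s for every t that is a subset of s.
6. For each set, we receive the collection of maximal sets that it is a subset of. (There may be many.)

This takes many steps, but at its core it is repeated comparison among the sets that share each common element. The potential time complexity is O(n * n * m), where n is the number of sets and m is the number of distinct elements that occur in many sets.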
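The core of this pipeline, counting shared elements per pair of sets and comparing that count with the set sizes, can be sketched in a single process. This is not a distributed implementation and not necessarily how the steps above would be wired together in a real MapReduce job. The function names `subset_pairs` and `maximal_supersets` are hypothetical, and the sketch assumes that no two input sets are equal.

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def subset_pairs(named_sets):
    """Return all (s1, s2) name pairs where set s1 is a subset of a different set s2."""
    # Step 1: emit i: s for every element i of every set s, grouped by i.
    sets_by_element = defaultdict(list)
    for name, members in named_sets.items():
        for i in members:
            sets_by_element[i].append(name)

    # Steps 2-3: for each element, emit every pair of sets sharing it
    # (including each set paired with itself). The number of times a pair
    # is emitted equals the size of the intersection of the two sets.
    intersection_size = defaultdict(int)
    for names in sets_by_element.values():
        for pair in combinations_with_replacement(sorted(names), 2):
            intersection_size[pair] += 1

    # s1 is a subset of s2 exactly when |s1 & s2| == |s1|.
    pairs = set()
    for (s1, s2), k in intersection_size.items():
        if s1 == s2:
            continue
        if k == len(named_sets[s1]):
            pairs.add((s1, s2))
        if k == len(named_sets[s2]):
            pairs.add((s2, s1))
    return pairs


def maximal_supersets(named_sets):
    """Map each non-maximal set to the maximal sets containing it (steps 4-6)."""
    pairs = subset_pairs(named_sets)
    # Assumption: all input sets are distinct, so a set is maximal exactly
    # when it never appears on the subset side of a pair.
    maximal = set(named_sets) - {s1 for s1, _ in pairs}
    result = defaultdict(set)
    for s1, s2 in pairs:
        if s2 in maximal:
            result[s1].add(s2)
    return dict(result)


example = {
    "A": {1, 2, 3}, "B": {1, 3}, "C": {2, 4},
    "D": {2, 4, 9}, "E": {3, 5}, "F": {1, 2, 3, 7},
}
print(sorted(subset_pairs(example)))
# [('A', 'F'), ('B', 'A'), ('B', 'F'), ('C', 'D')]
print(maximal_supersets(example))
```

In a distributed setting each loop would become a map or reduce stage keyed the way the steps describe. The pair-counting loop is where the quadratic cost per shared element shows up.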

- btilly

0

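Here is a suggestion for a simple algorithm that, given your numbers (n = 10^6 to 10^7 sets with m = 2 to 10^5 members each, many supersets and subsets), may give better results. Of course this depends heavily on your data. In general its complexity is far worse than that of the other proposed algorithms. You could perhaps process only the sets with fewer than X (e.g. 1000) members this way and use the other proposed methods for the rest.

1. Sort the sets by size.
2. Remove the first (smallest) set and compare it against the other sets, starting from the back (largest set first).
3. Stop comparing as soon as a superset is found and record the relation. If no superset is found, just remove the set.
4. Repeat steps 2 and 3 for all sets except the last one.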
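The four steps can be sketched as follows. This is a minimal illustration, assuming the sets fit in memory as a dict from a name to a Python `set`; the function name `assign_by_size_scan` is made up for this example. Instead of physically removing sets, the sketch walks an index through the sorted order.

```python
def assign_by_size_scan(named_sets):
    """Map each set name to the first superset found when scanning from the largest set."""
    # Step 1: sort the set names by set size, smallest first.
    order = sorted(named_sets, key=lambda name: len(named_sets[name]))
    assignment = {}
    # Step 4: repeat for every set except the last (largest) one.
    for pos in range(len(order) - 1):
        name = order[pos]
        members = named_sets[name]
        # Step 2: compare against the remaining sets, largest first.
        for j in range(len(order) - 1, pos, -1):
            candidate = order[j]
            # Step 3: stop at the first superset and record the relation.
            if members <= named_sets[candidate]:
                assignment[name] = candidate
                break
    return assignment


example = {
    "A": {1, 2, 3}, "B": {1, 3}, "C": {2, 4},
    "D": {2, 4, 9}, "E": {3, 5}, "F": {1, 2, 3, 7},
}
print(assign_by_size_scan(example))
```

Because candidates are visited in descending size, the first superset found is also a largest one. Sets that never find a superset, such as E, are simply absent from the result.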

- maraca

-1

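If you are using Excel, you could structure it as follows:
1) Create a Cartesian grid as a two-way table, with all of the data sets appearing as headings down the side and across the top.
2) In a separate tab, create a row for each data set in the first column, plus a second column that counts the number of entries (e.g. F has 4). Then stack FIND(",") and MID formulas across the sheet to split out all of the entries in each data set. Use the counter in the second column to do a COUNTIF(">0"). Each value you find can serve as the starting point for the next lookup, until it runs out of values and returns blank.
3) Back in the Cartesian grid, pull the individual entries you just generated into your column headings (e.g. F is 1,2,3,7). Use an AND statement to check whether each entry in the left-hand column is in the data set in the top row, using OFFSET into your split-out area with the counter as the width of the OFFSET.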

- Duke

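Do you realize how large the problem is? - Richard Yan

Welcome to your spreadsheet with millions of columns and millions of rows. - donkopotamus

True, but the basic logic could still be used in R or SQL to do fundamentally the same thing (since he asked for an algorithm). But yes, with a dataset this large it definitely would not work in Excel. - Duke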

- jacobm · Accepted Answer

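Here is an idea that might work:

- Build a table that maps each number to a sorted list of the sets containing it, sorted first by size with the largest first, with ties in size broken arbitrarily but in some canonical order (say, alphabetically by set name). So in your example, you would have a table that maps 1 to [F, A, B], 2 to [F, A, D, C], 3 to [F, A, B, E], and so on. This can be done in O(n log n) time, where n is the total size of the input.
- For each set in the input:
  - Fetch the lists associated with each entry in that set. So for A, you would fetch the lists associated with 1, 2, and 3. The total number of lookups you issue is O(n), so the runtime so far is O(n log n + n), which is still O(n log n).
  - Now walk down all of the lists simultaneously. If a set is the first entry in all three lists, then it is the largest set that contains the input set. Output that association and move on to the next input set. If not, discard the smallest item, in sort order, among the heads of all the lists and try again. Implementing this is tricky, but you can store the heads of all the lists in a heap and get (if I remember correctly) an overall runtime of roughly O(n log k), where k is the maximum size of any single set, so you can bound it by O(n log n) in the worst case.
- So, if I have this right, the overall runtime of the algorithm is O(n log n), which seems likely to be about the best you can get for this problem.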
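As a quick check of the table described in the first step, the following sketch builds it for the six example sets from the question. The helper name `build_table` and the use of set names as the tie-breaker are assumptions taken from the "alphabetical by set name" suggestion; it only covers the table-building step, not the list walk.

```python
from collections import defaultdict


def build_table(named_sets):
    """Map each element to the names of the sets containing it, largest set first."""
    table = defaultdict(list)
    for name, members in named_sets.items():
        for element in members:
            table[element].append(name)
    for names in table.values():
        # Largest set first; ties in size are broken alphabetically by name.
        names.sort(key=lambda name: (-len(named_sets[name]), name))
    return dict(table)


example = {
    "A": {1, 2, 3}, "B": {1, 3}, "C": {2, 4},
    "D": {2, 4, 9}, "E": {3, 5}, "F": {1, 2, 3, 7},
}
table = build_table(example)
print(table[1])  # ['F', 'A', 'B']
print(table[2])  # ['F', 'A', 'D', 'C']
print(table[3])  # ['F', 'A', 'B', 'E']
```

Note that every set appears in the lists of all of its own elements, so the walk over the lists always terminates, at the latest at the input set itself.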

Here is a Python implementation of the algorithm:

from collections import defaultdict, deque
import heapq

def LargestSupersets(setlists):
  '''Computes, for each item in the input, the largest superset in the same input.

  setlists: A list of lists, each of which represents a set of items. Items must be
    hashable. Each list must be non-empty and must not contain duplicate items.
  '''
  # First, build a table that maps each element in any input setlist to a list of records
  # of the form (-size of setlist, index of setlist), one for each setlist that contains
  # the corresponding element
  element_to_entries = defaultdict(list)
  for idx, setlist in enumerate(setlists):
    entry = (-len(setlist), idx)  # cheesy way to make an entry that sorts properly -- largest first
    for element in setlist:
      element_to_entries[element].append(entry)

  # Within each entry, sort so that larger items come first, with ties broken arbitrarily by
  # the set's index
  for entries in element_to_entries.values():
    entries.sort()

  # Now build up the output by going over each setlist and walking over the entries list for
  # each element in the setlist. Since the entries list for each element is sorted largest to
  # smallest, the first entry we find that is in every entry set we pulled will be the largest
  # element of the input that contains each item in this setlist. We are guaranteed to eventually
  # find such an element because, at the very least, the item we're iterating on itself is in
  # each entries list.
  output = []
  for idx, setlist in enumerate(setlists):
    num_elements = len(setlist)
    buckets = [element_to_entries[element] for element in setlist]

    # We implement the search for an item that appears in every list by maintaining a heap and
    # a queue. We have the invariants that:
    #   1. The queue contains the n smallest items across all the buckets, in order
    #   2. The heap contains the smallest item from each bucket that has not already passed through
    #        the queue.
    smallest_entries_heap = []
    smallest_entries_deque = deque([], num_elements)
    for bucket_idx, bucket in enumerate(buckets):
      smallest_entries_heap.append((bucket[0], bucket_idx, 0))
    heapq.heapify(smallest_entries_heap)

    while (len(smallest_entries_deque) < num_elements or
           smallest_entries_deque[0] != smallest_entries_deque[num_elements - 1]):
      # First extract the next smallest entry in the queue ...
      (smallest_entry, bucket_idx, element_within_bucket_idx) = heapq.heappop(smallest_entries_heap)
      smallest_entries_deque.append(smallest_entry)

      # ... then add the next-smallest item from the bucket that we just removed an element from
      if element_within_bucket_idx + 1 < len(buckets[bucket_idx]):
        new_element = buckets[bucket_idx][element_within_bucket_idx + 1]
        heapq.heappush(smallest_entries_heap, (new_element, bucket_idx, element_within_bucket_idx + 1))

    output.append((idx, smallest_entries_deque[0][1]))

  return output

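Note: don't put too much trust in what I have written. I only just came up with this algorithm and have not proven it correct or anything.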