from collections import defaultdict, deque
import heapq


def LargestSupersets(setlists):
    '''Computes, for each item in the input, the largest superset in the same input.

    setlists: A list of lists, each of which represents a set of items. Items must be hashable.
    '''
    # First, build a table that maps each element in any input setlist to a list of records
    # of the form (-size of setlist, index of setlist), one for each setlist that contains
    # the corresponding element
    element_to_entries = defaultdict(list)
    for idx, setlist in enumerate(setlists):
        entry = (-len(setlist), idx)  # cheesy way to make an entry that sorts properly -- largest first
        for element in setlist:
            element_to_entries[element].append(entry)

    # Within each entry, sort so that larger items come first, with ties broken arbitrarily by
    # the set's index
    for entries in element_to_entries.values():
        entries.sort()

    # Now build up the output by going over each setlist and walking over the entries list for
    # each element in the setlist. Since the entries list for each element is sorted largest to
    # smallest, the first entry we find that is in every entry set we pulled will be the largest
    # element of the input that contains each item in this setlist. We are guaranteed to eventually
    # find such an element because, at the very least, the item we're iterating on itself is in
    # each entries list.
    output = []
    for idx, setlist in enumerate(setlists):
        num_elements = len(setlist)
        buckets = [element_to_entries[element] for element in setlist]

        # We implement the search for an item that appears in every list by maintaining a heap and
        # a queue. We have the invariants that:
        #   1. The queue contains the n smallest items across all the buckets, in order
        #   2. The heap contains the smallest item from each bucket that has not already passed
        #      through the queue.
        smallest_entries_heap = []
        smallest_entries_deque = deque([], num_elements)
        for bucket_idx, bucket in enumerate(buckets):
            smallest_entries_heap.append((bucket[0], bucket_idx, 0))
        heapq.heapify(smallest_entries_heap)

        while (len(smallest_entries_deque) < num_elements or
               smallest_entries_deque[0] != smallest_entries_deque[num_elements - 1]):
            # First extract the next smallest entry in the queue ...
            (smallest_entry, bucket_idx, element_within_bucket_idx) = heapq.heappop(smallest_entries_heap)
            smallest_entries_deque.append(smallest_entry)

            # ... then add the next-smallest item from the bucket that we just removed an element from
            if element_within_bucket_idx + 1 < len(buckets[bucket_idx]):
                new_element = buckets[bucket_idx][element_within_bucket_idx + 1]
                heapq.heappush(smallest_entries_heap, (new_element, bucket_idx, element_within_bucket_idx + 1))

        output.append((idx, smallest_entries_deque[0][1]))
    return output
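For intuition, here is a brute-force reference (a hypothetical helper of my own, not part of the answer above) that computes the same (index, largest-superset-index) pairs on small inputs, with the same tie-breaking (largest set first, then smallest index):

```python
def largest_superset_bruteforce(setlists):
    """O(n^2 * m) reference: for each input set, the index of the largest
    input set that contains it (every set contains itself, so an answer
    always exists)."""
    sets = [set(s) for s in setlists]
    output = []
    for idx, s in enumerate(sets):
        # Among all sets containing s, prefer the largest; break ties by index.
        best = min((-len(t), j) for j, t in enumerate(sets) if s <= t)
        output.append((idx, best[1]))
    return output

print(largest_superset_bruteforce([['a', 'b'], ['a', 'b', 'c'], ['c']]))
# → [(0, 1), (1, 1), (2, 1)]: ['a', 'b', 'c'] is the largest superset of each input set
```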
So you have millions of sets, each with thousands of elements. Just representing that data set takes billions of integers, and in the comparisons you'll quickly reach trillions of operations without even breaking a sweat.

Therefore I'll assume that you need a solution that can be distributed across many machines. Which means that I'll think in terms of https://en.wikipedia.org/wiki/MapReduce. A series of MapReduces.

1. For each element i, emit a record (s1, s2): i for every pair where s1 <= s2 are both sets that contain i. Don't forget to pair each set with itself!
2. For each pair (s1, s2), compute the size k of the intersection and send s1: (s2, k) and s2: (s1, k). (Only send the second when s1 and s2 differ.)
3. Each set s receives the set of its supersets. If it is maximal, send s: s. Otherwise, for each t that is a strict superset of s, send t: s.
4. Each set s receives the set of its subsets, and lists s only if it is maximal. If s is maximal, then for each t that is a subset of s, send t: s.

This takes a lot of steps, but at its heart it is repeated comparisons among the groups of sets that share a common element. The potential time complexity is O(n * n * m), where n is the number of sets and m is the number of distinct elements that appear in many sets.
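A MapReduce pipeline of this shape can be simulated on a single machine; the toy sketch below follows the pair-then-count idea (set ids, record shapes, and variable names are my own assumptions, and this is not a distributed implementation):

```python
from collections import defaultdict
from itertools import combinations_with_replacement

sets = {1: {'a', 'b'}, 2: {'a', 'b', 'c'}, 3: {'c'}}

# Stage 1: group sets by element, then emit (s1, s2): i for every pair of
# sets containing element i -- including each set paired with itself.
element_to_sets = defaultdict(list)
for s, members in sets.items():
    for i in members:
        element_to_sets[i].append(s)

# Stage 2: reducing over pairs counts the common elements, i.e. |s1 ∩ s2|.
pair_counts = defaultdict(int)
for i, containing in element_to_sets.items():
    for s1, s2 in combinations_with_replacement(sorted(containing), 2):
        pair_counts[(s1, s2)] += 1

# Stage 3: t is a superset of s exactly when |s ∩ t| == |s|.
supersets = defaultdict(set)
for (s1, s2), k in pair_counts.items():
    if k == len(sets[s1]):
        supersets[s1].add(s2)
    if k == len(sets[s2]):
        supersets[s2].add(s1)

print(dict(supersets))  # every set is its own superset; set 2 also covers 1 and 3
```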
Here's a suggestion for a simple algorithm that, depending on your numbers (n = 10^6 to 10^7 sets, each with m = 2 to 10^5 members, and lots of super-/subsets), might give better results. Of course this depends heavily on your data. In general, its worst-case complexity is much worse than that of the other proposed algorithms. Perhaps you could process only the sets with fewer than X (e.g. 1000) members this way and handle the rest with one of the other proposed approaches.

1. Sort the sets by size.
2. Remove the first (smallest) set and compare it against the others, starting from the back (largest sets first).
3. Stop comparing as soon as a superset is found and record the relation. If no superset is found, simply remove the set.
4. Repeat steps 2 and 3 for all sets except the last.
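These steps can be sketched in a few lines, assuming the goal is to record one (largest-first) superset per set; the function name is my own:

```python
def find_superset_by_size(sets):
    """Sort by size, then for each set scan the larger sets from largest to
    smallest and stop at the first superset found."""
    order = sorted(range(len(sets)), key=lambda i: len(sets[i]))  # step 1: smallest first
    relations = {}
    for pos, i in enumerate(order):             # step 4: repeat for each set
        s = set(sets[i])
        for j in reversed(order[pos + 1:]):     # step 2: compare from the back
            if s <= set(sets[j]):               # step 3: stop at the first superset
                relations[i] = j
                break
    return relations

print(find_superset_by_size([['a', 'b'], ['a', 'b', 'c'], ['c']]))
# → {2: 1, 0: 1}: both ['c'] and ['a', 'b'] point to ['a', 'b', 'c']
```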
A single comparison of two things of size m is O(m). Now repeat that n times to get all of your results at once, and your complexity goes way up. - btilly

You start with [F,A,D,C] and [D,C]; since they don't begin with the same element, you discard F from the first list and try again. Now you have [A,D,C] and [D,C]. You still don't have the same first element, so you discard A. Now you have [D,C] and [D,C]. They both have the same first element, D, so you return D, which is correct. - jacobm
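This walkthrough is the same heap-plus-deque merge that the answer's code performs on its sorted entry lists. A standalone sketch of the idea, here phrased for ascending sorted lists (the function name is my own):

```python
import heapq
from collections import deque

def first_common(lists):
    """Smallest item present in every sorted list; assumes such an item exists."""
    n = len(lists)
    heap = [(lst[0], k, 0) for k, lst in enumerate(lists)]
    heapq.heapify(heap)
    recent = deque([], n)  # the n smallest items popped so far, in order
    # Stop once the last n popped items are all equal: that value is in every list.
    while len(recent) < n or recent[0] != recent[-1]:
        item, k, pos = heapq.heappop(heap)
        recent.append(item)
        if pos + 1 < len(lists[k]):
            heapq.heappush(heap, (lists[k][pos + 1], k, pos + 1))
    return recent[0]

print(first_common([[1, 3, 4, 6], [3, 4], [2, 3, 5]]))  # → 3
```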