Python: simple list merging based on intersections

54
Consider there are some lists of integers as:

#--------------------------------------
0 [0,1,3]
1 [1,0,3,4,5,10,...]
2 [2,8]
3 [3,1,0,...]
...
n []
#--------------------------------------

The question is to merge lists having at least one element in common. So the results only for the given part will be as follows:

#--------------------------------------
0 [0,1,3,4,5,10,...]
2 [2,8]
#--------------------------------------

What is the most efficient way to do this on big data (elements are just numbers)? Is tree structure something to think about? I do the job now by converting lists to sets and iterating for intersections, but it is slow! Furthermore, I have a feeling that it is so elementary! In addition, the implementation lacks something (unknown) because some lists remain unmerged sometimes! Having said that, if you propose self-implementation, please be generous and provide a simple sample code [apparently Python is my favorite :)] or pseudo-code.
Update 1: Here is the code I was using:

#--------------------------------------
lsts = [[0,1,3],
        [1,0,3,4,5,10,11],
        [2,8],
        [3,1,0,16]]
#--------------------------------------

And the function is (buggy!!):

#--------------------------------------
def merge(lsts):
    sts = [set(l) for l in lsts]
    i = 0
    while i < len(sts):
        j = i+1
        while j < len(sts):
            if len(sts[i].intersection(sts[j])) > 0:
                sts[i] = sts[i].union(sts[j])
                sts.pop(j)
            else: j += 1                        #---corrected
        i += 1
    lst = [list(s) for s in sts]
    return lst
#--------------------------------------
The result is:
#--------------------------------------
>>> merge(lsts)
[[0, 1, 3, 4, 5, 10, 11, 16], [8, 2]]
#--------------------------------------

Update 2: To my experience the code given by Niklas Baumstark turned out to be a bit faster for the simple cases. The method given by "Hooked" is not yet tested, since it is a completely different approach (by the way, it seems interesting). The testing procedure for all of these could be really hard or impossible to be ensured of the results. The real data set I will use is so large and complex that it is impossible to trace any error just by repeating. That is, I need to be 100% satisfied of the reliability of the method before pushing it in its place within a large code as a module. Simply for now Niklas's method is faster and the answer for simple sets is correct of course.
However how can I be sure that it works well for the really large data set? Since I will not be able to trace the errors visually!
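One way to gain confidence on datasets too large to inspect by eye is to check properties of the output rather than the output itself: the merged sets must be pairwise disjoint, every original list must land entirely inside one merged set, and no element may appear or disappear. A minimal sketch of such a checker (my illustration, not code from any answer; `merged` is whatever your merge function returned):

```python
from itertools import chain

def check_merge(original, merged):
    """Property checks that hold for any correct merge implementation."""
    sets = [set(m) for m in merged]
    # 1. Results must be pairwise disjoint.
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            assert sets[i].isdisjoint(sets[j]), (i, j)
    # 2. Every non-empty input list must be fully contained in one result.
    for lst in original:
        assert not lst or any(set(lst) <= s for s in sets), lst
    # 3. No elements invented or lost.
    assert set(chain.from_iterable(original)) == set(chain.from_iterable(merged))

# Example with the question's data and its expected merge: passes silently.
lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
check_merge(lsts, [[0, 1, 3, 4, 5, 10, 11, 16], [8, 2]])
```

Such checks run in roughly quadratic time in the number of result sets, but since they only validate, they can be switched off once you trust the implementation.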

Update 3: Note that reliability of the method is much more important than speed for this problem. Ultimately I hope to be able to translate the Python code to Fortran for the maximum performance.

Update 4:
There are many interesting points in this post and the generously given answers and constructive comments. I would recommend reading all thoroughly. Please accept my appreciation for the development of the question, the amazing answers, and the constructive comments and discussion.


By "big data" do you mean many lists, or very long lists? Maybe some clever multithreading could buy you some time. - Rik Poggi
@RikPoggi Pretty much both: many lists, each of which could be very long. - Developer
The code sample seems incomplete, especially the while statements. - Janne Karila
@NiklasBaumstark, as I wrote earlier, it was complete from the very beginning, but there were strange difficulties with SO rendering. You could check by viewing the page source. Anyhow, I have replaced everything with an HTML version, which should now be fine for everyone. - Developer
Sounds like an algorithm where, given a graph's various blocks (e.g. each node and its edges), you need to sort out the connected subgraphs. - Chris Morgan
19 Answers

0

This is slower than the solution offered by Niklas (I got 3.9 s on test.txt instead of his 0.5 s), but yields the same result and might be easier to implement in e.g. Fortran, since it doesn't use sets, only sorting of the total amount of elements followed by a single run over all of them.

It returns a list with the ids of the merged lists, so it also keeps track of empty lists; they stay unmerged.

def merge(lsts):
    # this is an index list that stores the joined id for each list
    joined = list(range(len(lsts)))
    # create an ordered list with indices
    indexed_list = sorted((el, index) for index, lst in enumerate(lsts) for el in lst)
    # loop through the ordered list, and if two elements are the same and
    # the lists are not yet joined, alter the list with the joined id
    el_0, idx_0 = None, None
    for el, idx in indexed_list:
        if el == el_0 and joined[idx] != joined[idx_0]:
            old = joined[idx]
            rep = joined[idx_0]
            joined = [rep if k == old else k for k in joined]
        el_0, idx_0 = el, idx
    return joined
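For reference, the id list can be turned back into the merged lists themselves. A small sketch (my addition, not part of the original answer; the `joined` value shown is what the function above returns for the question's example data):

```python
from collections import defaultdict

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
joined = [0, 0, 2, 0]  # what merge(lsts) returns for this input

# gather all lists that share a joined id into one set each
groups = defaultdict(set)
for joined_id, lst in zip(joined, lsts):
    groups[joined_id].update(lst)

merged = [sorted(s) for s in groups.values()]
# merged == [[0, 1, 3, 4, 5, 10, 11, 16], [2, 8]]
```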

Thanks for sharing your idea. The nice point is the use of labels (i.e. ids), which can be used for further development. - Developer

0

I found @Niklas B.'s answer really helpful... but it took me a while to read through it and understand the logic. This is a re-write of exactly the same code with new variable names and more explanation... to help other beginners out there!

def mergeUntilOnlyDisjointSetsRemain(_listsOfLists):
    """Function for merging lists until there are only disjoint sets.

    Imagine this algorithm as if it were processing train cars filled with
    integers. It takes the first car of the train, separates it from the rest,
    and then compares the first car to each subsequent car.
    Start by renaming the first car to 'common'.
    If the two train cars have a common integer, you merge the two cars into
    common, and continue down the line until you reach the end of the train.
    Once you reach the end of the train, place the common car in results (which
    is essentially a collection of train cars that have already been compared
    to all other cars).
    You can exit the loop as soon as you get to the end of the train without
    merging any of the cars. This is controlled by the continueMerge variable.
    This variable is only reset to True after a merge operation.
    """

    # Start by creating a trainCar(i.e. a set) from each list in our listOfLists
    freightTrain = [set(trainCar) for trainCar in _listsOfLists if trainCar]

    # This continueMerge means that we have not yet compared all cars in the train.
    continueMerge = True
    while continueMerge:
        # Reset the while loop trigger.
        continueMerge = False
        # Create a fresh empty list of cars that have already been cross checked
        crossCheckedCars = []
        # While there are still elements in freightTrain
        while freightTrain:
            # Split the freightTrain up, into first car vs all the remaining cars
            commonFirstTrainCar = freightTrain[0]
            remainingCars = freightTrain[1:]
            # The freightTrain is now empty
            freightTrain = []
            # Iterate over all the remaining traincars
            for currentTrainCar in remainingCars:
                # If there are not any common integers with the first car...
                if currentTrainCar.isdisjoint(commonFirstTrainCar):
                    # Add each train car back onto the freightTrain
                    freightTrain.append(currentTrainCar)
                # But if they share a common integer...
                else:
                    # Trigger the reset switch to continueMerging cars
                    continueMerge = True
                    # and join the cars together
                    commonFirstTrainCar |= currentTrainCar
            # Once we have checked commonFirstTrainCar, remove it from the 
            # freightTrain and place it in crossCheckedCars
            crossCheckedCars.append(commonFirstTrainCar)

        # Now we have compared the first car to all subsequent cars
        # (... but we aren't finished because the 5th and 7th cars might have
        # had a common integer with each other... but only 1st and 5th cars
        # shared an integer the first time around... so the 1st and 5th cars
        # were merged, but the 7th car is still alone!)

        # Reset the system by creating a new freightTrain
        freightTrain = crossCheckedCars

    # Post-process the freight train to turn it into lists instead of sets
    listsForReturnTripHome = []
    for processedTraincar in freightTrain:
        listsForReturnTripHome.append(list(processedTraincar))
    return listsForReturnTripHome

0
lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

import networkx as nx
g = nx.Graph()

for sub_list in lists:
    for i in range(1, len(sub_list)):
        g.add_edge(sub_list[0], sub_list[i])

print(list(nx.connected_components(g)))
# [{1, 2, 3, 5, 6}, {8, 9, 10}, {11, 12, 13}]

Performance:

5000 lists, 5 classes, average size 74, max size 1000
15.2264976415

Performance of merge1:

print(timeit.timeit("merge1(lsts)", setup=setup, number=10))
5000 lists, 5 classes, average size 74, max size 1000
1.26998780571

So it's 11x slower than the fastest... but the code is much simpler and more readable!
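If pulling in networkx is not an option, the same connected-components idea can be sketched in plain Python: build the same star-shaped edges as the snippet above, then BFS over them (this is an illustration of the approach, not code from the thread):

```python
from collections import defaultdict, deque

def connected_components(lists):
    # adjacency: connect every element of a list to the list's first element,
    # mirroring the edges the networkx version builds
    adj = defaultdict(set)
    for sub in lists:
        if not sub:
            continue
        adj[sub[0]]  # make sure singleton lists show up as nodes
        for node in sub[1:]:
            adj[sub[0]].add(node)
            adj[node].add(sub[0])
    # breadth-first search to collect each connected component
    seen, comps = set(), []
    for start in list(adj):
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.add(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        comps.append(comp)
    return comps

lists = [[1, 2, 3], [3, 5, 6], [8, 9, 10], [11, 12, 13]]
comps = connected_components(lists)
# same components as the networkx version: {1,2,3,5,6}, {8,9,10}, {11,12,13}
```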

0

Firstly I'm not sure if the benchmarks are fair:

Adding the following code to the start of my function:

from collections import Counter
from itertools import chain

c = Counter(chain(*lists))
print(c[1])
# 88

This means that out of all the values in all the lists, there are only 88 distinct values. Usually in the real world duplicates are rare, and you would expect a lot more distinct values. (Of course I don't know where your data came from, so can't make assumptions.)

Because duplicates are more common, it means sets are less likely to be disjoint. This means the set.isdisjoint() method will be much faster because only after a few tests it will find that the sets aren't disjoint.

Having said all that, I do believe the methods using disjoint are the fastest anyway, but I'm just saying that instead of being 20x faster they should perhaps only be 10x faster than the other methods with different benchmarking.

Anyway, I thought I would try a slightly different technique to solve this; however the merge sorting was too slow: this method is about 20x slower than the two fastest methods in the benchmarking:

I thought I would order everything:


import heapq
from itertools import chain
def merge6(lists):
    for l in lists:
        l.sort()
    one_list = heapq.merge(*[zip(l, [i]*len(l)) for i, l in enumerate(lists)])  # iterating through one_list takes 25 seconds!!
    previous = next(one_list)

    d = {i: i for i in range(len(lists))}
    for current in one_list:
        if current[0] == previous[0]:
            d[current[1]] = d[previous[1]]
        previous = current

    groups = [[] for i in range(len(lists))]
    for k in d:
        groups[d[k]].append(lists[k])  # add each list to its group

    return [set(chain(*g)) for g in groups if g]  # since each subgroup in each g is sorted, it would be faster to merge these subgroups removing duplicates along the way


lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]
print(merge6(lists))
# [{1, 2, 3, 5, 6}, {8, 9, 10}, {11, 12, 13}]



import timeit
print(timeit.timeit("merge1(lsts)", setup=setup, number=10))
print(timeit.timeit("merge4(lsts)", setup=setup, number=10))
print(timeit.timeit("merge6(lsts)", setup=setup, number=10))
5000 lists, 5 classes, average size 74, max size 1000
1.26732238315
5000 lists, 5 classes, average size 74, max size 1000
1.16062907437
5000 lists, 5 classes, average size 74, max size 1000
30.7257182826

Your point is correct for the data used. The data could have any form of relations, e.g. anywhere from 0 to n joined lists. Statistically, however, only 10% to 30% of the data is valuable. - Developer

0

I needed this myself today and came across this SO thread and the duplicate one. First, big thanks to everybody who participated, especially @alexis, whose answer heavily influenced mine, and @Rik Poggi, whose GitHub repo was extremely useful for testing and benchmarking my version. This is the kind of discussion that makes SO invaluable!

A couple of notes:

  • I re-worked the repo to move 'deepcopy(lsts)' out of the execution and into the setup. Using timeit.repeat, we have exactly the same behavior and safety, but deepcopy doesn't impact the timings. This explains why the timings reported below are so low.
  • I added a profiler script to the repo, so if you add your function, you can easily profile it against any of the lists.
  • My fork of @Rik Poggi's repo is here.
  • As mentioned above, my version is a rework of @alexis's answer. We both derive from the union-find algorithm, but the main difference is that my version updates the parents of all nodes upon merge (`bin_ref` vs `locatebin`), which I think makes the code clearer.
  • This way of doing what we want is pretty hard to wrap one's head around (unless you're a mathematician, which I'm not), so I tried to make the code as readable as possible. I believe that in this case "readable" -> "maintainable".
  • My version doesn't make many assumptions about the input data. My requirement was list[tuple[Path]], so I made a version that doesn't care much about the data inside. As long as you pass an Iterable[Iterable[T]] where T is hashable, you'll be fine.
Regarding the results:
- My version seems to perform better on large datasets. It degrades a bit when the datasets get smaller (at worst 2.5x slower than the best one). For medium-sized datasets, prefer @Rik Poggi's version. For small ones, prefer @alexis's.
from itertools import chain
from typing import Iterable, TypeVar

T = TypeVar('T')


def merge_intersections(lists: Iterable[Iterable[T]]) -> list[set[T]]:
    """
    Merges lists based on whether or not they intersect. Returns a list
    of mutually disjoint sets.

    How it works:
    The general idea is to iterate over the sequences and merge them in
    their "intersection-bins". To do so, for each sequence, we look for
    items that were already encountered during previous iteration. If we
    don't find any, this means the current sequence might be disjoint
    from the rest, so we put it in its own bin. If, on the other hand,
    we find previously encountered items, this means one or more bins
    already exist for these items. If more than one bin exists we merge
    them (updating the reference of all contained items). Finally we put
    the current sequence's items in the bin.

    Usage:
    ::
        >>> result = merge_intersections([[1, 2], [2, 4], [7, 8], [3, 2], [4, 5]])
        >>> sorted(sorted(s) for s in result)
        [[1, 2, 3, 4, 5], [7, 8]]
    """
    bins: dict[T, set[T]] = dict()
    bin_refs: dict[T, T] = dict()
    for lst in lists:
        if not lst:
            continue

        # Gather the bin refs of all items in the list that we have
        # already seen.
        encountered_items_bin_refs = {
            bin_refs[item]
            for item in lst
            if item in bin_refs
        }
        if len(encountered_items_bin_refs) >= 1:
            # Some of the items in `lst` have already been seen in a
            # previous iteration. They are therefore already attached
            # to a bin. Select any of their corresponding bin ref.
            bin_ref = encountered_items_bin_refs.pop()
            # If the previously-seen items were not all attached to the
            # same bin, their respective bins need to be merged into
            # the selected one.
            if len(encountered_items_bin_refs) > 0:
                to_merge_bins = [bins.pop(ref) for ref in encountered_items_bin_refs]
                bins[bin_ref].update(chain(*to_merge_bins))
                for item in chain(*to_merge_bins):
                    bin_refs[item] = bin_ref
            bins[bin_ref].update(lst)
        else:
            # None of the items in `lst` have already been seen in a
            # previous iteration. Therefore, we can safely pick any
            # item as our new bin ref and create the corresponding bin.
            bin_ref = next(iter(lst))
            bins[bin_ref] = set(lst)
        for item in lst:
            bin_refs[item] = bin_ref
    return list(bins.values())

Benchmark results:
Timing with: >> Niklas << Benchmark
Info: 5000 lists, average size 305, max size 999
(from file: ./lists/timing_1.txt)
0.200 (1x) -- takeshi
0.328 (1.6x) -- alexis
0.473 (2.4x) -- agf
0.746 (3.7x) -- Rik. Poggi
0.776 (3.9x) -- Niks rewrite
0.792 (4x) -- Niklas B.
1.419 (7.1x) -- agf (optimized)
1.480 (7.4x) -- ChessMaster
1.579 (7.9x) -- katrielalex
1.627 (8.2x) -- steabert
1.938 (9.7x) -- robert king
9.273 (46x) -- Sven Marnach
33.779 (169x) -- hochl

Timing with: >> Niklas << Benchmark
Info: 4500 lists, average size 305, max size 999
(from file: ./lists/timing_2.txt)
0.158 (1x) -- takeshi
0.166 (1x) -- agf
0.231 (1.5x) -- Niks rewrite
0.232 (1.5x) -- Rik. Poggi
0.233 (1.5x) -- Niklas B.
0.268 (1.7x) -- alexis
0.408 (2.6x) -- ChessMaster
0.422 (2.7x) -- agf (optimized)
1.408 (8.9x) -- katrielalex
1.483 (9.4x) -- steabert
1.924 (12x) -- robert king
8.622 (55x) -- Sven Marnach
25.952 (164x) -- hochl

Timing with: >> Niklas << Benchmark
Info: 4500 lists, average size 98, max size 999
(from file: ./lists/timing_3.txt)
0.059 (1x) -- takeshi
0.068 (1.1x) -- agf
0.094 (1.6x) -- alexis
0.095 (1.6x) -- Rik. Poggi
0.095 (1.6x) -- Niklas B.
0.097 (1.6x) -- Niks rewrite
0.183 (3.1x) -- ChessMaster
0.192 (3.2x) -- agf (optimized)
0.425 (7.2x) -- katrielalex
0.586 (9.9x) -- robert king
0.787 (13x) -- steabert
2.827 (48x) -- Sven Marnach
9.162 (155x) -- hochl

Timing with: >> Sven << Benchmark
Info: 200 lists, average size 10, max size 10
0.000 (1x) -- alexis
0.001 (1.3x) -- ChessMaster
0.001 (1.8x) -- agf (optimized)
0.001 (1.9x) -- takeshi
0.002 (3.7x) -- steabert
0.002 (3.7x) -- robert king
0.002 (4.2x) -- katrielalex
0.002 (4.6x) -- Rik. Poggi
0.002 (4.8x) -- agf
0.002 (5.1x) -- Niklas B.
0.002 (5.5x) -- hochl
0.003 (6.1x) -- Niks rewrite
0.003 (7.7x) -- Sven Marnach

Timing with: >> Agf << Benchmark
Info: 2000 lists, average size 253, max size 500
0.030 (1x) -- Rik. Poggi
0.035 (1.2x) -- ChessMaster
0.036 (1.2x) -- agf
0.036 (1.2x) -- agf (optimized)
0.037 (1.2x) -- Niks rewrite
0.038 (1.3x) -- Niklas B.
0.060 (2x) -- hochl
0.074 (2.5x) -- takeshi
0.118 (3.9x) -- alexis
0.520 (17x) -- katrielalex
0.528 (18x) -- steabert
0.694 (23x) -- robert king
34.746 (1158x) -- Sven Marnach

Test results:
### Going to test: takeshi ###

test_coverage (__main__.MergeTestCase)
Check coverage original data ... ok
test_disjoint (__main__.MergeTestCase)
Check disjoint-ness of merged results ... ok
test_subset (__main__.MergeTestCase)
Check that every original data is a subset ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.008s

OK

-1

This can be solved in near-O(n) time using the union-find algorithm (near-linear: each operation costs an inverse-Ackermann factor with path compression). Given the first two rows of your data, the edges to use in the union-find are the following pairs:

(0,1),(1,3),(1,0),(0,3),(3,4),(4,5),(5,10)
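A minimal sketch of that idea (a path-compressing union-find of my own, not the answerer's code): union every element of a list with the list's first element, then group elements by their root.

```python
from collections import defaultdict

def merge_lists(lsts):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # walk to the root, halving the path as we go (path compression)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # connect every element of a list to the list's first element
    for lst in lsts:
        if not lst:
            continue
        find(lst[0])  # register singletons too
        for other in lst[1:]:
            union(lst[0], other)

    # group elements by their root
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())

result = merge_lists([[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]])
# two disjoint sets: {0, 1, 3, 4, 5, 10, 11, 16} and {2, 8}
```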

-1

My solution works well on small lists and is quite readable without dependencies.

def merge_list(starting_list):
    final_list = []
    for i,v in enumerate(starting_list[:-1]):
        if set(v)&set(starting_list[i+1]):
            starting_list[i+1].extend(list(set(v) - set(starting_list[i+1])))
        else:
            final_list.append(v)
    final_list.append(starting_list[-1])
    return final_list

Benchmarking it:

lst = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

%timeit merge_list(lst)

100000 loops, best of 3: 4.9 µs per loop


-1
from itertools import combinations


def merge(elements_list):
    d = {index: set(elements) for index, elements in enumerate(elements_list)}

    while any(not set.isdisjoint(d[i], d[j]) for i, j in combinations(d.keys(), 2)):
        merged = set()
        for i, j in combinations(d.keys(), 2):
            if not set.isdisjoint(d[i], d[j]):
                d[i] = set.union(d[i], d[j])
                merged.add(j)

        for k in merged:
            d.pop(k)    

    return [v for v in d.values() if v]

lst = [[65, 17, 5, 30, 79, 56, 48, 62],
       [6, 97, 32, 93, 55, 14, 70, 32],
       [75, 37, 83, 34, 9, 19, 14, 64],
       [43, 71],
       [],
       [89, 49, 1, 30, 28, 3, 63],
       [35, 21, 68, 94, 57, 94, 9, 3],
       [16],
       [29, 9, 97, 43],
       [17, 63, 24]]

print(merge(lst))

-1
Use a flag to ensure you get the final mutually exclusive results.
def merge(lists):
    # undefined `intersection`/`union` helpers replaced with set operations;
    # after every merge we restart the scan so the indices stay valid
    while True:
        flag = 0
        for i in range(len(lists)):
            for j in range(i + 1, len(lists)):
                if set(lists[i]) & set(lists[j]):
                    lists[i] = list(set(lists[i]) | set(lists[j]))
                    lists.pop(j)
                    flag += 1
                    break
            if flag:
                break
        if flag == 0:
            break
    return lists
