从一个列表的列表中创建一个包含集合的numpy数组

5

我正在尝试从一个列表中提取群组,该主列表包含不同长度的列表。 我想高效地将所有包含另一个子列表中至少一个值的子列表进行分组。例如,我希望得到以下结果:

[[2],[5],[5,8,16],[7,9,12],[9,20]]

被分组成这样

my_groups = {"group1":[2], "group2":[5,8,16], "group3":[7,9,12,20]}

我想做的方法是将子列表转换为集合,然后使用reduce intersect1d。
reduce(np.intersect1d, (SUPERLIST))

我希望能将结果放入一个字典中。但是我不知道如何在不遍历列表的情况下将其转化为一组集合的数组。

有没有一种方法可以做到这一点,或者我忽略了更有效率的方法?

编辑

如果没有numpy,我会像这样做:

my_dict = dict()
unique_id = 0
for sub_list_ref in super_list:
    sub_set_ref = set(sub_list_ref)
    for sub_list_comp in super_list:
        sub_set_comp = set(sub_list_comp)
        if len(sub_set_ref.intersection(sub_set_comp))>0:
            my_dict[unique_id] = sub_set_ref.union(sub_set_comp)
            updated_id = unique_id+1
    if updated_id == unique_id:
        my_dict[unique_id] = sub_list_ref
    else:
        unique_id = updated_id

编辑2

昨天DarryIG给出了一份非常好的答案,今天我成功地制作了一个更高效的版本。

def coalesce_groups(current_list, super_list, iterations):
    if iterations <= 0:
        print("YOU HAVE ITERATED MORE THAN 3 TIMES")

    tmp_list = current_list
    for element in current_list:
        tmp_list = tmp_list + super_list[element]
    # Take only the unique elements
    tmp_list = list(set(tmp_list))

    if tmp_list == list(set(current_list)):
        return tmp_list
    else:
        iterations-=1
        return coalesce_groups(tmp_list, super_list, iterations)

def dict_of_groups(original_list):
    lst = list(original_list).copy()
    result_list = []

    for it in lst:
        result = coalesce_groups(it, lst, iterations = 3)
        if len(result)!=0:
            result_list.append(result)
        for t in result:
            lst[t] = []
    result_dict = { x : result_list[x] for x in range(0, len(result_list) ) }
    return result_dict

在测试中(在jupyter笔记本上)

lst = [[0], [1], [2], [3], [4], [5], [16, 6], [8, 10, 18, 21, 27, 7], [10, 19, 21, 27, 40, 8], [13, 20, 22, 26, 28, 30, 34, 41, 42, 50, 9], [18, 21, 27, 10], [11], [12], [20, 22, 26, 28, 30, 34, 41, 42, 50, 13], [14], [15], [16], [25, 17], [21, 27, 40, 18], [21, 27, 40, 19], [22, 26, 28, 30, 34, 41, 42, 50, 20], [27, 40, 21], [26, 28, 30, 34, 41, 42, 50, 22], [23], [24], [25], [28, 30, 34, 41, 42, 50, 26], [40, 27], [30, 34, 41, 42, 50, 28], [29], [34, 41, 42, 50, 30], [33, 31], [32], [33], [41, 42, 50, 34], [35], [36], [37], [38], [39], [40], [42, 50, 41], [50, 42], [43], [44], [45], [46], [47], [49, 48], [49], [50]]
%lprun -T lprof0 -f generate_groups generate_groups(lst)
print(open('lprof0', 'r').read())
%lprun -T lprof1 -f dict_of_groups dict_of_groups(lst)
print(open('lprof1', 'r').read())

虽然其他回答也会被考虑,但我认为他的回答仍然有效且更全面。

DarryIG仍然是王者。


(Note: I kept the original HTML tags and translated the content into simplified Chinese.)

1
列表不能是n维的。numpy数组可以是n维的,但你没有展示任何。但考虑到你的例子,这看起来像是Python字典、集合和列表的问题。试图使用numpy只会让事情更加复杂。因此,首先展示一下你在没有numpy的情况下会怎么做。 - hpaulj
我修改了问题以回应您的评论。我知道我所写的内容存在一些边缘情况,但总体上它能够传达出想要表达的意思。 - Jorge Diaz
Python的集合操作基于哈希表,就像它的字典一样。NumPy的集合操作则使用排序。 - hpaulj
你的示例代码“使用numpy”产生了错误的结果,即:{0: {2},1: {8, 16, 5},2: {16, 5, 8},3: {20, 7, 9, 12},4: {9, 20}}。输出的子列表过多。 - DarrylG
为什么您想要将 set 对象放入 numpy.ndarray 对象中呢?这似乎毫无意义。 - juanpa.arrivillaga
1个回答

4
可以使用图中的并查集算法来高效地完成此操作(请参见https://www.geeksforgeeks.org/union-find-algorithm-set-2-union-by-rank/)。
我们将每个子列表视为图中的一个顶点。
如果它们的子列表重叠(即相交),则两个顶点是相连的。
并查集提供了一种高效的方法来查找所有不重叠顶点的不相交子集。
from collections import defaultdict 

# a structure to represent a graph 
class Graph: 

    def __init__(self, num_of_v): 
        self.num_of_v = num_of_v 
        self.edges = defaultdict(list) 

    # graph is represented as an  
    # array of edges 
    def add_edge(self, u, v): 
        self.edges[u].append(v) 


class Subset: 
    def __init__(self, parent, rank): 
        self.parent = parent 
        self.rank = rank 

    def __repr__(self):
        return {'name':self.parent, 'age':self.rank}

    def __str__(self):
        return 'Subset(parent='+str(self.parent)+', rank='+str(self.rank)+ ')'

# A utility function to find set of an element 
# node(uses path compression technique) 
def find(subsets, node): 
    if subsets[node].parent != node: 
        subsets[node].parent = find(subsets, subsets[node].parent) 
    return subsets[node].parent 

# A function that does union of two sets  
# of u and v(uses union by rank) 
def union(subsets, u, v): 

    # Attach smaller rank tree under root  
    # of high rank tree(Union by Rank) 
    if subsets[u].rank > subsets[v].rank: 
        subsets[v].parent = u 
    elif subsets[v].rank > subsets[u].rank: 
        subsets[u].parent = v 

    # If ranks are same, then make one as  
    # root and increment its rank by one 
    else: 
        subsets[v].parent = u 
        subsets[u].rank += 1

def find_disjoint_sets(graph): 

    # Allocate memory for creating sets 
    subsets = [] 

    for u in range(graph.num_of_v): 
        subsets.append(Subset(u, 0)) 

    # Iterate through all edges of graph,  
    # find sets of both vertices of every  
    # edge, if sets are same, then there 
    # is cycle in graph. 
    for u in graph.edges: 
        u_rep = find(subsets, u) 

        for v in graph.edges[u]: 
            v_rep = find(subsets, v) 

            if u_rep == v_rep: 
                continue
            else: 
                union(subsets, u_rep, v_rep) 

    return subsets

def generate_groups(lst):
  """ Finds disjoint sublists in lst.  Performs a union of sublists that intersect """
  # Generate graph
  g = Graph(len(lst))

  # Loop over all pairs of subists, 
  # Place an edge in the graph for sublists that intersect
  for i1, v1 in enumerate(lst):
    set_v1 = set(v1)
    for i2, v2 in enumerate(lst):
      if i2 > i1 and set_v1.intersection(v2):
        g.add_edge(i1, i2)

  # Disjoint subsets of sublists
  subsets = find_disjoint_sets(g)

  # Union of sublists which are non-disjoint (i.e. have the same parent)
  d = {}
  for i in range(len(lst)):
    sublist_index = find(subsets, i)
    if not sublist_index in d:
      d[sublist_index] = set()

    d[sublist_index] = d[sublist_index].union(lst[i])

  return d

# Test Code
lst = [[2],[5],[5,8,16],[7,9,12],[9,20]]

d = generate_groups(lst)
print(d)

输出

{0: {2}, 1: {8, 16, 5}, 3: {9, 12, 20, 7}}


非常感谢DarryIG。除非有人能够使用numpy回答问题,否则我将标记其为正确答案。 - Jorge Diaz
@JorgeDiaz--我正在寻找方法来提高Python中慢的for循环的性能。我会在此后回复你。 - DarrylG

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接