高效地删除元组列表中的部分重复项

9
我有一个元组列表,该列表的长度可以在8到1000之间变化,这取决于元组的长度。列表中的每个元组都是唯一的,并且每个元组都由N个单词组成。
一个示例元组可能具有长度N (Word 1, Word 2, Word 3, ..., Word N)
对于列表中的任何元组,其中的第j个元素将是''Word j
一个非常简化的例子是由字母组成的元组。
l = [('A', 'B', '', ''), ('A', 'B', 'C', ''), 
     ('', '', '', 'D'), ('A', '', '', 'D'), 
     ('', 'B', '', '')]

每个元组中的每个位置要么具有相同的值,要么为空。 我想要删除所有那些在另一个元组中,相同位置上所有非 '' 值都已存在的元组。 例如,(A,B,'','') 元组中所有非 '' 值都存在于 (A,B,C,'') 元组中,因此应将其删除。

filtered_l = [(A,B,C,''),(A,'','',D)]

元组的长度始终相同(不一定为4)。元组的长度将介于2-10之间。

如何以最快的方式完成此操作?


1
看起来你已经有了一些好的答案,但你可能也想研究一下集合操作。例如:https://stackoverflow.com/questions/52987903/tuple-is-subset-of-another-tuple-apriori-algortihm 或 https://dev59.com/wGUp5IYBdhLWcg3wo4qc - chris
@kepr 我想我已经将它简化成了一行代码(如下)。看看是否符合你的要求。 - pylang
6个回答

6

我们可以将每个元组概念化为一个二进制数组,其中1表示“包含某些内容”,2表示“包含空字符串”。由于在每个位置上的项目都是相同的,因此我们不需要关心每个位置上的具体内容,只需要知道是否有内容存在。

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
l_bin = [sum(2**i if k else 0 for i,k in enumerate(tup)) for tup in l]
# [3, 7, 8, 9, 2]
# [0b0011, 0b0111, 0b1000, 0b1001, 0b0010]
# that it's backwards doesn't really matter, since it's consistent

现在,我们可以遍历这个列表并构建一个没有“重复项”的新数据结构。由于我们的元组已被编码为二进制,因此我们可以通过执行按位运算来确定是否有重复项被“包含”在另一个元组中 - 给定ab,如果a | b == a,那么a必须包含b
codes = {}
for tup, b in zip(l, l_bin):
    # check if any existing code contains the potential new one
    # in this case, skip adding the new one
    if any(a | b == a for a in codes):
        continue
    # check if the new code contains a potential existing one or more
    # in which case, replace the existing code(s) with the new code
    for a in list(codes):
        if b | a == b:
            codes.pop(a)
    # and finally, add this code to our datastructure
    codes[b] = tup

现在我们可以撤回我们的“筛选过”的元组列表:
output = list(codes.values())
# [('A', 'B', 'C', ''), ('A', '', '', 'D')]

请注意,(A, B, C, '') 包含了 (A, B, '', '')('', B, '', ''),而 (A, '', '', D') 包含了 ('', '', '', D),因此这应该是正确的。
从 python 3.8 开始,dict 会保留插入顺序,因此输出应该与元组最初出现在列表中的顺序相同。
该解决方案可能不完全高效,因为代码数量可能会堆积,但它应该介于 O(n) 和 O(n^2) 之间,具体取决于最后剩余的唯一代码数量(由于每个元组的长度明显小于 l 的长度,因此它应该更接近于 O(n) 而不是 O(n^2)。

5

对于这个限制条件,显然的解决方案是将每个元组转换为位掩码,累积它们到一个计数器数组中,执行子集和变换,然后过滤数组l

详见代码注释的解释。

时间复杂度显然为n + m * 2^m,其中n是元组的数量,m是每个元组的长度。 对于n == 1000m == 10,这比n^2明显更快。

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
# assumes that l is not empty. (to access l[0])
# The case where l is empty is trivial to handle.

def tuple_to_mask(tuple_):
    # convert the information whether each value in (tuple_) is empty to a bit mask
    # (1 is empty, 0 is not empty)
    return sum((value == '') << index for index, value in enumerate(tuple_))


count = [0] * (1 << len(l[0]))
for tuple_ in l:
    # tuple_ is a tuple.
    count[tuple_to_mask(tuple_)] += 1

# now count[mask] is the number of tuples in l with that mask

# transform the count array.
for dimension in range(len(l[0])):
    for mask in range(len(count)):
        if mask >> dimension & 1:
            count[mask] += count[mask - (1 << dimension)]

# now count[mask] is the number of tuples in l with a mask (mask_) such that (mask) contains (mask_)
# (i.e. all the bits that are set in mask_ are also set in mask)


filtered_l = [tuple_ for tuple_ in l if count[tuple_to_mask(tuple_)] == 1]
print(filtered_l)

4

我不确定这是否是最有效或最符合Python习惯的方式,但这是一个直接的方法(也许其他人会使用更复杂的列表推导方法):

看一下这个:

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]

def item_in_list(item, l):
    for item2comp in l:
        if item!=item2comp:
            found = True
            for part,rhs_part in zip(item, item2comp):
                if part!='' and part!=rhs_part:
                    found = False
                    break
            if found:
                return True
    return False
            
                
            
new_arr = []
for item in l:
    if not item_in_list(item, l):
        new_arr.append(item)
print(new_arr)

输出:

[('A', 'B', 'C', ''), ('A', '', '', 'D')]

据我所见,时间复杂度为 - O((N**2)*M)。

N - 列表中元素的数量。

M - 每个元素中部分的数量。


4
L = [('A', 'B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
keys = collections.defaultdict(lambda: collections.defaultdict(set))

# maintain a record of tuple-indices that contain each character in each position
for i,t in enumerate(L):
    for c,e in enumerate(t):
        if not e: continue
        keys[e][c].add(i)

delme = set()
for i,t in enumerate(L):
    collocs = set.intersection(*[keys[e][c] for c,e in enumerate(t) if e])
    if len(collocs)>1:  # if all characters appear in this position in >1 index
        # ignore the collocation with the most non-empty characters
        # mark the rest for deletion
        C = max(collocs, key=lambda i: sum(bool(e) for bool in L[i]))
        for c in collocs:
            if c!=C: delme.add(c)

filtered = [t for i,t in enumerate(L) if i not in delme]

应该是 sum(bool(e) for e in L[i])) - dawg

4

这些字符串总是在同一个位置,因此我用布尔值替换它们,以便更容易地进行比较。首先进行排序,然后只保留元素,如果与所有其他元素相比,前一个元素无论在哪里都为 true 或与后续元素相同。然后完成比较后,将其从列表中删除。

f = sorted(map(lambda x: list(map(bool, x)), l), key=sum, reverse=True)

to_keep = []

while len(f) > 1:
    if all(map(lambda x, y: True if x == y or x else False, f[0], f[1])):
        to_keep.append(len(l) - len(f) + 1)
    f = f[1:]

print([l[i] for i in to_keep])

[('A', 'B', 'C', ''), ('A', '', '', 'D')]

在43.7微秒内,它的速度也是最高票答案的两倍。


这对我不起作用。例如:l = [(A,'',C),('',B,C),('','',C),(A,'',''),('','',''),(A,B,''),('',B,'')]您的解决方案仅返回[('',B,'')],它应该返回[(A,'',C),('',B,C),(A,B,'')]。 - kspr

1
考虑每个序列都是一个集合。现在我们只需丢弃所有子集。
给定:
import itertools as it


expected = {("A", "B", "C", ""), ("A", "", "", "D")}
data = [
    ("A", "B", "", ""),
    ("A", "B", "C", ""), 
    ("", "", "", "D"), 
    ("A", "", "", "D"), 
    ("", "B", "", "")
]

代码

一种将集合转换并比较的迭代解决方案。

def discard_subsets(pool: list) -> set:
    """Return a set without subsets."""
    discarded = set()

    for n, k in it.product(pool, repeat=2):                 # 1

        if set(k) < set(n)):                                # 2
            discarded.add(k)

    return set(pool) - discarded                            # 3

一种类似的单行解决方案。
set(data) - {k for n, k in it.product(data, repeat=2) if set(k) < set(n)}

演示

discard_subsets(data)
# {('A', '', '', 'D'), ('A', 'B', 'C', '')}

细节

后面的函数是有注释的,以帮助解释每个部分:

  1. 将所有元素相互比较(或使用嵌套循环)。
  2. 如果一个元素是一个真子集(见下文),则舍弃它。
  3. 从池中删除被舍弃的元素。

为什么要使用集合?

池中的每个元素都可以是一个集合,因为相关的子元素是唯一的,即 "A","B","C","D",""

集合具有成员属性。因此,例如,

("A", "B", "", "") 具有 ("A", "B", "C", "") 中的所有值

也可以表示为

集合 {"A", "B", "", ""} 是集合 {"A", "B", "C", ""} 的子集

现在只需要比较所有元素并拒绝所有 真子集

a, a_, ac = {"a"}, {"a"}, {"a", "c"}

# Subsets
assert a.issubset(a_)                                       
assert a <= a_
assert a <= ac

# Proper subsets
assert not a < a_
assert a < ac

复杂度

由于我们基本上有嵌套循环,最好的情况下我们得到O(n^2)的复杂度。这可能不是最有效的方法,但希望足够清晰易懂。

测试

f = discard_subsets
assert {("A", "B", "C", "")} == f([("A", "B", "", ""), ("A", "B", "C", "")])
assert {("A", "B", "C", "")} == f([("A", "B", "C", ""), ("A", "B", "", "")])
assert {("A", "B", "C", ""), ("", "", "", "D")} == f([("A", "B", "", ""), ("A", "B", "C", ""), ("", "", "", "D")])
assert {("A", "B", "C", ""), ("", "", "", "D")} == f([("", "", "", "D"), ("A", "B", "", ""), ("A", "B", "C", "")])
assert {("A", "B", "C", ""), ("", "", "", "D")} == f([("A", "B", "C", ""), ("", "", "", "D"), ("A", "B", "", "")])
assert {("A", "B", "C", ""), ("", "", "", "D")} == f([("A", "B", "C", ""), ("A", "B", "", ""), ("", "", "", "D")])
assert {("A","","C"), ("","B","C"), ("A","B","")} == f([("A","","C"),("","B","C"),("","","C"),("A","",""),("","",""),("A","B",""),("","B","")])
assert set(expected) == f(data)

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接