高效地删除元组列表中的部分重复项

Question

高效地删除元组列表中的部分重复项

pythonlistperformancetuples

9

我有一个元组列表，该列表的长度可以在8到1000之间变化，这取决于元组的长度。列表中的每个元组都是唯一的，并且每个元组都由N个单词组成。

一个示例元组可能具有长度N (Word 1, Word 2, Word 3, ..., Word N)

对于列表中的任何元组，其中的第j个元素将是''或Word j

一个非常简化的例子是由字母组成的元组。

l = [('A', 'B', '', ''), ('A', 'B', 'C', ''), 
     ('', '', '', 'D'), ('A', '', '', 'D'), 
     ('', 'B', '', '')]

每个元组中的每个位置要么具有相同的值，要么为空。我想要删除所有那些在另一个元组中，相同位置上所有非 '' 值都已存在的元组。例如，(A,B,'','') 元组中所有非 '' 值都存在于 (A,B,C,'') 元组中，因此应将其删除。

filtered_l = [(A,B,C,''),(A,'','',D)]

元组的长度始终相同（不一定为4）。元组的长度将介于2-10之间。

如何以最快的方式完成此操作？

- kspr

1

看起来你已经有了一些好的答案，但你可能也想研究一下集合操作。例如：https://stackoverflow.com/questions/52987903/tuple-is-subset-of-another-tuple-apriori-algortihm 或 https://dev59.com/wGUp5IYBdhLWcg3wo4qc - chris

@kepr 我想我已经将它简化成了一行代码（如下）。看看是否符合你的要求。 - pylang

6个回答

5

对于这个限制条件，显然的解决方案是将每个元组转换为位掩码，累积它们到一个计数器数组中，执行子集和变换，然后过滤数组l。

详见代码注释的解释。

时间复杂度显然为n + m * 2^m，其中n是元组的数量，m是每个元组的长度。对于n == 1000和m == 10，这比n^2明显更快。

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
# assumes that l is not empty. (to access l[0])
# The case where l is empty is trivial to handle.

def tuple_to_mask(tuple_):
    # convert the information whether each value in (tuple_) is empty to a bit mask
    # (1 is empty, 0 is not empty)
    return sum((value == '') << index for index, value in enumerate(tuple_))


count = [0] * (1 << len(l[0]))
for tuple_ in l:
    # tuple_ is a tuple.
    count[tuple_to_mask(tuple_)] += 1

# now count[mask] is the number of tuples in l with that mask

# transform the count array.
for dimension in range(len(l[0])):
    for mask in range(len(count)):
        if mask >> dimension & 1:
            count[mask] += count[mask - (1 << dimension)]

# now count[mask] is the number of tuples in l with a mask (mask_) such that (mask) contains (mask_)
# (i.e. all the bits that are set in mask_ are also set in mask)


filtered_l = [tuple_ for tuple_ in l if count[tuple_to_mask(tuple_)] == 1]
print(filtered_l)

- user202729

4

我不确定这是否是最有效或最符合Python习惯的方式，但这是一个直接的方法（也许其他人会使用更复杂的列表推导方法）：

看一下这个：

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]

def item_in_list(item, l):
    for item2comp in l:
        if item!=item2comp:
            found = True
            for part,rhs_part in zip(item, item2comp):
                if part!='' and part!=rhs_part:
                    found = False
                    break
            if found:
                return True
    return False
            
                
            
new_arr = []
for item in l:
    if not item_in_list(item, l):
        new_arr.append(item)
print(new_arr)

输出：

[('A', 'B', 'C', ''), ('A', '', '', 'D')]

据我所见，时间复杂度为 - O((N**2)*M)。

N - 列表中元素的数量。

M - 每个元素中部分的数量。

- Yossi Levi

4

L = [('A', 'B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
keys = collections.defaultdict(lambda: collections.defaultdict(set))

# maintain a record of tuple-indices that contain each character in each position
for i,t in enumerate(L):
    for c,e in enumerate(t):
        if not e: continue
        keys[e][c].add(i)

delme = set()
for i,t in enumerate(L):
    collocs = set.intersection(*[keys[e][c] for c,e in enumerate(t) if e])
    if len(collocs)>1:  # if all characters appear in this position in >1 index
        # ignore the collocation with the most non-empty characters
        # mark the rest for deletion
        C = max(collocs, key=lambda i: sum(bool(e) for bool in L[i]))
        for c in collocs:
            if c!=C: delme.add(c)

filtered = [t for i,t in enumerate(L) if i not in delme]

- inspectorG4dget

应该是 sum(bool(e) for e in L[i]))。 - dawg

4

这些字符串总是在同一个位置，因此我用布尔值替换它们，以便更容易地进行比较。首先进行排序，然后只保留元素，如果与所有其他元素相比，前一个元素无论在哪里都为 true 或与后续元素相同。然后完成比较后，将其从列表中删除。

f = sorted(map(lambda x: list(map(bool, x)), l), key=sum, reverse=True)

to_keep = []

while len(f) > 1:
    if all(map(lambda x, y: True if x == y or x else False, f[0], f[1])):
        to_keep.append(len(l) - len(f) + 1)
    f = f[1:]

print([l[i] for i in to_keep])

[('A', 'B', 'C', ''), ('A', '', '', 'D')]

在43.7微秒内，它的速度也是最高票答案的两倍。

- Nicolas Gervais

这对我不起作用。例如：l = [(A,'',C),('',B,C),('','',C),(A,'',''),('','',''),(A,B,''),('',B,'')]您的解决方案仅返回[('',B,'')]，它应该返回[(A,'',C),('',B,C),(A,B,'')]。 - kspr

1

考虑每个序列都是一个集合。现在我们只需丢弃所有子集。

给定：

import itertools as it


expected = {("A", "B", "C", ""), ("A", "", "", "D")}
data = [
    ("A", "B", "", ""),
    ("A", "B", "C", ""), 
    ("", "", "", "D"), 
    ("A", "", "", "D"), 
    ("", "B", "", "")
]

代码

一种将集合转换并比较的迭代解决方案。

def discard_subsets(pool: list) -> set:
    """Return a set without subsets."""
    discarded = set()

    for n, k in it.product(pool, repeat=2):                 # 1

        if set(k) < set(n)):                                # 2
            discarded.add(k)

    return set(pool) - discarded                            # 3

一种类似的单行解决方案。

set(data) - {k for n, k in it.product(data, repeat=2) if set(k) < set(n)}

演示

discard_subsets(data)
# {('A', '', '', 'D'), ('A', 'B', 'C', '')}

细节

后面的函数是有注释的，以帮助解释每个部分：

将所有元素相互比较（或使用嵌套循环）。
如果一个元素是一个真子集（见下文），则舍弃它。
从池中删除被舍弃的元素。

为什么要使用集合？

池中的每个元素都可以是一个集合，因为相关的子元素是唯一的，即 "A"，"B"，"C"，"D"，""。

集合具有成员属性。因此，例如，

("A", "B", "", "") 具有 ("A", "B", "C", "") 中的所有值

也可以表示为

集合 {"A", "B", "", ""} 是集合 {"A", "B", "C", ""} 的子集

现在只需要比较所有元素并拒绝所有 真子集。

a, a_, ac = {"a"}, {"a"}, {"a", "c"}

# Subsets
assert a.issubset(a_)                                       
assert a <= a_
assert a <= ac

# Proper subsets
assert not a < a_
assert a < ac

复杂度

由于我们基本上有嵌套循环，最好的情况下我们得到O(n^2)的复杂度。这可能不是最有效的方法，但希望足够清晰易懂。

测试

f = discard_subsets
assert {("A", "B", "C", "")} == f([("A", "B", "", ""), ("A", "B", "C", "")])
assert {("A", "B", "C", "")} == f([("A", "B", "C", ""), ("A", "B", "", "")])
assert {("A", "B", "C", ""), ("", "", "", "D")} == f([("A", "B", "", ""), ("A", "B", "C", ""), ("", "", "", "D")])
assert {("A", "B", "C", ""), ("", "", "", "D")} == f([("", "", "", "D"), ("A", "B", "", ""), ("A", "B", "C", "")])
assert {("A", "B", "C", ""), ("", "", "", "D")} == f([("A", "B", "C", ""), ("", "", "", "D"), ("A", "B", "", "")])
assert {("A", "B", "C", ""), ("", "", "", "D")} == f([("A", "B", "C", ""), ("A", "B", "", ""), ("", "", "", "D")])
assert {("A","","C"), ("","B","C"), ("A","B","")} == f([("A","","C"),("","B","C"),("","","C"),("A","",""),("","",""),("A","B",""),("","B","")])
assert set(expected) == f(data)

- pylang

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Green Cloak Guy · Accepted Answer

我们可以将每个元组概念化为一个二进制数组，其中1表示“包含某些内容”，2表示“包含空字符串”。由于在每个位置上的项目都是相同的，因此我们不需要关心每个位置上的具体内容，只需要知道是否有内容存在。

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
l_bin = [sum(2**i if k else 0 for i,k in enumerate(tup)) for tup in l]
# [3, 7, 8, 9, 2]
# [0b0011, 0b0111, 0b1000, 0b1001, 0b0010]
# that it's backwards doesn't really matter, since it's consistent

现在，我们可以遍历这个列表并构建一个没有“重复项”的新数据结构。由于我们的元组已被编码为二进制，因此我们可以通过执行按位运算来确定是否有重复项被“包含”在另一个元组中 - 给定a和b，如果a | b == a，那么a必须包含b。

codes = {}
for tup, b in zip(l, l_bin):
    # check if any existing code contains the potential new one
    # in this case, skip adding the new one
    if any(a | b == a for a in codes):
        continue
    # check if the new code contains a potential existing one or more
    # in which case, replace the existing code(s) with the new code
    for a in list(codes):
        if b | a == b:
            codes.pop(a)
    # and finally, add this code to our datastructure
    codes[b] = tup

现在我们可以撤回我们的“筛选过”的元组列表：

output = list(codes.values())
# [('A', 'B', 'C', ''), ('A', '', '', 'D')]

请注意，(A, B, C, '') 包含了 (A, B, '', '') 和 ('', B, '', '')，而 (A, '', '', D') 包含了 ('', '', '', D)，因此这应该是正确的。

从 python 3.8 开始，dict 会保留插入顺序，因此输出应该与元组最初出现在列表中的顺序相同。

该解决方案可能不完全高效，因为代码数量可能会堆积，但它应该介于 O(n) 和 O(n^2) 之间，具体取决于最后剩余的唯一代码数量（由于每个元组的长度明显小于 l 的长度，因此它应该更接近于 O(n) 而不是 O(n^2)。