How to combine two lists into a list of unique pairs

Question

5

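I am working with very long lists and am trying to come up with an iterative solution that combines the two lists into unique pairs.

For example, I have the following lists: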

a = [TF1,Tar1]
b = [Tar1, TF1]

I would like an iterator (if possible) that yields the following tuples:

(TF1,Tar1)    
(TF1,TF1)  
(Tar1,Tar1)

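This excludes (Tar1, TF1), because the same pair in the opposite order has already been added.

My current approach is to loop over each list and use a dictionary to keep track of what has been added. Because list a has 12,000 items and list b has 15,000, this uses a huge amount of RAM. The resulting dictionary holds about a*b/2 entries, which is 90M entries in this case.

Any suggestions are welcome. Thanks.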

- user3417525

2

Can a list contain duplicate elements? For example: a = [TF1, Tar1, TF1] - Gargamel

@Gargamel Look at his example. - simonzack

2

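I did, but it doesn't answer my question, unless I missed something? - Gargamel

Does the order of the results matter? - Blckknght

Do you want 2-tuples (pairs) as output, or longer tuples? You say your lists have lengths 12000 and 15000. - Totem

4 Answers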

1

To generate the pairs iteratively, take a look at the itertools.product function:

>>> l1 = [1, 2, 3]
>>> l2 = [1, 3, 7]
>>> import itertools
>>> list(itertools.product(l1, l2))
[(1, 1), (1, 3), (1, 7), (2, 1), (2, 3), (2, 7), (3, 1), (3, 3), (3, 7)]

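However, I don't think the duplicate pairs can be removed without keeping track of the pairs you have already seen.

To remove the duplicates in memory, I would sort each tuple and put the results in a set: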

>>> pairs = list(itertools.product(l1, l2))
>>> set(map(tuple, map(sorted, pairs)))
set([(1, 2), (2, 7), (1, 3), (3, 3), (2, 3), (1, 7), (3, 7), (1, 1)])

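If you want to keep memory usage low and can use disk, I would suggest a merge sort backed by files on disk, similar to this approach. While iterating over the results of itertools.product, sort each pair and write it to disk. Then run the merge sort and read back the sorted list, dropping duplicates (they will be adjacent).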
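The disk-backed idea can be sketched roughly as follows. This is only an illustration under some assumptions, not the approach the answer links to: the function name unique_pairs_on_disk, the helper names and the chunk_size parameter are made up here, the elements are assumed to be hashable, picklable and mutually comparable, and the pairs come out in sorted order rather than in product order.

```python
import heapq
import itertools
import os
import pickle
import tempfile


def _write_run(sorted_pairs):
    """Pickle one sorted chunk of pairs to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "wb") as f:
        for pair in sorted_pairs:
            pickle.dump(pair, f)
    return path


def _read_run(path):
    """Yield the pairs stored in one run file, in the order they were written."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def unique_pairs_on_disk(a, b, chunk_size=100_000):
    """Yield each unordered pair from product(a, b) once, in sorted order."""
    # Normalise each pair so that (x, y) and (y, x) become the same tuple.
    normalised = (tuple(sorted(p)) for p in itertools.product(a, b))
    run_paths = []
    try:
        # Phase 1: write sorted, locally deduplicated chunks ("runs") to disk.
        while True:
            chunk = list(itertools.islice(normalised, chunk_size))
            if not chunk:
                break
            run_paths.append(_write_run(sorted(set(chunk))))
        # Phase 2: merge the runs; equal pairs are adjacent in the merged stream.
        previous = None
        for pair in heapq.merge(*(_read_run(p) for p in run_paths)):
            if pair != previous:
                yield pair
                previous = pair
    finally:
        for path in run_paths:
            os.remove(path)
```

Only one chunk is held in memory at a time, plus one pending pair per run during the merge. Every run file stays open while merging, so for very large inputs a bigger chunk_size (or merging in several passes) may be needed to stay under the operating system's open-file limit.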

- jterrace

1

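I think you can avoid duplicates without storing every value generated so far. Instead, check which generated pairs will be generated again later in reverse, and keep track of only those. If the number of collisions is not large, this needs much less memory (though it is still O(M*N) in the worst case).

Here is how I would do it: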

import itertools

def product_without_reversed_duplicates(a, b):
    a_set = set(a)
    b_set = set(b)
    dupes = set()

    for x, y in itertools.product(a, b):
        if (x, y) not in dupes: # take (x, y) only if it is not a dupe of a previous item
            yield x, y
            if x in b_set and y in a_set:  # test if (y, x) will be generated later
                dupes.add((y, x))          # if so, add it to the set to be skipped

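Note that this assumes a and b contain no internal duplicates, and that you want to preserve the order of the product as far as possible (only skipping the reversed pairs). If a or b may contain duplicates, you should iterate over itertools.product(a_set, b_set) instead. That, however, gives the results in arbitrary order. You could work around this with an extra step that deduplicates a and b while preserving their order, but I'll leave that code for you to work out if you need it.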
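One possible way to do the order-preserving deduplication step is shown below. It is a sketch rather than part of the original answer: dedupe_keep_order is a hypothetical helper name, it assumes hashable items and Python 3.7+ (where dict keeps insertion order), and the generator is repeated only so the snippet runs on its own.

```python
import itertools


def dedupe_keep_order(seq):
    """Drop repeated items, keeping the first occurrence of each."""
    return list(dict.fromkeys(seq))  # dict preserves insertion order


def product_without_reversed_duplicates(a, b):
    """Same generator as in the answer, repeated so this sketch is self-contained."""
    a_set = set(a)
    b_set = set(b)
    dupes = set()
    for x, y in itertools.product(a, b):
        if (x, y) not in dupes:
            yield x, y
            if x in b_set and y in a_set:  # (y, x) will be generated later
                dupes.add((y, x))


a = dedupe_keep_order(["TF1", "Tar1", "TF1"])
b = dedupe_keep_order(["Tar1", "TF1"])
pairs = list(product_without_reversed_duplicates(a, b))
print(pairs)  # [('TF1', 'Tar1'), ('TF1', 'TF1'), ('Tar1', 'Tar1')]
```

With the lists from the question (plus one repeated item in a), this yields the three tuples the question asks for, in product order.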

- Blckknght

1

This is fairly tricky, but here is one way to do it that needs only O(n) extra memory.

xs = ['a', 'b', 'd']
ys = ['b', 'a', 'c']

def unique(seq):
    seen = set()
    seen_add = seen.add
    return [ x for x in seq if not (x in seen or seen_add(x))]

xs = unique(xs)
ys = unique(ys)

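ys_set = set(ys)  # build the lookup set once instead of on every inner iteration
x_added = set()   # the xs whose rows have already been fully printed
for x in xs:
    for y in ys:
        # (y, x) was already printed if y was an earlier x and x is one of the ys
        if y in x_added and x in ys_set:
            continue
        print(x, y)
    x_added.add(x)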

Output:

a b
a a
a c
b b
b c
d b
d a
d c

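Basically, if y is an x that has already been processed and x is one of the ys, then we know the pair (y, x) has already been generated, because we have iterated over every y for each previous x. The uniqueness requirement just makes the special cases easier to handle.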
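Since the question asks for an iterator, the same loop could also be wrapped in a generator. The following is a sketch of that idea, not the answer's own code: the name unique_pairs is made up, and dict.fromkeys is swapped in for the unique helper (it assumes hashable items and Python 3.7+).

```python
def unique_pairs(xs, ys):
    """Lazily yield pairs (x, y), skipping any pair whose reverse was already yielded."""
    xs = list(dict.fromkeys(xs))  # order-preserving deduplication
    ys = list(dict.fromkeys(ys))
    ys_set = set(ys)
    xs_done = set()  # the xs whose rows have been completely yielded
    for x in xs:
        x_in_ys = x in ys_set
        for y in ys:
            if x_in_ys and y in xs_done:
                continue  # (y, x) was yielded while processing the earlier row y
            yield x, y
        xs_done.add(x)


result = list(unique_pairs(['a', 'b', 'd'], ['b', 'a', 'c']))
```

The extra memory is the two sets, so it grows with the length of the lists rather than with the number of pairs.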

- simonzack

- Abhijit · Accepted Answer

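Basically, the problem arises from the elements common to both lists. If you can separate the cases of combining common and unique elements, you will have solved your problem; that is, you need to create the following Cartesian products.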

a_unique X b_unique
a_unique X b_common
a_common X b_unique
a_common X b_common

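Of these four cases, only the last one produces non-unique pairs. On second thought, the last Cartesian product, restricted to unique pairs, is simply the selection of 2 elements from a_common.

Finally, the elements are separated by building a set from each list and then comparing the two sets.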

>>> #Sample Lists
>>> a = ['C0','C1','C2','A0','A1','A2']
>>> b = ['C0','C1','C2','B0','B1','B2']
>>> from itertools import product, combinations, chain
>>> # Create sets for O(1) lookup
>>> a_key = set(a)
>>> b_key = set(b)
>>> # Segregate elements into common and unique for both lists
>>> common = a_key & b_key
>>> a = {'common': common,
         'unique': a_key - common}
>>> b = {'common': common,
         'unique': b_key - common}
>>> # Create Cartesian products for all the cases
>>> list(chain.from_iterable([product(a['unique'], b['unique']),
                      product(a['unique'], b['common']),
                      product(a['common'], b['unique']),
                      combinations(a['common'], 2)]))
[('A0', 'B0'), ('A0', 'B1'), ('A0', 'B2'), ('A1', 'B0'), ('A1', 'B1'), ('A1', 'B2'), ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B2'), ('A0', 'C0'), ('A0', 'C1'), ('A0', 'C2'), ('A1', 'C0'), ('A1', 'C1'), ('A1', 'C2'), ('A2', 'C0'), ('A2', 'C1'), ('A2', 'C2'), ('C0', 'B0'), ('C0', 'B1'), ('C0', 'B2'), ('C1', 'B0'), ('C1', 'B1'), ('C1', 'B2'), ('C2', 'B0'), ('C2', 'B1'), ('C2', 'B2'), ('C0', 'C1'), ('C0', 'C2'), ('C1', 'C2')]
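For lists as long as those in the question, the chained iterator can be consumed lazily instead of being passed to list(). The sketch below packages the partitioning as a function under a few assumptions: the name unique_pairs_by_partition is made up, the sets are sorted only to make the output order reproducible (so elements must be comparable), and combinations_with_replacement is swapped in for combinations because the question also lists self-pairs such as (TF1, TF1), which combinations(common, 2) leaves out.

```python
from itertools import chain, combinations_with_replacement, product


def unique_pairs_by_partition(a, b):
    """Return a lazy iterator over the unique unordered pairs of product(a, b)."""
    a_key, b_key = set(a), set(b)
    common = a_key & b_key
    # Sorting is optional; it just makes the iteration order deterministic.
    a_unique = sorted(a_key - common)
    b_unique = sorted(b_key - common)
    common = sorted(common)
    return chain(
        product(a_unique, b_unique),
        product(a_unique, common),
        product(common, b_unique),
        # Pairs drawn from the common elements, including (x, x).
        combinations_with_replacement(common, 2),
    )


pairs = list(unique_pairs_by_partition(["TF1", "Tar1"], ["Tar1", "TF1"]))
print(pairs)  # [('TF1', 'TF1'), ('TF1', 'Tar1'), ('Tar1', 'Tar1')]
```

Apart from the sets built from the two input lists, nothing is stored, so memory stays proportional to the length of the lists rather than to the number of pairs.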