How do I find the duplicates in a list and create another list with them?

Question


python list duplicates

697

How do I find the duplicates in a list of integers and create another list containing those duplicates?

- MFB

2

Possible duplicate: How do you remove duplicates from a list in Python while preserving order? - DhruvPathak

3

Do you want each duplicated item reported only once, or every time it is seen again? - moooeeeep

I think this has already been answered more efficiently here: https://dev59.com/wnRB5IYBdhLWcg3wXmRI#642919 intersection is a built-in set method and should do exactly what is needed. - Tom Smith

44 Answers


- Julien Perrenoud · Answer 1

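Some of the solutions here have quadratic complexity, are verbose, or require third-party libraries. Here is a simple two-line answer that uses a "seen set" strategy: the time complexity stays linear, and the memory used at most doubles.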

def duplicates(array):
    seen = set()
    return { val for val in array if (val in seen or seen.add(val)) }

---
duplicates(["a", "b", "c", "a", "d"])
> {'a'}

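How it works

val in seen is true when seen already contains the value, so the element is included in the returned set
otherwise, or seen.add(val) adds the element to the seen set (without including it in the result, because add() returns None, which is falsy in this context)
the whole expression is wrapped in a set comprehension, which ensures each duplicated element is returned only once.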
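The steps above can also be written as an explicit loop, which may be easier to follow than the short-circuit trick. This is only a sketch: the name duplicates_verbose and the second set dups are illustrative and not part of the original answer.

```python
def duplicates_verbose(array):
    """Explicit-loop equivalent of the set comprehension (illustrative name)."""
    seen = set()   # every value encountered so far
    dups = set()   # values encountered more than once
    for val in array:
        if val in seen:
            dups.add(val)    # second or later occurrence
        else:
            seen.add(val)    # first occurrence
    return dups


print(duplicates_verbose(["a", "b", "c", "a", "d"]))  # {'a'}
```

Both versions make a single pass over the input, so the running time stays linear for hashable elements.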

- Baris Ozensel · Answer 2

Another solution, which does not use any collection library, is shown below.

a = [1,2,3,5,4,6,4,21,4,6,3,32,5,2,23,5]
duplicates = []

for i in a:
    if a.count(i) > 1 and i not in duplicates:
        duplicates.append(i)

print(duplicates)

The output is [2, 3, 5, 4, 6].

- Ravikiran D · Answer 3

You can find the duplicate elements in a list with the list.count() method.


arr=[]
dup =[]
for i in range(int(input("Enter range of list: "))):
    arr.append(int(input("Enter Element in a list: ")))
for i in arr:
    if arr.count(i)>1 and i not in dup:
        dup.append(i)
print(dup)

- Wizr · Answer 4

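Here is a one-liner, for fun or for cases where a single statement is required. It evaluates to a (uniq, dup) tuple of sets; on Python 3, reduce has to be imported from functools.

from functools import reduce
(lambda iterable: reduce(lambda acc, item: (acc[0], acc[1] | {item}) if item in acc[0] else (acc[0] | {item}, acc[1]), iterable, (set(), set())))(some_iterable)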

- sergzach · Answer 5

Some more tests. Of course, doing...

set([x for x in l if l.count(x) > 1])

...is too expensive. The final method below is roughly 500 times faster in this test (and the gap grows with longer lists):

def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    # keep only the items that occur more than once
    result_d = {key: val for key, val in d.items() if val > 1}

    return list(result_d.keys())

Only two loops, and no very expensive l.count() calls.

The code used to compare the methods is given further down. Here is its output:

dups_count: 13.368s # this is a function which uses l.count()
dups_count_dict: 0.014s # this is a final best function (of the 3 functions)
dups_count_counter: 0.024s # collections.Counter

Test code:

import numpy as np
from time import time
from collections import Counter

class TimerCounter(object):
    def __init__(self):
        self._time_sum = 0

    def start(self):
        self.time = time()

    def stop(self):
        self._time_sum += time() - self.time

    def get_time_sum(self):
        return self._time_sum


def dups_count(l):
    return set([x for x in l if l.count(x) > 1])


def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.items() if val > 1}

    return list(result_d.keys())


def dups_counter(l):
    counter = Counter(l)

    result_d = {key: val for key, val in counter.items() if val > 1}

    return list(result_d.keys())



def gen_array():
    np.random.seed(17)
    return list(np.random.randint(0, 5000, 10000))


def assert_equal_results(*results):
    primary_result = results[0]
    other_results = results[1:]

    for other_result in other_results:
        assert set(primary_result) == set(other_result) and len(primary_result) == len(other_result)


if __name__ == '__main__':
    dups_count_time = TimerCounter()
    dups_count_dict_time = TimerCounter()
    dups_count_counter = TimerCounter()

    l = gen_array()

    for i in range(3):
        dups_count_time.start()
        result1 = dups_count(l)
        dups_count_time.stop()

        dups_count_dict_time.start()
        result2 = dups_count_dict(l)
        dups_count_dict_time.stop()

        dups_count_counter.start()
        result3 = dups_counter(l)
        dups_count_counter.stop()

        assert_equal_results(result1, result2, result3)

    print('dups_count: %.3f' % dups_count_time.get_time_sum())
    print('dups_count_dict: %.3f' % dups_count_dict_time.get_time_sum())
    print('dups_count_counter: %.3f' % dups_count_counter.get_time_sum())
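For a quicker check that does not need numpy, a similar comparison can be sketched with the standard-library timeit module. The list size, the value range, and the repeat count below are arbitrary assumptions chosen to keep the run short, so the absolute timings will not match the figures quoted above.

```python
import random
import timeit
from collections import Counter


def dups_count(l):
    # quadratic: l.count() rescans the whole list for every element
    return {x for x in l if l.count(x) > 1}


def dups_counter(l):
    # linear: one counting pass, then one pass over the counts
    return {key for key, val in Counter(l).items() if val > 1}


random.seed(17)
data = [random.randint(0, 499) for _ in range(1000)]  # assumed sample size

# both functions must agree before timing them
assert dups_count(data) == dups_counter(data)

t_count = timeit.timeit(lambda: dups_count(data), number=3)
t_counter = timeit.timeit(lambda: dups_counter(data), number=3)
print('dups_count:   %.4f s' % t_count)
print('dups_counter: %.4f s' % t_counter)
```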

- All Іѕ Vаиітy · Answer 6

raw_list = [1,2,3,3,4,5,6,6,7,2,3,4,2,3,4,1,3,4,]

clean_list = list(set(raw_list))
duplicated_items = []

for item in raw_list:
    try:
        clean_list.remove(item)
    except ValueError:
        duplicated_items.append(item)


print(duplicated_items)
# [3, 6, 2, 3, 4, 2, 3, 4, 1, 3, 4]

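You remove the duplicates by converting the list to a set (clean_list), then iterate over raw_list and remove each item from clean_list to account for its first occurrence in raw_list. If the item is no longer found, the resulting ValueError is caught and the item is appended to the duplicated_items list.

If you need the indices of the duplicated items, just enumerate the list (for index, item in enumerate(raw_list):). This approach works for large lists (thousands of elements or more) and is intended to be faster and better optimized.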
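The enumerate variant mentioned above could look like the following. This is a sketch only: the names remaining and duplicate_positions are illustrative rather than taken from the answer, and a shorter sample list is used.

```python
raw_list = [1, 2, 3, 3, 4, 5, 6, 6, 7, 2]

remaining = list(set(raw_list))   # one copy of each distinct value
duplicate_positions = []          # (index, item) pairs for repeated occurrences

for index, item in enumerate(raw_list):
    try:
        remaining.remove(item)    # first occurrence: consume it
    except ValueError:
        # item was already consumed, so this occurrence is a repeat
        duplicate_positions.append((index, item))

print(duplicate_positions)
# [(3, 3), (7, 6), (9, 2)]
```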

- Haresh Shyara · Answer 7

list2 = [1, 2, 3, 4, 1, 2, 3]
lset = set()         # values seen so far
duplicates = set()   # values seen more than once
for item in list2:
    if item in lset:
        duplicates.add(item)
    else:
        lset.add(item)
print(list(duplicates))

- Andreas Profous · Answer 8

With toolz:

>>> from toolz import frequencies, valfilter
>>> a = [1,2,2,3,4,5,4]
>>> list(valfilter(lambda count: count > 1, frequencies(a)).keys())
[2, 4]

- tvt173 · Answer 9

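There are a lot of answers here, but I think this is a comparatively readable and easy-to-understand approach. It expects a list that is already sorted: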

def get_duplicates(sorted_list):
    duplicates = []
    if not sorted_list:  # an empty list has no duplicates
        return set()
    last = sorted_list[0]
    for x in sorted_list[1:]:
        if x == last:
            duplicates.append(x)
        last = x
    return set(duplicates)

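Notes:

If you want to preserve the duplicate counts, drop the conversion to set at the bottom to get the full list.
If you prefer a generator, replace duplicates.append(x) with yield x and drop the return statement at the bottom (you can convert the result to a set later).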
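The generator variant described in the second note might look like this. It is a sketch: the name iter_duplicates and the empty-input guard are additions for illustration, not part of the original answer.

```python
def iter_duplicates(sorted_list):
    """Yield each element that equals its predecessor in a sorted list."""
    if not sorted_list:      # guard added here: nothing to yield for empty input
        return
    last = sorted_list[0]
    for x in sorted_list[1:]:
        if x == last:
            yield x
        last = x


data = sorted([3, 1, 2, 3, 2, 3])
print(list(iter_duplicates(data)))  # [2, 3, 3]
print(set(iter_duplicates(data)))   # {2, 3}
```

Converting to a list keeps one entry per repeated occurrence, while converting to a set reports each duplicated value once.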

- rassa45 · Answer 10

A one-line solution (lst is the input list; the name list is avoided so the built-in is not shadowed):

set([i for i in lst if sum([1 for a in lst if a == i]) > 1])