如何在列表中找到重复项并创建另一个包含它们的列表？

Question

如何在列表中找到重复项并创建另一个包含它们的列表？

pythonlistduplicates

697

如何在整数列表中找到重复项并创建另一个包含这些重复项的列表？

- MFB

2

可能是重复的问题：如何在保留顺序的情况下从Python列表中删除重复项？ - DhruvPathak

3

你希望重复出现的内容只保留一次，还是每次看到都要重复？ - moooeeeep

我认为这个问题已经得到了更高效的回答。https://dev59.com/wnRB5IYBdhLWcg3wXmRI#642919 intersection是一个集合的内置方法，应该可以完全满足需求。 - Tom Smith

44个回答

455

一个非常简单的解决方案，但是时间复杂度为O(n*n)。

>>> xs = [1,2,3,4,4,5,5,6,1]
>>> set([x for x in xs if xs.count(x) > 1])
set([1, 4, 5])

- Ritesh Kumar

5

你为什么要使用列表推导式而不是生成器推导式？ - user3892448

114

的确，这是一个简单的解决方案，但由于每个count()函数都要重新遍历整个列表，所以复杂度会呈平方级增长，因此不适合处理大型列表。 - danuker

4

@JohnJ，冒泡排序也很简单并且可用。但这并不意味着我们应该使用它！ - John La Rooy

1

@watsonic：你的“简单开关”未能将时间复杂度从二次降低到平方级别，无法在普遍情况下实现。仅用 set(l) 替换 l 只能降低最坏时间复杂度，对于答案的大规模效率问题没有任何作用。这可能并不是那么简单。简而言之，不要这样做。 - Cecil Curry

3

不要这样做。 - user3064538

显示剩余5条评论

104

您不需要计数，只需知道该项是否曾经被查看过。将该答案改编为此问题：

def list_duplicates(seq):
  seen = set()
  seen_add = seen.add
  # adds all elements it doesn't know yet to seen and all other to seen_twice
  seen_twice = set( x for x in seq if x in seen or seen_add(x) )
  # turn the set into a list (as requested)
  return list( seen_twice )

a = [1,2,3,2,1,5,6,5,5,5]
list_duplicates(a) # yields [1, 2, 5]

如果速度很重要，以下是一些时间:

# file: test.py
import collections

def thg435(l):
    return [x for x, y in collections.Counter(l).items() if y > 1]

def moooeeeep(l):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set( x for x in l if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list( seen_twice )

def RiteshKumar(l):
    return list(set([x for x in l if l.count(x) > 1]))

def JohnLaRooy(L):
    seen = set()
    seen2 = set()
    seen_add = seen.add
    seen2_add = seen2.add
    for item in L:
        if item in seen:
            seen2_add(item)
        else:
            seen_add(item)
    return list(seen2)

l = [1,2,3,2,1,5,6,5,5,5]*100

这里是结果：（好样的@JohnLaRooy！）

$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
10000 loops, best of 3: 74.6 usec per loop
$ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 91.3 usec per loop
$ python -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 266 usec per loop
$ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'
100 loops, best of 3: 8.35 msec per loop

有趣的是，除了时间之外，当使用pypy时，排名也略微改变。最有趣的是，基于计数器的方法从pypy的优化中受益匪浅，而我建议的方法缓存方法似乎几乎没有影响。

$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
100000 loops, best of 3: 17.8 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
10000 loops, best of 3: 23 usec per loop
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 39.3 usec per loop

显然，这种效应与输入数据的“重复性”有关。我设置了l = [random.randrange(1000000) for i in xrange(10000)]并得到了以下结果：

$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
1000 loops, best of 3: 495 usec per loop
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
1000 loops, best of 3: 499 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 1.68 msec per loop

- moooeeeep

8

好的，seen_add = seen.add 的目的是什么？ - serverpunk

4

这样，您只需调用之前查找过的函数即可。否则，每次需要插入内容时，都需要查找（查询字典）成员函数add。 - moooeeeep

使用自己的数据和IPython的%timeit测试了一下，你的方法在测试中看起来确实是最快的，但是：“最慢的运行时间比最快的运行时间长了4.34倍。这可能意味着一个中间结果被缓存了”。 - Joop

1

@moooeeeep，我为你的脚本添加了另一个版本供你尝试 :) 如果你手头有pypy并且想要提高速度，请也尝试一下。 - John La Rooy

@JohnLaRooy，性能有了很大的提升！有趣的是，当我使用pypy来评估结果时，基于Counter的方法显著改善了。 - moooeeeep

87

您可以使用 iteration_utilities.duplicates 工具来完成此操作：

>>> from iteration_utilities import duplicates

>>> list(duplicates([1,1,2,1,2,3,4,2]))
[1, 1, 2, 2]

如果您只需要每个重复项中的一个，可以将此代码与iteration_utilities.unique_everseen组合使用:

>>> from iteration_utilities import unique_everseen

>>> list(unique_everseen(duplicates([1,1,2,1,2,3,4,2])))
[1, 2]

它也可以处理不可哈希的元素（但牺牲了性能）：

>>> list(duplicates([[1], [2], [1], [3], [1]]))
[[1], [1]]

>>> list(unique_everseen(duplicates([[1], [2], [1], [3], [1]])))
[[1]]

这是仅有少数其他方法才能处理的内容。

基准测试

我做了一个包含大部分(但不是全部)提到过的方法的简短基准测试。

第一个基准测试只包括一小段列表长度，因为一些方法具有 O(n**2) 行为。

在图表中，y轴表示时间，所以较低的值意味着更好。它还绘制了对数对数，因此可以更好地可视化广泛的值范围：

除去 O(n**2) 的方法，我又进行了一个500000个元素的列表基准测试：

正如你所看到的，iteration_utilities.duplicates 方法比其他任何方法都快，甚至链接 unique_everseen(duplicates(...)) 也比其他方法快或同样快。

这里额外需要注意的一件事是，Pandas方法对于小列表非常慢，但对于长列表可以轻松竞争。

然而，正如这些基准测试所显示的，大多数方法的执行效果都差不多，所以使用哪种方法并不重要(除了3个具有 O(n**2) 运行时间的方法)。

from iteration_utilities import duplicates, unique_everseen
from collections import Counter
import pandas as pd
import itertools

def georg_counter(it):
    return [item for item, count in Counter(it).items() if count > 1]

def georg_set(it):
    seen = set()
    uniq = []
    for x in it:
        if x not in seen:
            uniq.append(x)
            seen.add(x)

def georg_set2(it):
    seen = set()
    return [x for x in it if x not in seen and not seen.add(x)]   

def georg_set3(it):
    seen = {}
    dupes = []

    for x in it:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1

def RiteshKumar_count(l):
    return set([x for x in l if l.count(x) > 1])

def moooeeeep(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set( x for x in seq if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list( seen_twice )

def F1Rumors_implementation(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in zip(a, b):
        if k != g: continue
        if k != r:
            yield k
            r = k

def F1Rumors(c):
    return list(F1Rumors_implementation(c))

def Edward(a):
    d = {}
    for elem in a:
        if elem in d:
            d[elem] += 1
        else:
            d[elem] = 1
    return [x for x, y in d.items() if y > 1]

def wordsmith(a):
    return pd.Series(a)[pd.Series(a).duplicated()].values

def NikhilPrabhu(li):
    li = li.copy()
    for x in set(li):
        li.remove(x)

    return list(set(li))

def firelynx(a):
    vc = pd.Series(a).value_counts()
    return vc[vc > 1].index.tolist()

def HenryDev(myList):
    newList = set()

    for i in myList:
        if myList.count(i) >= 2:
            newList.add(i)

    return list(newList)

def yota(number_lst):
    seen_set = set()
    duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
    return seen_set - duplicate_set

def IgorVishnevskiy(l):
    s=set(l)
    d=[]
    for x in l:
        if x in s:
            s.remove(x)
        else:
            d.append(x)
    return d

def it_duplicates(l):
    return list(duplicates(l))

def it_unique_duplicates(l):
    return list(unique_everseen(duplicates(l)))

基准测试1

from simple_benchmark import benchmark
import random

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, RiteshKumar_count, moooeeeep, 
    F1Rumors, Edward, wordsmith, NikhilPrabhu, firelynx,
    HenryDev, yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 12)}

b = benchmark(funcs, args, 'list size')

b.plot()

基准测试二

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, moooeeeep, 
    F1Rumors, Edward, wordsmith, firelynx,
    yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 20)}

b = benchmark(funcs, args, 'list size')
b.plot()

免责声明

^{1 这是我编写的第三方库：iteration_utilities提供的内容。}

- MSeifert

4

我会尽力进行翻译，原文的意思是建议编写一个专门的C库来完成工作，而不是用Python，这可能不是寻求的答案精神，但这是一种合法的方法。我喜欢这个答案的广度和结果的图形显示——很高兴看到它们正在收敛，让我想知道当输入继续增加时它们是否会交叉！问题是：与完全随机列表相比，大多数排序列表的结果是什么？ - F1Rumors

谢谢提供基准测试数据，我已经在我的回答中使用了。对于这样拥挤的图像，建议使用更高的分辨率（我保存我的时候用了 plt.savefig('duplicates.png', dpi=300)），并删除实际绘图周围的大部分空白区域。然后图像会更大、更清晰，在嵌入式视图和独立视图（单击图像时）中都能看到。更高的分辨率会增加文件大小，但可以通过减少颜色深度来抵消这一点。 - superb rain

36

在查找相关内容时，我遇到了这个问题，想知道为什么没有人提供基于生成器的解决方案？解决这个问题将会是：

>>> print list(getDupes_9([1,2,3,2,1,5,6,5,5,5]))
[1, 2, 5]

我关注可扩展性，因此测试了几种方法，包括原始方法在小列表上表现良好，但随着列表变得更大而变得非常糟糕（注意-最好使用timeit，但这是为了说明问题）。

为了进行比较，我包括了 @moooeeeep 的方法（如果输入列表完全随机，则速度最快），以及一个 itertools 方法，在大多数排序列表中速度更快... 现在还包括来自@firelynx的 pandas 方法 - 速度慢，但不算太糟糕，而且简单。注意- 在我的机器上，对于大多数有序列表，sort/tee/zip 方法始终是最快的，对于洗牌列表，moooeeeep 是最快的，但您可能会有所不同。

优点

使用相同的代码非常快速简单地测试“任何”重复项

假设

应报告重复项仅一次
不需要保留重复项顺序
重复项可能出现在列表的任何位置

最快的解决方案，1m 条目:

def getDupes(c):
        '''sort/tee/izip'''
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in itertools.izip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k

测试过的方法

import itertools
import time
import random

def getDupes_1(c):
    '''naive'''
    for i in xrange(0, len(c)):
        if c[i] in c[:i]:
            yield c[i]

def getDupes_2(c):
    '''set len change'''
    s = set()
    for i in c:
        l = len(s)
        s.add(i)
        if len(s) == l:
            yield i

def getDupes_3(c):
    '''in dict'''
    d = {}
    for i in c:
        if i in d:
            if d[i]:
                yield i
                d[i] = False
        else:
            d[i] = True

def getDupes_4(c):
    '''in set'''
    s,r = set(),set()
    for i in c:
        if i not in s:
            s.add(i)
        elif i not in r:
            r.add(i)
            yield i

def getDupes_5(c):
    '''sort/adjacent'''
    c = sorted(c)
    r = None
    for i in xrange(1, len(c)):
        if c[i] == c[i - 1]:
            if c[i] != r:
                yield c[i]
                r = c[i]

def getDupes_6(c):
    '''sort/groupby'''
    def multiple(x):
        try:
            x.next()
            x.next()
            return True
        except:
            return False
    for k, g in itertools.ifilter(lambda x: multiple(x[1]), itertools.groupby(sorted(c))):
        yield k

def getDupes_7(c):
    '''sort/zip'''
    c = sorted(c)
    r = None
    for k, g in zip(c[:-1],c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_8(c):
    '''sort/izip'''
    c = sorted(c)
    r = None
    for k, g in itertools.izip(c[:-1],c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_9(c):
    '''sort/tee/izip'''
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in itertools.izip(a, b):
        if k != g: continue
        if k != r:
            yield k
            r = k

def getDupes_a(l):
    '''moooeeeep'''
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    for x in l:
        if x in seen or seen_add(x):
            yield x

def getDupes_b(x):
    '''iter*/sorted'''
    x = sorted(x)
    def _matches():
        for k,g in itertools.izip(x[:-1],x[1:]):
            if k == g:
                yield k
    for k, n in itertools.groupby(_matches()):
        yield k

def getDupes_c(a):
    '''pandas'''
    import pandas as pd
    vc = pd.Series(a).value_counts()
    i = vc[vc > 1].index
    for _ in i:
        yield _

def hasDupes(fn,c):
    try:
        if fn(c).next(): return True    # Found a dupe
    except StopIteration:
        pass
    return False

def getDupes(fn,c):
    return list(fn(c))

STABLE = True
if STABLE:
    print 'Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array'
else:
    print 'Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array'
for location in (50,250000,500000,750000,999999):
    for test in (getDupes_2, getDupes_3, getDupes_4, getDupes_5, getDupes_6,
                 getDupes_8, getDupes_9, getDupes_a, getDupes_b, getDupes_c):
        print 'Test %-15s:%10d - '%(test.__doc__ or test.__name__,location),
        deltas = []
        for FIRST in (True,False):
            for i in xrange(0, 5):
                c = range(0,1000000)
                if STABLE:
                    c[0] = location
                else:
                    c.append(location)
                    random.shuffle(c)
                start = time.time()
                if FIRST:
                    print '.' if location == test(c).next() else '!',
                else:
                    print '.' if [location] == list(test(c)) else '!',
                deltas.append(time.time()-start)
            print ' -- %0.3f  '%(sum(deltas)/len(deltas)),
        print
    print

“全部重复项”测试的结果是一致的，它在这个数组中找到了“第一个”重复项，然后是“所有”重复项。”

Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array
Test set len change :    500000 -  . . . . .  -- 0.264   . . . . .  -- 0.402  
Test in dict        :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.250  
Test in set         :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.249  
Test sort/adjacent  :    500000 -  . . . . .  -- 0.159   . . . . .  -- 0.229  
Test sort/groupby   :    500000 -  . . . . .  -- 0.860   . . . . .  -- 1.286  
Test sort/izip      :    500000 -  . . . . .  -- 0.165   . . . . .  -- 0.229  
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.145   . . . . .  -- 0.206  *
Test moooeeeep      :    500000 -  . . . . .  -- 0.149   . . . . .  -- 0.232  
Test iter*/sorted   :    500000 -  . . . . .  -- 0.160   . . . . .  -- 0.221  
Test pandas         :    500000 -  . . . . .  -- 0.493   . . . . .  -- 0.499

当列表首先被洗牌时，排序的价格变得明显——效率显著下降，@moooeeeep的方法占主导地位，而set和dict方法相似但表现较差：

Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array
Test set len change :    500000 -  . . . . .  -- 0.321   . . . . .  -- 0.473  
Test in dict        :    500000 -  . . . . .  -- 0.285   . . . . .  -- 0.360  
Test in set         :    500000 -  . . . . .  -- 0.309   . . . . .  -- 0.365  
Test sort/adjacent  :    500000 -  . . . . .  -- 0.756   . . . . .  -- 0.823  
Test sort/groupby   :    500000 -  . . . . .  -- 1.459   . . . . .  -- 1.896  
Test sort/izip      :    500000 -  . . . . .  -- 0.786   . . . . .  -- 0.845  
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.743   . . . . .  -- 0.804  
Test moooeeeep      :    500000 -  . . . . .  -- 0.234   . . . . .  -- 0.311  *
Test iter*/sorted   :    500000 -  . . . . .  -- 0.776   . . . . .  -- 0.840  
Test pandas         :    500000 -  . . . . .  -- 0.539   . . . . .  -- 0.540

- F1Rumors

2

这个答案非常好。我不明白为什么它的解释和测试没有更多的分数，这对那些需要它的人非常有用。 - dlewin

1

Python的排序在只有一个项目未排序时为O(n)。您应该使用random.shuffle(c)来解决这个问题。此外，当我运行未更改的脚本时，无法复制您的结果（完全不同的排序），因此可能也取决于CPU。 - John La Rooy

谢谢@John-La-Rooy，对于CPU/本地机器的敏锐观察 - 所以我应该修改项目_YYMV_。使用**O(n)**排序是有意为之的：重复元素被插入到不同的位置，特别是在这些方法中，如果在好的（列表开头）或坏的（列表结尾）位置有唯一的重复项，则可以看到方法的影响。我考虑过一个随机列表 - 例如random.shuffle - 但决定只有在我做了更多的运行时才有意义！我将不得不返回并基准测试多次运行/洗牌等效性，并查看其影响。 - F1Rumors

2

itertools.tee 添加了不必要的开销。尝试使用 c = sorted(c)，然后使用 a, b = iter(c), iter(c)。 - superb rain

@superb_rain 我刚试了一下。你的使用情况可能会有所不同，但在我的空间中，这显示出明显的收益：你是正确的，tee 的开销比控制自己的迭代器更昂贵，一旦它被实现。很好发现！ - F1Rumors

显示剩余4条评论

23

使用pandas：

>>> import pandas as pd
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> pd.Series(a)[pd.Series(a).duplicated()].values
array([1, 3, 3])

- wellplayed

14

这里有一个简洁而巧妙的解决方案 -

for x in set(li):
    li.remove(x)

li = list(set(li))

- Nikhil Prabhu

原始列表已经丢失，但可以通过将内容复制到另一个列表来解决 - temp = li[:]。 - Nikhil Prabhu

6

在大型列表上删除元素是非常费时的，这是一个相当棘手的练习！ - F1Rumors

基本上，您需要删除列表中所有不是重复项的内容，最终只剩下重复项。 - lolololol ol

基本上，你要移除列表中所有不重复的元素，最后只剩下重复的元素。 - lolololol ol

13

如果您不想编写自己的算法或使用库，Python 3.8 的一行代码：

l = [1,2,3,2,1,5,6,5,5,5]

res = [(x, count) for x, g in groupby(sorted(l)) if (count := len(list(g))) > 1]

print(res)

打印物品和数量：

[(1, 2), (2, 2), (5, 4)]

groupby 接受一个分组函数，因此您可以以不同的方式定义您的分组，并根据需要返回附加的 Tuple 字段。

- yǝsʞǝla

1

懒惰与速度（吞吐量方面）无关，特别是在这种情况下，res 是一个渴望的列表。 - Henry Henrinson

5

这很好，但应该提到你需要 from itertools import groupby 来完成这个。 - Hansang

11

collections.Counter是Python 2.7中的新功能：


Python 2.5.4 (r254:67916, May 31 2010, 15:03:39) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
a = [1,2,3,2,1,5,6,5,5,5]
import collections
print [x for x, y in collections.Counter(a).items() if y > 1]
Type "help", "copyright", "credits" or "license" for more information.
  File "", line 1, in 
AttributeError: 'module' object has no attribute 'Counter'
>>>

在早期版本中，您可以使用传统字典来代替：

a = [1,2,3,2,1,5,6,5,5,5]
d = {}
for elem in a:
    if elem in d:
        d[elem] += 1
    else:
        d[elem] = 1

print [x for x, y in d.items() if y > 1]

- Edward

请注意，在Python 3中，您需要在print语句中使用括号。 - Mark Chackerian

10

我认为在列表中查找重复项最有效的方法是：

from collections import Counter

def duplicates(values):
    dups = Counter(values) - Counter(set(values))
    return list(dups.keys())

print(duplicates([1,2,3,6,5,2]))

它首先对所有元素使用一次Counter，然后再对所有唯一元素使用。将第一个减去第二个就可以去除重复项。

- Anand Chitipothu

我喜欢这个解决方案看起来多么优雅。 - Jay Joshi

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- georg · Accepted Answer

要去除重复项，使用set(a)。如果要打印重复项，可以使用类似以下的代码：

a = [1,2,3,2,1,5,6,5,5,5]

import collections
print([item for item, count in collections.Counter(a).items() if count > 1])

## [1, 2, 5]

请注意，Counter 不是特别高效的工具 (计时数据)，在这里可能过度使用了。使用 set 会更快。此代码按原始顺序计算唯一元素列表：

seen = set()
uniq = []
for x in a:
    if x not in seen:
        uniq.append(x)
        seen.add(x)

或者，更简洁地说：

seen = set()
uniq = [x for x in a if x not in seen and not seen.add(x)]

我不建议使用后一种方式，因为不清楚not seen.add(x)是在做什么（集合的add()方法总是返回None，因此需要使用not）。

计算列表中重复元素的方法，不使用任何库：

seen = set()
dupes = []

for x in a:
    if x in seen:
        dupes.append(x)
    else:
        seen.add(x)

或者更简洁地说：

seen = set()
dupes = [x for x in a if x in seen or seen.add(x)]

如果列表元素不可哈希，你无法使用集合/字典，只能使用二次时间复杂度的解决方案（逐个元素进行比较）。例如：

a = [[1], [2], [3], [1], [5], [3]]

no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
print no_dupes # [[1], [2], [3], [5]]

dupes = [x for n, x in enumerate(a) if x in a[:n]]
print dupes # [[1], [3]]