Python: Remove duplicates of a specific item from a list

Question

7

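I have a list of items, and I want to remove the duplicate occurrences of one particular item while keeping the duplicates of all other items. That is, I start with the following list: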

mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9]

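I want to remove any duplicate 0s but keep the duplicates of 1 and 9. My current solution is:

mylist = [i for i in mylist if i != 0]
mylist.append(0)

Is there a good way to keep a single 0, other than the following?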

for i in mylist:
    if mylist.count(0) > 1:
        mylist.remove(0)

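The second approach takes more than twice as long for this example.

Clarifications:

- Currently, I don't care about the order of items in the list, since I sort it after creating and cleaning it, but that might change later.
- Currently, I only need to remove duplicates of one specific item (0 in my example).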

- Cryn
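The timing gap between the two approaches in the question can be reproduced with a small sketch like the one below. The function names and the test data are made up for illustration. Note that the second function uses a `while` loop instead of the question's `for` loop: iterating over a list while removing from it can stop early, for example `[0, 0, 0, 0]` would be left as `[0, 0]`.

```python
import timeit

def filter_then_append(items):
    """First approach: drop every 0, then add a single 0 back at the end."""
    result = [i for i in items if i != 0]
    result.append(0)
    return result

def count_and_remove(items):
    """Second approach: remove 0s one at a time while more than one remains."""
    items = list(items)  # work on a copy so the caller's list is untouched
    while items.count(0) > 1:  # each count() call scans the whole list
        items.remove(0)        # each remove() call scans and shifts elements
    return items

data = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9] * 100

# Timings depend on the machine, so no expected values are given here.
print(timeit.timeit(lambda: filter_then_append(data), number=10))
print(timeit.timeit(lambda: count_and_remove(data), number=10))
```

The first function makes a single pass over the list, while the second rescans the list for every 0 it removes, which is why it falls further behind as the number of 0s grows.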

3

Does order matter in this list? - Daniel Pryden

3

What's wrong with your first solution? [0] + [i for i in mylist if i != 0] - Omar Einea

I think you may be worrying too much about tiny performance differences and should just pick one of your current solutions. - Alex Hall

1

Also, do you only need to remove duplicate zeros specifically, or do you need a solution that works for any other value? - Daniel Pryden

1

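What @DanielPryden is getting at is that one could write a different function that expects a sorted list and is faster than any other solution, especially if the item to remove is the smallest possible item in the list (which is likely for 0). - Alex Hall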


9 Answers

1

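If performance is an issue and you are willing to use a third-party library, use numpy.

The Python standard library is great in many ways. But it is not well suited to computation on numeric arrays.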

import numpy as np

mylist = np.array([4, 1, 2, 6, 1, 0, 9, 8, 0, 9])

mylist = np.delete(mylist, np.where(mylist == 0)[0][1:])

# array([4, 1, 2, 6, 1, 0, 9, 8, 9])

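Here the first argument of np.delete is the input array. The second argument finds the indices of all occurrences of 0 and keeps every index from the second occurrence onward, so those are the positions that get deleted.

Performance benchmarking

Tested on Python 3.6.2 / Numpy 1.13.1. Performance will vary by system and array.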

%timeit jp(myarr.copy())         # 183 µs
%timeit vui(mylist.copy())       # 393 µs
%timeit original(mylist.copy())  # 1.85 s

import numpy as np
from collections import Counter

myarr = np.array([4, 1, 2, 6, 1, 0, 9, 8, 0, 9] * 1000)
mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9] * 1000

def jp(myarr):
    return np.delete(myarr, np.where(myarr == 0)[0][1:])

def vui(mylist):
    return [0] + list(filter(None, mylist))

def original(mylist):
    for i in mylist:
        if mylist.count(0) > 1:
            mylist.remove(0)

    return mylist

- jpp

What are the specs of the computer you ran the benchmark on? - Nikhil Wagh

1

@NikhilWagh, I added the Python + Numpy versions. I have provided the code so you can test it yourself. Every machine will produce different results. - jpp

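This comparison is a bit unfair, because you assume the list is already a numpy array. If we change the input type, then the same should be done for the Counter-based approach. Otherwise, include constructing the array as part of the timeit test. - Daniel Pryden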

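@DanielPryden, there are still some assumptions in your case. In almost all cases, the upstream process (reading from csv, retrieving from a computation, etc.) can be optimized further by moving to numpy. The point is, if performance is the issue, consider using a library specifically designed for performance. Or move to C. - jpp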

1

It sounds like a better data structure for you to use would be collections.Counter (which is in the standard library):

import collections

counts = collections.Counter(mylist)
counts[0] = 1
mylist = list(counts.elements())

- Daniel Pryden

2

Better to set counts[0] = min(1, counts[0]); otherwise this code will insert a 0 into a list that does not contain any. - Aran-Fey
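The guard suggested in this comment could look like the following minimal sketch. The helper name `cap_to_one` is hypothetical. It relies on two documented Counter behaviours: a missing key reads as 0, and `elements()` skips any item whose count is less than one.

```python
from collections import Counter

def cap_to_one(items, target=0):
    """Return the items with `target` limited to at most one occurrence.

    Equal items come out grouped together, so the original order is lost.
    """
    counts = Counter(items)
    # min() keeps the count at 0 when the target never occurred,
    # so no target is invented for a list that had none.
    counts[target] = min(1, counts[target])
    return list(counts.elements())

print(sorted(cap_to_one([4, 1, 2, 6, 1, 0, 9, 8, 0, 9])))
print(cap_to_one([1, 2, 2]))
```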

This is a very clever idea, but (1) it is almost certainly slower than the OP's solutions, and (2) it does not preserve order. - Alex Hall

That's great if order doesn't matter. And it seems that order doesn't matter here. - Jean-François Fabre

@AlexHall: It is basically just a pigeonhole sort, which is O(N). Why would it be much slower than the other approaches? - Daniel Pryden

@DanielPryden Because it will execute more code in Python space than in built-in (probably C) space. - Alex Hall

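The real win with this approach comes if the OP can use a Counter instead of a list throughout the whole program. Using the right data structure will have a bigger impact on the program than any micro-optimization. - Daniel Pryden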

1

Here is a generator-based approach with roughly O(n) complexity that also preserves the order of the original list:

In [62]: def remove_dup(lst, item):
    ...:     temp = [item]
    ...:     for i in lst:
    ...:         if i != item:
    ...:             yield i
    ...:         elif i == item and temp:
    ...:             yield temp.pop()
    ...:             

In [63]: list(remove_dup(mylist, 0))
Out[63]: [4, 1, 2, 6, 1, 0, 9, 8, 9]

Also, if you are dealing with larger lists, you can use the following vectorized and optimized approach with Numpy:

In [80]: arr = np.array([4, 1, 2, 6, 1, 0, 9, 8, 0, 9])

In [81]: mask = arr == 0

In [82]: first_ind = np.where(mask)[0][0]

In [83]: mask[first_ind] = False

In [84]: arr[~mask]
Out[84]: array([4, 1, 2, 6, 1, 0, 9, 8, 9])

- Mazdak

Why use temp.pop()? Why not just use a boolean local variable? - Daniel Pryden

@DanielPryden Because it is just one item, and it keeps the code clean. It also has no noticeable impact on performance. - Mazdak
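For comparison, the boolean-flag variant raised in the comment above might look like this sketch. The name `remove_dup_flag` is hypothetical, and this is one possible reading of the suggestion rather than code from either commenter.

```python
def remove_dup_flag(lst, item):
    """Yield every element of lst, but yield `item` only the first time it appears."""
    seen = False  # becomes True once the first `item` has been yielded
    for value in lst:
        if value != item:
            yield value
        elif not seen:
            seen = True
            yield value

print(list(remove_dup_flag([4, 1, 2, 6, 1, 0, 9, 8, 0, 9], 0)))
# [4, 1, 2, 6, 1, 0, 9, 8, 9]
```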

1

Slicing should work.

a[start:end] # items start through end-1
a[start:]    # items start through the rest of the list
a[:end]      # items from the beginning through end-1
a[:]         # a copy of the whole list

Input:

mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9, 0, 0, 9, 2, 2]
pos = mylist.index(0)
nl = mylist[:pos + 1] + [i for i in mylist[pos + 1:] if i != 0]

print(nl)

Output: [4, 1, 2, 6, 1, 0, 9, 8, 9, 9, 2, 2]

- Ajay
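One thing to watch with the slicing approach: `list.index` raises `ValueError` when the value is not in the list. A minimal sketch that guards against this, using a hypothetical helper name, could be:

```python
def keep_first_by_slicing(items, target=0):
    """Keep everything up to and including the first target, then drop later targets."""
    try:
        pos = items.index(target)
    except ValueError:
        # The target never occurs, so there is nothing to deduplicate.
        return list(items)
    head = items[:pos + 1]
    tail = [i for i in items[pos + 1:] if i != target]
    return head + tail

print(keep_first_by_slicing([4, 1, 2, 6, 1, 0, 9, 8, 0, 9, 0, 0, 9, 2, 2]))
# [4, 1, 2, 6, 1, 0, 9, 8, 9, 9, 2, 2]
```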

1

You can use this:

desired_value = 0
mylist = [i for i in mylist if i!=desired_value] + [desired_value]

Now you can change the desired value, or turn it into a list as follows.

desired_value = [0, 6]
mylist = [i for i in mylist if i not in desired_value] + desired_value

- Mehrdad Pedramfar
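The list version above appends every desired value even when it never occurred in the input. A sketch that appends only the values actually present is shown below. The name `dedupe_to_end` is hypothetical, and the set lookups assume the items are hashable.

```python
def dedupe_to_end(items, targets):
    """Remove all occurrences of the targets, then append one of each target that was present."""
    targets = list(targets)
    target_set = set(targets)            # set membership tests are O(1) on average
    kept = [i for i in items if i not in target_set]
    present = target_set & set(items)    # targets that really occurred in the input
    return kept + [t for t in targets if t in present]

print(dedupe_to_end([4, 1, 2, 6, 1, 0, 9, 8, 0, 9], [0, 6]))
# [4, 1, 2, 1, 9, 8, 9, 0, 6]
print(dedupe_to_end([4, 1, 2, 6, 1, 0, 9, 8, 0, 9], [0, 7]))
# [4, 1, 2, 6, 1, 9, 8, 9, 0]
```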

0

Maybe you could use a filter.

[0] + list(filter(lambda x: x != 0, mylist))

- Florian Vuillemot

2

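You should always prefer a list comprehension over filter + lambda. A list comprehension is shorter, clearer, and usually faster. - Daniel Pryden

filter(None,mylist) is better. - Jean-François Fabre

@DanielPryden Thanks. - Florian Vuillemot

@Jean-FrançoisFabre Of course! This is only an example here, it could just as well be "42" ;-) - Florian Vuillemot

1

filter(None,x) is the one case where you don't need a lambda or any other function. It keeps only the "truthy" values. - Jean-François Fabre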
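Because filter(None, ...) keeps only truthy values, it drops more than just 0 when the list holds mixed data. The sample values below are made up to illustrate the difference from an explicit `!= 0` test:

```python
values = [4, 0, 1, None, "", 2, 0.0, [], 9]

# filter(None, ...) drops every falsy value: 0, None, "", 0.0 and [].
truthy_only = list(filter(None, values))
print(truthy_only)   # [4, 1, 2, 9]

# An explicit comparison drops only values equal to 0 (which includes 0.0).
nonzero_only = [v for v in values if v != 0]
print(nonzero_only)  # [4, 1, None, '', 2, [], 9]
```

For a list that contains only integers, as in the question, the two give the same result.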

0

You can use an itertools.count counter, which returns 0, 1, and so on each time next() is called on it:

from itertools import count

mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9]

counter = count()

# next(counter) will be called each time i == 0
# it will return 0 the first time, so only the first time
# will 'not next(counter)' be True
out = [i for i in mylist if i != 0 or not next(counter)]
print(out)

# [4, 1, 2, 6, 1, 0, 9, 8, 9]

Order is preserved, and it can easily be modified to deduplicate any number of values:

from itertools import count

mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9]

items_to_dedup = {1, 0}
counter = {item: count() for item in items_to_dedup}

out = [i for i in mylist if i not in items_to_dedup or not next(counter[i])]
print(out)

# [4, 1, 2, 6, 0, 9, 8, 9]

- Thierry Lathuille
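An alternative sketch of the same order-preserving idea uses a set of already-seen targets instead of one counter per value. The function name is hypothetical, and the items are assumed to be hashable.

```python
def dedupe_selected(items, targets):
    """Keep every item, except repeat occurrences of items listed in targets."""
    targets = set(targets)
    seen = set()   # targets that have already been emitted once
    out = []
    for item in items:
        if item in targets:
            if item in seen:
                continue   # a repeat of a target: skip it
            seen.add(item)
        out.append(item)
    return out

print(dedupe_selected([4, 1, 2, 6, 1, 0, 9, 8, 0, 9], {1, 0}))
# [4, 1, 2, 6, 0, 9, 8, 9]
```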

-1

Here is a one-liner for it, where m is the number that should appear only once, and order is preserved:

[x for i,x in enumerate(mylist) if mylist.index(x)==i or x!=m]

Result

[4, 1, 2, 6, 1, 0, 9, 8, 9]

- Yasin Yousif

1

This approach is very inefficient. It calls list.index, which has O(n) complexity, for every item, plus two condition checks! - Mazdak
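One way to address the cost raised in this comment is to look up the first position of m once, outside the comprehension. A minimal sketch, with a hypothetical function name:

```python
def keep_first_occurrence(mylist, m):
    """Keep all items, but keep m only at its first position."""
    # Two linear scans up front instead of one index() call per element.
    first = mylist.index(m) if m in mylist else -1
    return [x for i, x in enumerate(mylist) if x != m or i == first]

print(keep_first_occurrence([4, 1, 2, 6, 1, 0, 9, 8, 0, 9], 0))
# [4, 1, 2, 6, 1, 0, 9, 8, 9]
```

This keeps the shape of the original one-liner while making the total work linear in the length of the list.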


- Jean-François Fabre · Accepted Answer

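The solution:

[0] + [i for i in mylist if i]

looks fine, except when mylist contains no 0, in which case you would wrongly add a 0.

Also, adding 2 lists like this is not great performance-wise. I would do this instead: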

newlist = [i for i in mylist if i]
if len(newlist) != len(mylist):  # 0 was removed, add it back
    newlist.append(0)

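Using append to add an element at the end of a list is very efficient, because list objects use preallocation and most of the time no memory needs to be copied. The length test is O(1) and avoids having to test whether 0 is in mylist. Alternatively, you can use filter: newlist = list(filter(None,mylist)), which may be slightly faster because it does not use a native Python loop.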
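The length-test idea can be wrapped in a small function that works for any target value, not only 0. This is a sketch with a hypothetical name and signature, and it compares with `!=` rather than relying on truthiness so that it also works for non-zero targets.

```python
def keep_one(items, target=0):
    """Drop every occurrence of target, then append one back if any were dropped."""
    newlist = [i for i in items if i != target]
    # len() is O(1); a length difference means at least one target was removed.
    if len(newlist) != len(items):
        newlist.append(target)
    return newlist

print(keep_one([4, 1, 2, 6, 1, 0, 9, 8, 0, 9]))
# [4, 1, 2, 6, 1, 9, 8, 9, 0]
print(keep_one([1, 2, 3]))
# [1, 2, 3]
```

As in the answer above, the surviving target ends up at the end of the list, so this suits the case where order does not matter or the list is sorted afterwards.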