How can I efficiently remove consecutive pair duplicates in Python?

Question


3

I have a bunch of long lists (millions of elements each) that contain time and temperature values ([time, temperature]). The lists look like this:

mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75], [7, 79], [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]

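What I want to do is remove consecutive pair duplicates. If a pair of temperatures is repeated back to back, remove the repeats and keep only one copy of the pair.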

That may be a bit obscure, so here is an example using mylist: mylist[0] and mylist[1] form an adjacent pair. Likewise, mylist[1] and mylist[2] form an adjacent pair.

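Now look at the temperature values in mylist. From mylist[0] to mylist[11] they are: 72 75 74 75 74 75 79 71 79 71 75 74. In these values, the consecutive pairs 75 74 and 79 71 are each repeated. I want to keep one copy of each pair and remove the repeats. So the output I want is: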

output = [[1, 72], [2, 75], [3, 74], [6, 75], [7, 79], [8, 71], [11, 75], [12, 74]]

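Note: the elements [11, 75] and [12, 74] are kept because, although they also contain the 75 74 pattern, they are not a consecutive repeat, unlike the ones near the start of the list.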

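To try to solve this, I searched around and tried a lot of things. The closest I got was a for loop that checks an element against the previous one (index-1), then checks index-2 and index-3, and deletes two elements if they turn out to be a temperature repeat. I then repeated this in the forward direction (index+1). This sort of worked, but it got very messy, was very slow, and turned my computer into a portable heater. So I'm wondering whether anyone knows how to remove these consecutive duplicate pairs quickly and efficiently.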

- George Orwell

Can the pattern be longer than 2? That is, can [72, 75, 74] be a pattern? - Gilseung Ahn

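@GilseungAhn Hi, thanks for the reply! The pattern should only have length 2. That is because the temperature often oscillates between two points, and I want to remove those oscillations to make the data files smaller. Does that help? - George Orwell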

3 Answers

2

Using collections.deque:

from collections import deque

mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75], [7, 79], [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]

def generate(lst):
    d = deque(maxlen=4)
    for v in lst:
        d.append(v)
        if len(d)==4:
            if d[0][1] == d[2][1] and d[1][1] == d[3][1]:
                d.pop()
                d.pop()
            else:
                yield d.popleft()

    yield from d # yield the rest of deque


out = [*generate(mylist)]
print(out)

Output:

[[1, 72], [2, 75], [3, 74], [6, 75], [7, 79], [8, 71], [11, 75], [12, 74]]
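
Because generate is a generator, it accepts any iterable, so the rows do not have to sit in a list first. The sketch below is only an illustration under that assumption: the readings helper and its sample temperatures are made up here, and generate is repeated from the answer so the block runs on its own. It also shows what happens when the same pair occurs three times in a row.

```python
from collections import deque


def generate(lst):
    # Same function as in the answer above, repeated so this block is self-contained.
    d = deque(maxlen=4)
    for v in lst:
        d.append(v)
        if len(d) == 4:
            if d[0][1] == d[2][1] and d[1][1] == d[3][1]:
                # The last two rows repeat the previous pair: drop them.
                d.pop()
                d.pop()
            else:
                # The oldest row can no longer start a repeated pair: emit it.
                yield d.popleft()
    yield from d  # emit whatever is left in the window


def readings():
    # Hypothetical lazy source of [time, temperature] rows (illustrative data).
    temps = [72, 75, 74, 75, 74, 75, 74, 80]
    for t, temp in enumerate(temps, start=1):
        yield [t, temp]


compact = list(generate(readings()))
print(compact)  # [[1, 72], [2, 75], [3, 74], [8, 80]]
```

Here the pair 75 74 appears three times back to back and only the first copy survives, because the two rows kept in the window are compared against each new pair as it arrives.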

Benchmark (using 10_000_000 elements):

import random
from timeit import timeit

mylist = []
for i in range(10_000_000):
    mylist.append([i, random.randint(50, 100)])

def f1():
    return [*generate(mylist)]

t1 = timeit(lambda: f1(), number=1)
print(t1)

On my machine (AMD 2400G, Python 3.8), this prints:

3.2782217629719526

- Andrej Kesely

1

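+1 for a great answer! Thanks again for your help, Andrej. Both your answer and Srikant's seem to work very well on my data, and it is great that it doesn't use any third-party libraries. I wish I could mark both as "accepted". I marked Srikant's as "best" only because he has fewer points; I hope you don't mind. Thanks again for your answer; it is an excellent solution. - George Orwell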

1

Using collections.Counter and numpy.

Try this code.

import numpy as np
from collections import Counter

def remove_consecutive_pair_duplicate(L):
    temperature = np.array(L, dtype = str)[:, 1]
    l = 2 # length of pattern       
    pattern_with_length_l = Counter('-'.join(temperature[i:i+l]) for i in range(len(temperature) - l))

    set_of_patterns = []
    for (key, val) in pattern_with_length_l.items():
        left, right = key.split('-')        
        if val >= 2 and right + '-' + left not in set_of_patterns:
            set_of_patterns.append(key)

    removed_index = []
    for pattern in set_of_patterns:
        matched_index = [[i, i+1] for i in range(len(temperature) - l) if '-'.join(temperature[i:i+2]) == pattern]
        for ind in matched_index[1:]:
            removed_index.append(ind[0])
            removed_index.append(ind[1])

    survived_ind = list(set(list(range(len(L)))) - set(removed_index))
    return np.array(L)[survived_ind].tolist()

print(remove_consecutive_pair_duplicate(mylist))

The result is as follows.

[[1, 72], [2, 75], [3, 74], [6, 75], [7, 79], [8, 71], [11, 75], [12, 74]]

- Gilseung Ahn

+1 Thanks for your answer! It looks like Andrej's solution is the fastest, though. Still a great answer! - George Orwell


- Srikant · Accepted Answer

Assuming I have understood the requirement correctly, the code below does the job.

mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75], [7, 79], [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]

n = len(mylist)
index = 0
output_list = []

# We need at least four elements to check if there is a duplicate pair.
while index + 4 <= n:
    sub_list = mylist[index: index + 4]

    if sub_list[0][1] == sub_list[2][1] and sub_list[1][1] == sub_list[3][1]:
        print('Duplicate found')
        # Drop the second one.
        output_list.append(sub_list[0])
        output_list.append(sub_list[1])
        index += 4
    else:
        # Keep only the first element: a duplicate pair may still start at the next element.
        output_list.append(sub_list[0])
        index += 1

# Append the remaining elements if any exist.
output_list.extend(mylist[index:])


print(output_list)
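
Since the question mentions many such lists, the same loop could be wrapped in a function and checked against the expected output from the question. This is only a sketch: the name drop_repeated_pairs is hypothetical, and the logic is meant to mirror the loop above without the debug print.

```python
def drop_repeated_pairs(rows):
    """Hypothetical wrapper around the loop above; rows are [time, temperature]."""
    n = len(rows)
    index = 0
    output = []
    # At least four rows are needed to compare one pair with the next.
    while index + 4 <= n:
        a, b, c, d = rows[index:index + 4]
        if a[1] == c[1] and b[1] == d[1]:
            # Keep the first pair and skip the repeated one.
            output.extend([a, b])
            index += 4
        else:
            # Keep one row and slide the window forward by one.
            output.append(a)
            index += 1
    # Fewer than four rows remain: keep them unchanged.
    output.extend(rows[index:])
    return output


mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75], [7, 79],
          [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]
expected = [[1, 72], [2, 75], [3, 74], [6, 75], [7, 79], [8, 71], [11, 75], [12, 74]]

result = drop_repeated_pairs(mylist)
print(result == expected)  # True
```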