找到排序整数列表中发生变化的索引

Question

找到排序整数列表中发生变化的索引

6

假设有一个如下所示的已排序整数列表：

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
# [1, 1, 1, 4, 4, 4, 4, 4, 5, 5, 9, 9, 9]

我想要找到数值发生变化的位置，即

[3, 8, 10, 13]

一种方法是使用itertools.groupby:

cursor = 0
result = []
for key, group in groupby(data):
    cursor += sum(1 for _ in group)
    result.append(cursor)
print(result)

输出

[3, 8, 10, 13]

这种方法的时间复杂度为 O(n)。另一种可能的方法是使用 bisect.bisect_left：

cursor = 0
result = []
while cursor < len(data):
    cursor = bisect_left(data, data[cursor] + 1, cursor, len(data))
    result.append(cursor)
print(result)

输出

[3, 8, 10, 13]

这种方法的时间复杂度为O(k*log n)，其中k是不同元素的数量。该方法的一种变体是使用指数搜索算法。

还有更快或更高效的方法吗？

- Dani Mesejo

3

你需要哪种更高效的算法？如果你寻找渐进复杂度，我认为你已经找到了最优解决方案。在O(n)和O(k*log n)之间的权衡取决于k，这可能是您事先不知道的。但如果你要寻找实践中最快的算法，我预计一个简单的for循环检查元素与之前的元素是否相同将是最快的，除非k远小于n。但你需要进行基准测试来确保。 - Vincent van der Weele

1

在这种情况下，我预计指数搜索和二分法会是最快的。不知何故，指数搜索感觉更快，因为你只向数组前进，但我认为那只是我的直觉在作怪。二分法可能更容易实现（因为你已经有它了，而且只需要几行代码），所以我会选择它。但我很想看到有人运行基准测试来确保。 - Vincent van der Weele

@trincot 抱歉，我表达不够准确，但是13应该是输出的一部分。 - Dani Mesejo

我正在使用这个子程序作为更大算法的一部分，但是输出中可能包含0。 - Dani Mesejo

1

我认为bisect（data，data [cursor]，cursor）也可以并且更好（代码更少，不需要值为整数）。 - no comment

显示剩余3条评论

3个回答

4

我在两组数据上测试了你的方法的执行时间，并使用了numpy添加了第三组数据。

data1 = [1] * 30000000 + [2] * 30000000 + [4] * 50000000 + [5] * 20000000 + [7] * 40000000 + [9] * 30000000 + [11] * 10000000 + [15] * 30000000
data2 = list(range(10000000))

cursor = 0
result = []
start_time = time.time()
for key, group in groupby(data):
    cursor += sum(1 for _ in group)
    result.append(cursor)
print(f'groupby {time.time() - start_time} seconds')

cursor = 0
result = []
start_time = time.time()
while cursor < len(data):
    cursor = bisect_left(data, data[cursor] + 1, cursor, len(data))
    result.append(cursor)
print(f'bisect_left {time.time() - start_time} seconds')

data = np.array(data)
start_time = time.time()
[i + 1 for i in np.where(data[:-1] != data[1:])[0]] + [len(data)]
print(f'numpy {time.time() - start_time} seconds')

# We need to iterate over the results array to add 1 to each index for your expected results.

使用data1

groupby 8.864859104156494 seconds
bisect_left 0.0 seconds
numpy 0.27180027961730957 seconds

使用data2

groupby 3.602466583251953 seconds
bisect_left 5.440978765487671 seconds
numpy 2.2847368717193604 seconds

如你所述，bisect_left 非常依赖于唯一元素的数量，但是似乎使用 numpy 比 itertools.groupby 更具性能优势，即使在索引列表上进行额外迭代。

- Guy

你的numpy时间处理有些奇怪，好像你无法确定是否在使用numpy。我建议要么将data = np.array(data)包含在计时中，要么将索引列表推导式移出（根据问题陈述应该是前者，但后者也可以作为补充）。 - no comment

2

既然您说“我更感兴趣的是运行时”，这里有一些更快的替代方案，可以替换您的groupby解决方案中的cursor += sum(1 for _ in group)。

使用@Guy的data1，但所有段长度都除以10：

             normal  optimized
original     870 ms  871 ms
zip_count    612 ms  611 ms
count_of     344 ms  345 ms
list_index   387 ms  386 ms
length_hint  223 ms  222 ms

使用list(range(1000000))代替：

             normal  optimized
original     385 ms  331 ms
zip_count    463 ms  401 ms
count_of     197 ms  165 ms
list_index   175 ms  143 ms
length_hint  226 ms  127 ms

优化版本使用更多的本地变量或列表推导。

我认为即使对于列表迭代器，__length_hint__也不能保证是精确的，但它似乎是（通过我的正确性检查），我不明白为什么不会这样。

以下是需要翻译的内容（请保留HTML标签）：

这段代码 (在线尝试)，但您需要缩小一些内容以不超过时间限制：

from timeit import default_timer as timer
from itertools import groupby, count
from collections import deque
from operator import countOf

def original(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        cursor += sum(1 for _ in group)
        result.append(cursor)
    return result

def original_opti(data):
    cursor = 0
    sum_ = sum
    return [cursor := cursor + sum_(1 for _ in group)
            for _, group in groupby(data)]

def zip_count(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        index = count()
        deque(zip(group, index), 0)
        cursor += next(index)
        result.append(cursor)
    return result

def zip_count_opti(data):
    cursor = 0
    result = []
    append = result.append
    count_, deque_, zip_, next_ = count, deque, zip, next
    for key, group in groupby(data):
        index = count_()
        deque_(zip_(group, index), 0)
        cursor += next_(index)
        append(cursor)
    return result

def count_of(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        cursor += countOf(group, key)
        result.append(cursor)
    return result

def count_of_opti(data):
    cursor = 0
    countOf_ = countOf
    result = [cursor := cursor + countOf_(group, key)
              for key, group in groupby(data)]
    return result

def list_index(data):
    cursor = 0
    result = []
    for key, _ in groupby(data):
        cursor = data.index(key, cursor)
        result.append(cursor)
    del result[0]
    result.append(len(data))
    return result

def list_index_opti(data):
    cursor = 0
    index = data.index
    groups = groupby(data)
    next(groups, None)
    result = [cursor := index(key, cursor)
              for key, _ in groups]
    result.append(len(data))
    return result

def length_hint(data):
    result = []
    it = iter(data)
    for _ in groupby(it):
        result.append(len(data) - (1 + it.__length_hint__()))
    del result[0]
    result.append(len(data))
    return result

def length_hint_opti(data):
    it = iter(data)
    groups = groupby(it)
    next(groups, None)
    n_minus_1 = len(data) - 1
    length_hint = it.__length_hint__
    result = [n_minus_1 - length_hint()
              for _ in groups]
    result.append(len(data))
    return result

funcss = [
    (original, original_opti),
    (zip_count, zip_count_opti),
    (count_of, count_of_opti),
    (list_index, list_index_opti),
    (length_hint, length_hint_opti),
]

data1 = [1] * 3 + [2] * 3 + [4] * 5 + [5] * 2 + [7] * 4 + [9] * 3 + [11] * 1 + [15] * 3
data1 = [x for x in data1 for _ in range(1000000)]
data2 = list(range(1000000))

for _ in range(3):
    for name in 'data1', 'data2':
        print(name)
        data = eval(name)
        expect = None
        for funcs in funcss:
            print(f'{funcs[0].__name__:11}', end='')
            for func in funcs:
                times = []
                for _ in range(5):
                    start_time = timer()
                    result = func(data)
                    end_time = timer()
                    times.append(end_time - start_time)
                print(f'{round(min(times) * 1e3):5d} ms', end='')
                if expect is None:
                    expect = result
                else:
                    assert result == expect
            print()
        print()

- no comment

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- trincot · Accepted Answer

当谈到渐近复杂度时，我认为在二分查找中应用更加均匀的分治方法可以稍微改进平均性能: 首先试图定位更靠近输入列表中心发生的值变化，从而将范围分成大约两半，这将减少下一次二分搜索操作路径约一半。

然而，由于这是Python语言，可能不会有明显的收益，因为存在Python代码开销（如yield，yield from，递归等）。在您使用的列表大小方面，它甚至可能表现更差:

from bisect import bisect_left

def locate(data, start, end):
    if start >= end or data[start] == data[end - 1]:
        return
    mid = (start + end) // 2
    val = data[mid] 
    if val == data[start]:
        start = mid
        val += 1
    i = bisect_left(data, val, start + 1, end)
    yield from locate(data, start, i)
    yield i
    yield from locate(data, i, end)

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
print(*locate(data, 0, len(data)))  # 3 8 10

请注意，这仅输出有效的索引，因此对于此示例输入，不包括13。