寻找最长重叠范围

34

我有一个列表中的范围如下:

ranges = [(1, 50), (49, 70), (75, 85), (84, 88), (87, 92)]

我想找到这些区间中能够构成最长范围的区间 (当它们彼此重叠时)

[(1, 70), (75, 92)]

我有一个解决方案,但它过于复杂,我相信一定有更简单的方法来解决这个问题。

我的解决方案:

def overlap(x, y):
    return range(max(x[0], y[0]), min(x[-1], y[-1]) + 1)

ranges = [(1, 50), (49, 70), (75, 85), (84, 88), (87, 92)]

beg, end = min([x[0] for x in ranges]), 0
for i in ranges:
    if i[0] == beg:
        end = i[1]
while beg:
    for _ in ranges:
        for i in ranges:
            if i[1] > end and overlap(i, [beg, end]):
                end = i[1]
    print(beg, end)
    try:
        beg = min([x[0] for x in ranges if x[0] > end])
        for i in ranges:
            if i[0] == beg:
                end = i[1]
    except ValueError:
        beg = None

输出:

1 70
75 92

这不是一个与基因组坐标相关的问题(我看到你是生物信息学家)吗?如果你不需要将该函数嵌入到你自己的代码中,为什么不检查一下可用的工具,例如bedtools呢? - PlasmaBinturong
3
你的范围在端点处是开放还是闭合的? - Patrick Haugh
@PlasmaBinturong 我需要这段代码来解决一个非常简单的问题。我想要表示蛋白质序列中具有PDB结构的位置。由于存在许多重叠,我需要这样的代码。 - Gábor Erdős
@PatrickHaugh 他们已经关闭了。 - Gábor Erdős
@Georgy 那里被接受的答案比这里的任何答案都要好。 - JollyJoker
9个回答

13

我认为您可以按照范围的起始点对输入进行排序,然后遍历它们。在每个项目处,如果其起始点小于当前范围的结束点,则将其添加到当前范围中;否则,我们会输出当前范围并开始累积一个新范围:

def overlaps(ranges):
    ranges = sorted(ranges)  # If our inputs are garunteed sorted, we can skip this
    it = iter(ranges)
    try:
        curr_start, curr_stop = next(it)
        # overlaps = False  # If we want to exclude output ranges not produced by overlapping input ranges
    except StopIteration:
        return
    for start, stop in it:
        if curr_start <= start <= curr_stop:  # Assumes intervals are closed
            curr_stop = max(curr_stop, stop)
            # overlaps = True
        else:
            # if overlaps:
            yield curr_start, curr_stop
            curr_start, curr_stop = start, stop
            # overlaps = False
    # if overlaps:
    yield curr_start, curr_stop

print(list(overlaps([(1, 50), (49, 70), (75, 85), (84, 88), (87, 92)])))
# [(1, 70), (75, 92)]

print(list(overlaps([(20, 30), (5, 10), (1, 7), (12, 21)])))
# [(1, 10), (12, 30)]

我喜欢这个代码的易读性。不过,你需要特别处理第一个和最后一个范围吧?目前这个解决方案会输出它们,即使它们没有重叠,对吗? - PlasmaBinturong
@PlasmaBinturong 我错过了那个要求。我认为最简单的方法是在创建新的当前范围时添加一个布尔值 overlaps = False。如果范围重叠,我们将其更新为 True,并在生成范围之前检查它(如果 overlaps 为 false,则悄悄地丢弃它)。 - Patrick Haugh
同意。我几分钟前写了一个类似的答案,没有看到你的,对我的疏忽感到抱歉。 - PlasmaBinturong

9

您可以使用zip将每个范围对的起始值和结束值分组。如果开始值低于前一个结束值,则存在重叠,因此删除该开始和结束值。我们使用int来跟踪我们在查找每个低位和高位列表中的哪个索引,其中低位索引始终比高位索引高一个。


#split the numbers in to the low and high part of each range
#and set the index position for each of them
ranges = [(1, 50), (49, 70), (75, 85), (84, 88), (87, 92)]
low, high = [list(nums) for nums in zip(*ranges)] 
l, h = 1, 0

#Iterate over the ranges and remove when there is an overlap if no over lap move the pointers
while l < len(low) and h < len(high):
    if low[l] < high[h]:
        del low[l]
        del high[h]
    else:
        l +=1
        h +=1

#zip the low and high back into ranges
new_ranges = list(zip(low, high))
print(new_ranges)

输出

[(1, 70), (75, 92)]

6

可以使用functools.reduce实现:

from functools import reduce

ranges = [(1, 50), (45, 47), (49, 70), (75, 85), (84, 88), (87, 92)]

reducer = (
    lambda acc, el: acc[:-1:] + [(min(*acc[-1], *el), max(*acc[-1], *el))]
    if acc[-1][1] > el[0]
    else acc + [el]
)
print(reduce(reducer, ranges[1::], [ranges[0]]))

提供:

[(1, 70), (75, 92)]

难以用言语表达,但它使用reduce来遍历范围。如果范围中的最后一个元组与下一个提供的重叠(if acc[-1][1] > el[0]),则从两个元组的(min, max)创建一个新的范围,然后用这个新合并的范围替换它前面的内容(acc[:-1:] + [(min, max)]),否则只需将新的范围添加到末尾 (acc + [el])。

编辑:在查看其他答案后,更新为比较两个范围的最小/最大值,而不仅仅是第一个和最后一个。


4

我建议您仅迭代一次范围,但将当前正在扩展的范围存储在内存中,如下所示:

def overlaps(r1, r2):
    assert r1[0] <= r2[0], "Assume ranges sorted by first coordinate"
    return (r2[0] <= r1[1] <= r2[1]) or (r1[0] <= r2[0] <= r1[1])

ranges = [(1, 50), (45, 47), (49, 70), (75, 85), (84, 88), (87, 92)]


def fuse_ranges(ranges):
    output_ranges = []
    curr_r = list(ranges[0])
    curr_overlap = False  # Is the current range already overlapping?

    # Assuming it is sorted by starting coordinate.
    for r in ranges[1:]:
        if overlaps(curr_r, r):
            curr_overlap = True
            curr_r[1] = max(curr_r[1], r[1])  # Extend the end of the current range.
        else:
            if curr_overlap:
                output_ranges.append(curr_r)
                curr_overlap = False
            curr_r = list(r)
    if curr_overlap:
        output_ranges.append(curr_r)

    return output_ranges


if __name__ == '__main__':
    print(fuse_ranges(sorted(ranges, key=lambda r: r[0])))

该输出结果为:

[[1, 70], [75, 92]]

我不确定我的解决方案能否比你的更简洁...


3
你可以使用collections包中的Counter容器,然后使用itertools获取Counter对象的组合,并对其执行集合操作。
ranges = [(1, 50), (49, 70), (75, 85), (84, 88), (87, 92)]
import collections, itertools
import numpy as np

out = []
for range in ranges:
    data = np.arange(range[0], range[1]+1)
    out.append(collections.Counter(data))

for x,y in list(itertools.combinations(out, 2)): # combinations of two
    if x & y: # if they overlap
        print(x | y) # get their union

使用这个将会有接近您所需的结果:

Counter({1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1, 32: 1, 33: 1, 34: 1, 35: 1, 36: 1, 37: 1, 38: 1, 39: 1, 40: 1, 41: 1, 42: 1, 43: 1, 44: 1, 45: 1, 46: 1, 47: 1, 48: 1, 49: 1, 50: 1, 51: 1, 52: 1, 53: 1, 54: 1, 55: 1, 56: 1, 57: 1, 58: 1, 59: 1, 60: 1, 61: 1, 62: 1, 63: 1, 64: 1, 65: 1, 66: 1, 67: 1, 68: 1, 69: 1, 70: 1})
Counter({75: 1, 76: 1, 77: 1, 78: 1, 79: 1, 80: 1, 81: 1, 82: 1, 83: 1, 84: 1, 85: 1, 86: 1, 87: 1, 88: 1})
Counter({84: 1, 85: 1, 86: 1, 87: 1, 88: 1, 89: 1, 90: 1, 91: 1, 92: 1})

如果您对多个层面都这样做,就会得到想要的超集。您可以在这里找到更多有关如何使用Counter的信息

3

问题:查找区间中最长的重叠范围

ranges1 = [(1, 50), (49, 70), (75, 85), (84, 88), (87, 92)]
ranges2 = [(1, 50), (40,45), (49, 70)]


def get_overlapping(ranges):
    result = []
    start = 0
    end = ranges[0][1]

    for i, node in enumerate(ranges[1:], 1):
        if end > node[0]:
            if end < node[1]:
                end = node[1]
            continue

        result.append((start, i - 1))
        end = node[1]
        start = i

    else:
        result.append((start, i))
    return result

使用方法:

for _range in [ranges1, ranges2]:
    result = get_overlapping(_range)
    for o in result:
        start, end = _range[o[0]], _range[o[1]]
        print(start[0], end[1])
    print()

输出:

1 70
75 92

1 70

如果你有以下范围:[(1, 50), (40,45), (49, 70)],人们期望输出的是"1 70",但由于你的代码行if ranges[i - 1][1] > node[0],它将无法给出正确的合并范围。 - PlasmaBinturong
@PlasmaBinturong:“人们会期望输出"1 70"”,为什么,45),(49,`没有重叠? - stovfl
也许楼主可以澄清一下,但我理解他想要合并重叠的序列。在我提出的列表中,(1,50)(40,45)(49,70) 都有重叠。 - PlasmaBinturong
@PlasmaBinturong 明白了,所以应该跳过 (40, 45) - stovfl

3

以下是一个简单的迭代函数:

def merge_range(rng):
    starts, ends = [], []   
    for i, (x, y) in enumerate(rng):
        if i > 0:
            if x<= ends[-1]:
                ends[-1] = y
                continue
        starts.append(x)
        ends.append(y)
    return list(zip(starts, ends))

输出:

merge_range([(1, 50), (49, 70), (75, 85), (84, 88), (87, 92)])
# [(1, 70), (75, 92)]

merge_range([(1, 50), (49, 70), (75, 85), (84, 88), (87, 92), (99, 102), (105, 111), (150, 155), (152, 160), (154, 180)])
# [(1, 70), (75, 92), (99, 102), (105, 111), (150, 180)]

2

使用集合来消除重复项,使用排序列表来迭代,下面的代码应该可以工作。

代码:

ranges = [(1, 50), (49, 70), (75, 85), (84, 88), (87, 92)]

all_nums = sorted(list(set(x for r in ranges for x in range(r[0], r[1]))))

i = all_nums[0]
print(i, end=' ')
while i < all_nums[-1]:
    if i not in all_nums:
        print(i)
        i = all_nums[all_nums.index(i-1) + 1]
        print(i, end = ' ')
    i += 1
print(i+1)

输出:

ranges = [(1, 50), (49, 70), (75, 85), (84, 88), (87, 92)]

1 70
75 92


ranges = [(1, 50), (55, 70), (75, 82), (84, 88), (87, 92)]

1 50
55 70
75 82
84 92

2

大多数已发布的答案使用循环。您是否考虑过使用递归解决方案

def merge(ranges):
  """Given a sorted list of range tuples `(a, b)` merge overlapping ranges."""

  if not(ranges):
     return [];

  if len(ranges) == 1:
    return ranges;

  a, b = ranges[0];
  c, d = ranges[1];

  # eg.: [(1, 10), (20, 30), rest]
  if b < c:
    return [(a,b)] + merge(ranges[1:]);

  # examples: [(1, 5), (2, 3), rest],
  #           [(1, 5), (2, 10), rest]
  return merge([(a, max(b, d))] + ranges[2:]);

例子

>>> merge([(1, 50), (49, 70), (75, 85), (84, 88), (87, 92)])
[(1, 70), (75, 92)]
>>> merge([(1,10), (2,3), (2,3), (8,12)])
[(1, 12)]
>>> merge (sorted([(2,5),(1,3)], key = lambda x: x[0]))
[(1, 5)]

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接