Finding common sequences in lists

Question

6

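I have a dictionary of lists, where each list is a sequence of numbers. No two lists are identical, but two or more lists may start with the same sequence of numbers (see the example input below). What I want to do is find these common sequences and make them new entries in the dictionary.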

Example input:

sequences = {
    18: [1, 3, 5, 6, 8, 12, 15, 17, 18],
    19: [1, 3, 5, 6, 9, 13, 14, 16, 19],
    25: [1, 3, 5, 6, 9, 13, 14, 20, 25],
    11: [0, 2, 4, 7, 11],
    20: [0, 2, 4, 10, 20],
    26: [21, 23, 26],
}

Example output:

expected_output = {
    6: [1, 3, 5, 6],
    18: [8, 12, 15, 17, 18],
    14: [9, 13, 14],
    19: [16, 19],
    25: [20, 25],
    4: [0, 2, 4],
    11: [7, 11],
    20: [10, 20],
    26: [21, 23, 26],
}

The key of each list is its last element. The order does not matter.

I have working code, but it is messy. Could someone suggest a simpler/cleaner solution?

from collections import Counter

def split_lists(sequences):
    # get first elem from each sequence
    firsts = list(map(lambda s: s[0], sequences))

    # get non-duplicate first elements
    not_duplicates = list(map(lambda c: c[0], filter(lambda c: c[1] == 1, Counter(firsts).items())))

    # start the new_sequences with the non-duplicate lists
    new_sequences = dict(map(lambda s: (s[-1], s), filter(lambda s: s[0] in not_duplicates, sequences)))

    # get duplicate first elements
    duplicates = list(map(lambda c: c[0], filter(lambda c: c[1] > 1, Counter(firsts).items())))
    for duplicate in duplicates:
        # get all lists that start with the duplicate element
        duplicate_lists = list(filter(lambda s: s[0] == duplicate, sequences))

        # get the common elements from the duplicate lists and make it a new
        # list to add to our new_sequences dict
        repeated_sequence = sorted(list(set.intersection(*list(map(set, duplicate_lists)))))
        new_sequences[repeated_sequence[-1]] = repeated_sequence

        # get lists from where I left off
        i = len(repeated_sequence)
        sub_lists = list(filter(lambda s: len(s) > 0, map(lambda s: s[i:], duplicate_lists)))

        # recursively split them and store them in new_sequences
        new_sequences.update(split_lists(sub_lists))

    return new_sequences

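Also, could you help me determine the complexity of my algorithm? The recursion makes my head spin. My best guess is O(n*m), where n is the number of lists and m is the length of the longest list.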

- damores

4

How long does a common subsequence need to be to count as a "sequence"? - Ma0

2

Are the lists already sorted? - Reut Sharabani

1

@ReutSharabani Yes, they are. - damores

1

This is the longest common substring problem. - JustinDanielson

How does this entry come about: 18: [8, 12, 15, 17, 18]? - Reut Sharabani

2 Answers

2

Using some functional tools, this is what I came up with (assuming the sequences are sorted). The key is the find_longest_prefixes function:

#!/usr/bin/env python
from itertools import chain, takewhile
from collections import defaultdict

sequences = {
    18: [1, 3, 5, 6, 8, 12, 15, 17, 18],
    19: [1, 3, 5, 6, 9, 13, 14, 16, 19],
    25: [1, 3, 5, 6, 9, 13, 14, 20, 25],
    11: [0, 2, 4, 7, 11],
    20: [0, 2, 4, 10, 20],
    26: [21, 23, 26],
}

def flatmap(f, it):
    return chain.from_iterable(map(f, it))

def all_items_equal(items):
    return len(set(items)) == 1

def group_by_first_item(ls):
    groups = defaultdict(list)
    for l in ls:
        groups[l[0]].append(l)
    return groups.values()

def find_longest_prefixes(ls):
    # takewhile gives us common prefix easily
    longest_prefix = list(takewhile(all_items_equal, zip(*ls)))
    if longest_prefix:
        yield tuple(vs[0] for vs in longest_prefix)
    # yield suffix per iterable
    leftovers = filter(None, tuple(l[len(longest_prefix):] for l in ls))
    leftover_groups = group_by_first_item(leftovers)
    yield from flatmap(find_longest_prefixes, leftover_groups)

# apply the prefix finder to all groups
all_sequences = find_longest_prefixes(sequences.values())

# create the result structure expected
results = {v[-1]: v for v in all_sequences}

print(results)

The value of results is:

{4: (0, 2, 4),
 6: (1, 3, 5, 6),
 11: (7, 11),
 14: (9, 13, 14),
 18: (8, 12, 15, 17, 18),
 19: (16, 19),
 20: (10, 20),
 25: (20, 25),
 26: (21, 23, 26)}

Note that these are tuples, which I prefer for their immutability.
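
To see the prefix step of find_longest_prefixes in isolation, here is a minimal sketch. The helper name common_prefix is hypothetical (it is not part of the code above); it only illustrates how zip and takewhile combine to pick out the shared beginning of one group of lists.

```python
from itertools import takewhile


def common_prefix(lists):
    """Return the longest shared prefix of the given lists as a tuple.

    Hypothetical helper for illustration; not part of the answer's code.
    """
    # zip(*lists) walks the lists in lockstep, one tuple per position.
    columns = zip(*lists)
    # Keep columns only while every list holds the same value there.
    shared = takewhile(lambda column: len(set(column)) == 1, columns)
    return tuple(column[0] for column in shared)


group = [
    [1, 3, 5, 6, 8, 12, 15, 17, 18],
    [1, 3, 5, 6, 9, 13, 14, 16, 19],
    [1, 3, 5, 6, 9, 13, 14, 20, 25],
]
print(common_prefix(group))  # (1, 3, 5, 6)
```

Because zip stops at the shortest list, the prefix can never be longer than the shortest member of the group.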

- Reut Sharabani

1

Very nice answer. You need a second pass to further combine sequences 19 and 25. - Maarten Fabré

Recursion is better than a second pass. I'll add it when I get the chance. - Reut Sharabani

@MaartenFabré Fixed, recursion added. - Reut Sharabani

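Thanks for the great answer. I like the use of generators and itertools. The code is more readable than mine. Overall I find @MaartenFabré's code more readable, so I'll go with his answer. - damores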

- Maarten Fabré · Accepted Answer

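Break it down into logical functions:

Find the sequences that start with the same element
Find the common elements

Same start:

This is easily done with a defaultdict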

from collections import defaultdict
def same_start(sequences):
    same_start = defaultdict(list)
    for seq in sequences:
        same_start[seq[0]].append(seq)
    return same_start.values()

list(same_start(sequences.values()))

[[[1, 3, 5, 6, 8, 12, 15, 17, 18],
  [1, 3, 5, 6, 9, 13, 14, 16, 19],
  [1, 3, 5, 6, 9, 13, 14, 20, 25]],
 [[0, 2, 4, 7, 11], [0, 2, 4, 10, 20]],
 [[21, 23, 26]]]

Finding the common elements:

This is a simple generator that yields values for as long as they are the same in every sequence.

def get_beginning(sequences):
    for values in zip(*sequences):
        v0 = values[0]
        if not all(i == v0 for i in values):
            return
        yield v0

Aggregate

def aggregate(same_start):
    for seq in same_start:
        if len(seq) < 2:
            yield seq[0]
            continue
        start = list(get_beginning(seq))
        yield start
        yield from (i[len(start):] for i in seq)

list(aggregate(same_start(sequences.values())))

[[1, 3, 5, 6],
 [8, 12, 15, 17, 18],
 [9, 13, 14, 16, 19],
 [9, 13, 14, 20, 25],
 [0, 2, 4],
 [7, 11],
 [10, 20],
 [21, 23, 26]]
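
A quick way to see that one pass is not always enough: the sketch below takes the list printed above as a literal and regroups it by first element, the same way same_start does. The names pieces, by_first and still_shared are introduced here for illustration only.

```python
from collections import defaultdict

# Output of the single aggregate pass shown above, copied as a literal.
pieces = [
    [1, 3, 5, 6],
    [8, 12, 15, 17, 18],
    [9, 13, 14, 16, 19],
    [9, 13, 14, 20, 25],
    [0, 2, 4],
    [7, 11],
    [10, 20],
    [21, 23, 26],
]

# Group the pieces by their first element, as same_start does.
by_first = defaultdict(list)
for piece in pieces:
    by_first[piece[0]].append(piece)

# Any group with more than one member still shares a common start.
still_shared = {first: group for first, group in by_first.items() if len(group) > 1}
print(still_shared)
# {9: [[9, 13, 14, 16, 19], [9, 13, 14, 20, 25]]}
```

The two lists starting with 9 still share the start 9, 13, 14, so they would need another round of splitting.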

Going further

If you also want to split off the common start of sequences 19 and 25, you can try the following

def combine(sequences):
    while True:
        s = same_start(sequences)
        if all(len(i) == 1 for i in s):
            return sequences
        sequences = tuple(aggregate(s))

{i[-1]: i for i in combine(sequences.values())}

{4: [0, 2, 4],
 6: [1, 3, 5, 6],
 11: [7, 11],
 14: [9, 13, 14],
 18: [8, 12, 15, 17, 18],
 19: [16, 19],
 20: [10, 20],
 25: [20, 25],
 26: [21, 23, 26]}
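
As a sanity check, the functions from this answer can be collected into one script and compared against the expected_output from the question. This is only a sketch: the function bodies follow the ones above, except that the local variable inside same_start is renamed to groups (my change, for readability), and it assumes that no list is a complete prefix of another.

```python
from collections import defaultdict

sequences = {
    18: [1, 3, 5, 6, 8, 12, 15, 17, 18],
    19: [1, 3, 5, 6, 9, 13, 14, 16, 19],
    25: [1, 3, 5, 6, 9, 13, 14, 20, 25],
    11: [0, 2, 4, 7, 11],
    20: [0, 2, 4, 10, 20],
    26: [21, 23, 26],
}

expected_output = {
    6: [1, 3, 5, 6],
    18: [8, 12, 15, 17, 18],
    14: [9, 13, 14],
    19: [16, 19],
    25: [20, 25],
    4: [0, 2, 4],
    11: [7, 11],
    20: [10, 20],
    26: [21, 23, 26],
}


def same_start(sequences):
    # Group the sequences by their first element.
    groups = defaultdict(list)
    for seq in sequences:
        groups[seq[0]].append(seq)
    return groups.values()


def get_beginning(sequences):
    # Yield values for as long as every sequence agrees at that position.
    for values in zip(*sequences):
        v0 = values[0]
        if not all(i == v0 for i in values):
            return
        yield v0


def aggregate(groups):
    # Split each group into its common start and the remaining tails.
    for seq in groups:
        if len(seq) < 2:
            yield seq[0]
            continue
        start = list(get_beginning(seq))
        yield start
        yield from (i[len(start):] for i in seq)


def combine(sequences):
    # Repeat the split until no two pieces start with the same element.
    while True:
        s = same_start(sequences)
        if all(len(i) == 1 for i in s):
            return sequences
        sequences = tuple(aggregate(s))


result = {i[-1]: i for i in combine(sequences.values())}
print(result == expected_output)  # True
```

Dictionary equality ignores insertion order, which matches the question's remark that the order does not matter.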