如何将文本分成最小的块?

8

概述

我有一组可能有效的文本块,可以使用它们来分割文本(如果可能的话)。

如何使用这些块来分割给定的文本,使结果在生成的块数量方面得到优化(最小化)?

测试套件

if __name__ == "__main__":
    import random
    import sys

    random.seed(1)

    # 1) Testing robustness
    examples = []
    sys.stdout.write("Testing correctness...")
    N = 50
    large_number = "3141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481"
    for i in range(100):
        for j in range(i):
            choices = random.sample(range(i), j)
            examples.append((choices, large_number))

    for (choices, large_number) in examples:
        get_it_done(choices, large_number)
    sys.stdout.write("OK")

    # 2) Testing correctness
    examples = [
        # Example1 ->
        # Solution ['012345678910203040506070', '80', '90', '100', '200', '300', '400', '500', '600', '700', '800', '900']
        (
            [
                "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
                "10", "20", "30", "40", "50", "60", "70", "80", "90",
                "100", "200", "300", "400", "500", "600", "700", "800", "900",
                "012345678910203040506070"
            ],
            "0123456789102030405060708090100200300400500600700800900"
        ),
        # Example2
        ## Solution ['100']
        (
            ["0", "1", "10", "100"],
            "100"
        ),
        # Example3
        ## Solution ['101234567891020304050', '6070809010020030040050', '0600700800900']
        (
            [
                "10", "20", "30", "40", "50", "60", "70", "80", "90",
                "012345678910203040506070",
                "101234567891020304050",
                "6070809010020030040050",
                "0600700800900"
            ],
            "10123456789102030405060708090100200300400500600700800900"
        ),
        # Example4
        ### Solution ['12', '34', '56', '78', '90']
        (
            [
                "12", "34", "56", "78", "90",
                "890",
            ],
            "1234567890"
        ),
        # Example5
        ## Solution ['12', '34']
        (
            [
                "1", "2", "3",
                "12", "23", "34"
            ],
            "1234"
        ),
        # Example6
        ## Solution ['100', '10']
        (
            ["0", "1", "10", "100"],
            "10010"
        )
    ]

    score = 0
    for (choices, large_number) in examples:
        res = get_it_done(choices, large_number)
        flag = "".join(res) == large_number
        print("{0}\n{1}\n{2} --> {3}".format(
            large_number, "".join(res), res, flag))
        print('-' * 80)
        score += flag

    print(
        "Score: {0}/{1} = {2:.2f}%".format(score, len(examples), score / len(examples) * 100))

    # 3) TODO: Testing optimization, it should provide (if possible)
    #          minimal cases

问题

我该如何在不使用暴力方法的情况下解决这个Python问题?


1
天真的解决方案似乎是从您的块中最大的开始,在字符串中查找它。如果找到它,请将字符串拆分为两个子字符串,然后递归考虑这些字符串。 - Patrick Haugh
我的第一反应也是这样,但我不确定这是否保证产生最优解。可能可以用一个长块和两个短块或两个中等大小的块来替换一个部分。在这种情况下,该算法将无法产生最优解。 - Leon
贪心算法通常可以创建一个好的解决方案,即使不是最佳解决方案。好处在于实现起来很简单。您可能可以通过动态规划获得最佳解决方案,但这可能需要一些时间。事实上,最佳解决方案可能是计算难度大的。 - Jim Mischel
1
你所说的“chunks”是指子字符串或标记吗?我的第一个想法是文本压缩(霍夫曼编码),如果你可以自己制作(前缀自由)的标记,但这并不真正符合问题。我同意动态规划很可能是解决方案,但这是一个非常广泛的主题。 - Kenny Ostrom
第一条评论来自@PatrickHaugh,提供了基本的想法。 - Jim Mischel
4个回答

7
使用动态规划,您可以构建一个列表 (l0, l1, l2, ... ln-1),其中n是输入字符串中字符的数量,li是到达输入字符串的第i个字符所需的最小块数。整体结构如下:
minValues := list with n infinity entries
for i from 0 to n-1
    for every choice c that is a suffix of input[0..i]
        if i - len(c) < 0
            newVal = 1
        else
            newVal = minValues[i - len(c)] + 1
        end if
        if(newVal < minValues[i])
            minValues[i] = newVal
            //optionally record the used chunk
        end if
    next
next

整个字符串所需的最小分块数为ln-1。您可以通过回溯列表来获取实际的分块(这需要记录使用过的分块)。
使用倒置选择字符串的trie可以加快检索后缀的选择。最坏情况下的复杂度仍然是O(n * c * lc),其中n是输入字符串的长度,c是选择的数量,lc是选择的最大长度。但是,这种复杂度只会出现在嵌套的后缀选择中(例如0100100010等)。在这种情况下,trie将退化为列表。平均而言,运行时间应该要少得多。在假设从trie中检索到的选择数量始终是一个小常数的情况下,它是O(n * lc)(实际上,lc因子可能也要小得多)。
这里是一个例子:
choices = ["0","1","10","100"]
text = "10010"

algorithm step    content of minValues
                   0      1       2        3      4
---------------------------------------------------------
initialize        (∞,     ∞ ,     ∞ ,      ∞ ,    ∞     )
i = 0, c = "1"    (1 "1", ∞ ,     ∞ ,      ∞ ,    ∞     )
i = 1, c = "0"    (1 "1", 2 "0",  ∞ ,      ∞ ,    ∞     )
i = 1, c = "10"   (1 "1", 1 "10", ∞ ,      ∞ ,    ∞     )
i = 2, c = "0"    (1 "1", 1 "10", 2 "0",   ∞ ,    ∞     )
i = 2, c = "100"  (1 "1", 1 "10", 1 "100", ∞ ,    ∞     )
i = 3, c = "1"    (1 "1", 1 "10", 1 "100", 2 "1", ∞     )
i = 4, c = "0"    (1 "1", 1 "10", 1 "100", 2 "1", 3 "0" )
i = 4, c = "10"   (1 "1", 1 "10", 1 "100", 2 "1", 2 "10")

意思是:我们可以用两个块来组合这个字符串。追溯回去,以相反的顺序得到这些块:"10","100"。

我正在考虑输入[0..i]无法由给定选择组成的情况,这似乎没问题。你只需忽略不可用的后缀,直到在后面的i处找到可以组合的内容。如果问题无法解决,我们可以通过将重建的最终答案与输入字符串进行比较来确定。同意吗? - Kenny Ostrom
@BPL:一个简单的实现(不使用trie)如下所示:对于每个选择“c”,请检查以下内容:最后一个字符是否等于“input[i]”,倒数第二个字符是否等于“input[i-1]”,以此类推,直到完全消耗“c”。 - Nico Schertler
@Kenny 是的,无法组成的部分字符串保持为。检查整个字符串是否可组合的更简单的方法是检查minValues[n-1]是否为 - Nico Schertler
@Nico 一个字符串可能在中间无法组合,但可能具有可组合的前缀和后缀。这似乎直到最终重构(不匹配输入字符串)才会被检测到。但仍然得到答案,+1。 - Kenny Ostrom
@BPL:也许可以通过修改后缀树来找到解决方案,但我没有好的想法。后缀树也像我为后缀检索建议的trie一样。因此,也许你可以巧妙地将这两个结构组合起来。但我不知道。 - Nico Schertler
显示剩余2条评论

4
抱歉,实现方式有些不规范。但我认为它始终返回最佳答案。(虽然没有证明过。)这是一个快速而完整的python实现,在所有提出的用例中都能返回正确答案。
该算法是递归的,具体如下:
1.从文本开头开始。 2.查找可以用作第一块的匹配块。 3.对于每个匹配块,递归地使用剩余文本(即从开头删除的块)重新开始步骤1,并收集解决方案。 4.返回收集到的解决方案中最短的一个。
当算法完成时,所有可能的路径(包括不可能的路径,即在结尾处没有匹配的路径)应该已经被精确地遍历了一次。
为了有效地执行步骤2,我构建了选择的patricia树,因此可以快速查找与文本开头匹配的可能块。
def get_seq_in_tree(tree, choice):
    if type(tree)!=dict:
        if choice == tree:
            return [choice]
        return []
    for i in range(1, len(choice)+1):
        if choice[:i] in tree:
            return [choice[:i]] + get_seq_in_tree(tree[choice[:i]], choice[i:])
    return []

def seq_can_end_here(tree, seq):
    res = []
    last = tree
    for e, c in enumerate(seq):
        if '' in last[c]:
            res.append(e+1)
        last = last[c]
    return res

def build_tree(choices):
    tree = {}
    choices = sorted(choices)
    for choice in choices:
        last = tree
        for c in choice:
            if c not in last:
                last[c] = {}
            last = last[c]
        last['']=None
    return tree

solution_cache = {}
ncalls = 0

def solve(tree, number):
    global solution_cache
    global ncalls
    ncalls +=1

    # take every path only once
    if number in solution_cache: 
        return solution_cache[number]

    solutions = []
    seq =  get_seq_in_tree(tree, number)
    endings = seq_can_end_here(tree, seq)
    for i in reversed(endings):
        current_solution = []
        current_solution.append(number[:i])
        if i == len(number):
            solutions.append(current_solution)
        else:
            next_solution = solve(tree, number[i:])
            if next_solution:
                solutions.append(current_solution + next_solution)
    if not solutions:
        return None

    shortest_solution = sorted([(len(solution), solution) for solution in solutions])[0][1]

    solution_cache[number] = shortest_solution
    return shortest_solution

def get_it_done(choices, number):
    tree = build_tree(choices)
    solution = solve(tree, number)
    return solution


if __name__ == "__main__":

    examples = [
        # Example1 ->
        # Solution ['012345678910203040506070', '80', '90', '100', '200', '300', '400', '500', '600', '700', '800', '900']
        (
            [
                "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
                "10", "20", "30", "40", "50", "60", "70", "80", "90",
                "100", "200", "300", "400", "500", "600", "700", "800", "900",
                "012345678910203040506070"
            ],
            "0123456789102030405060708090100200300400500600700800900"
        ),
        ## Example2
        ## Solution ['100']
        (
            ["0", "1", "10", "100"],
            "100"
        ),
        ## Example3
        ## Solution ['101234567891020304050', '6070809010020030040050', '0600700800900']
        (
            [
                "10", "20", "30", "40", "50", "60", "70", "80", "90",
                "012345678910203040506070",
                "101234567891020304050",
                "6070809010020030040050",
                "0600700800900"
            ],
            "10123456789102030405060708090100200300400500600700800900"
        ),
        ### Example4
        ### Solution ['12', '34', '56', '78', '90']
        (
            [
                "12", "34", "56", "78", "90",
                "890",
            ],
            "1234567890"
        ),
        ## Example5
        ## Solution ['12', '34']
        (
            [
                "1", "2", "3",
                "12", "23", "34"
            ],
            "1234"
        ),
        # Example6
        ## Solution ['100', '10']
        (
            ["0", "1", "10", "100"],
            "10010"
        )
    ]

    score = 0
    for (choices, large_number) in examples:
        res = get_it_done(choices, large_number)
        flag = "".join(res) == large_number
        print("{0}\n{1}\n{2} --> {3}".format(
            large_number, "".join(res), res, flag))
        print('-' * 80)
        score += flag

    print("Score: {0}/{1} = {2:.2f}%".format(score, len(examples), score / len(examples) * 100))

我猜测复杂度大概是O(L * N * log(C)),其中L是文本长度,N是词汇表大小,C是选择数量。 编辑:包括了缺失的测试用例。

是的,感谢您的评论。不知道第六个测试怎么会被删除...但是最后一个案例也符合预期。 - Peter

2
def find_shortest_path(graph, start, end, path=[]):
    path = path + [start]
    if start == end:
        return path
    if start not in graph:
        return None
    shortest = None
    for node in graph[start]:
        if node not in path:
            newpath = find_shortest_path(graph, node, end, path)
            if newpath:
                if not shortest or len(newpath) < len(shortest):
                    shortest = newpath
    return shortest


def get_it_done(choices, number):
    mapping = {}
    graph = {} 

    for choice in choices:
        if choice in number:
            _from = number.index(choice)
            _to = _from + len(choice)
            mapping.setdefault((_from, _to), choice)

    items = sorted(mapping.items(), key=lambda x: x[0])
    for _range, value in items:
        _from, _to = _range
        graph.setdefault(_from, []).append(_to)
    start = 0
    end = _range[1] #this is hack, works only in python 2.7
    path = find_shortest_path(graph, start, end) 
    ranges = [tuple(path[i:i+2]) for i in range(len(path) - 1)]
    if len(ranges) == 1:
        items = sorted(choices, key=len, reverse=True)
        number_length = len(number) 
        result = ''
        for item in items:
            result += item
            if len(result) == number_length: 
                return result 
    return [mapping[_range] for _range in ranges]


if __name__ == "__main__":
    examples = [
        # Example1 ->
        # Solution ['012345678910203040506070', '80', '90', '100', '200', '300', '400', '500', '600', '700', '800', '900']
        (
            [
                "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
                "10", "20", "30", "40", "50", "60", "70", "80", "90",
                "100", "200", "300", "400", "500", "600", "700", "800", "900",
                "012345678910203040506070"
            ],
            "0123456789102030405060708090100200300400500600700800900"
        ),
        ## Example2
        ## Solution ['100']
        (
            ["0", "1", "10", "100"],
            "100"
        ),
        ## Example3
        ## Solution ['101234567891020304050', '6070809010020030040050', '0600700800900']
        (
            [
                "10", "20", "30", "40", "50", "60", "70", "80", "90",
                "012345678910203040506070",
                "101234567891020304050",
                "6070809010020030040050",
                "0600700800900"
            ],
            "10123456789102030405060708090100200300400500600700800900"
        ),
        ### Example4
        ### Solution ['12', '34', '56', '78', '90']
        (
            [
                "12", "34", "56", "78", "90",
                "890",
            ],
            "1234567890"
        ),
        ## Example5
        ## Solution ['12', '34']
        (
            [
                "1", "2", "3",
                "12", "23", "34"
            ],
            "1234"
        ),
        # Example6
        ## Solution ['100', '10']
        (
            ["0", "1", "10", "100"],
            "10010"
        )
    ]

    score = 0
    for (choices, large_number) in examples:
        res = get_it_done(choices, large_number)
        flag = "".join(res) == large_number
        print("{0}\n{1}\n{2} --> {3}".format(
            large_number, "".join(res), res, flag))
        print('-' * 80)
        score += flag

    print(
        "Score: {0}/{1} = {2:.2f}%".format(score, len(examples), score / len(examples) * 100))

get_it_done函数首先创建一个mapping,其中键是每个choicenumber中出现的范围。然后按照mapping字典中每个键的第一个项目进行排序。下一步是创建graph。然后使用find_shortest_path函数,我们可以找到以最优方式构建结果的最短路径。因此,在最后,我们可以再次使用mapping,根据它们的范围返回choices。如果只有一个范围,我们就会遇到所有数字都由相同的两个值组成的情况,因此规则也会不同。我们可以直接从choices(按降序排序)收集数字,直到结果的长度与number的长度相同。


-3
def find_shortest_path(graph, start, end, path=[]):
    path = path + [start]
    if start == end:
        return path
    if start not in graph:
        return None
    shortest = None
    for node in graph[start]:
        if node not in path:
            newpath = find_shortest_path(graph, node, end, path)
            if newpath:
                if not shortest or len(newpath) < len(shortest):
                    shortest = newpath
    return shortest


def get_it_done(choices, number):
    mapping = {}
    graph = {} 

    for choice in choices:
        if choice in number:
            _from = number.index(choice)
            _to = _from + len(choice)
            mapping.setdefault((_from, _to), choice)

    items = sorted(mapping.items(), key=lambda x: x[0])
    for _range, value in items:
        _from, _to = _range
        graph.setdefault(_from, []).append(_to)
    start = 0
    end = _range[1] #this is hack, works only in python 2.7
    path = find_shortest_path(graph, start, end) 
    ranges = [tuple(path[i:i+2]) for i in range(len(path) - 1)]
    if len(ranges) == 1:
        return [mapping[(start, graph[start][-1])]]
    return [mapping[_range] for _range in ranges]


if __name__ == "__main__":
    examples = [
        # Example1 ->
        # Solution ['012345678910203040506070', '80', '90', '100', '200', '300', '400', '500', '600', '700', '800', '900']
        (
            [
                "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
                "10", "20", "30", "40", "50", "60", "70", "80", "90",
                "100", "200", "300", "400", "500", "600", "700", "800", "900",
                "012345678910203040506070"
            ],
            "0123456789102030405060708090100200300400500600700800900"
        ),
        ## Example2
        ## Solution ['100']
        (
            ["0", "1", "10", "100"],
            "100"
        ),
        ## Example3
        ## Solution ['101234567891020304050', '6070809010020030040050', '0600700800900']
        (
            [
                "10", "20", "30", "40", "50", "60", "70", "80", "90",
                "012345678910203040506070",
                "101234567891020304050",
                "6070809010020030040050",
                "0600700800900"
            ],
            "10123456789102030405060708090100200300400500600700800900"
        ),
        ### Example4
        ### Solution ['12', '34', '56', '78', '90']
        (
            [
                "12", "34", "56", "78", "90",
                "890",
            ],
            "1234567890"
        ),
        ## Example5
        ## Solution ['12', '34']
        (
            [
                "1", "2", "3",
                "12", "23", "34"
            ],
            "1234"
        )
    ]

    for (choices, large_number) in examples:
        res = get_it_done(choices, large_number)
        print("{0}\n{1}\n{2} --> {3}".format(
            large_number, "".join(res), res, "".join(res) == large_number))
        print('-' * 80)

2
这是原始@turkus算法的1:1副本,这意味着什么? - BPL

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接