如何在伪多项式时间内从文本字符串表中找到字符串的最佳编码？

Question

如何在伪多项式时间内从文本字符串表中找到字符串的最佳编码？

3

问题：

考虑以下数据压缩技术。我们有一个包含m个文本字符串的表，每个字符串长度最多为k。我们想使用尽可能少的文本字符串来编码长度为n的数据字符串D。例如，如果我们的表包含（a，ba，abab，b），并且数据字符串是bababbaababa，则最佳编码方式是（b，abab，ba，abab，a）-总共五个代码单词。给出一种O（nmk）算法以找到最佳编码的长度。您可以假设每个字符串至少有一种在表中的编码。

我在skiena的“算法设计手册”书中发现了这个问题，并试图解决它。

我的最佳猜测是将字符串（D）的所有可能子字符串与长度为（m）的表中的所有文本进行匹配，时间复杂度：O（n * n * k * m）。

是否有更好的方法以O（nmk）伪多项式时间来解决这个问题？

- prakash sellathurai

4个回答

2

我们可以逐步地为每个句子前缀构建一个解决方案，通过尝试将一个单词附加到更小的前缀来最小化切割次数。

words = {'ba', 'a', 'abab', 'b'}
sentence = 'bababbaababa'

def smallest_cut(sentence, words):
    max_word_length = max(len(w) for w in words)
    words_by_length = [[] for i in range(max_word_length + 1)]
    for w in words:
        words_by_length[len(w)].append(w)

    results_by_length = dict()
    results_by_length[0] = []

    for i in range(len(sentence) + 1):
        for j in range(max(0, i-max_word_length), i):
            if j in results_by_length:
                for w in words_by_length[i-j]:
                    if sentence[j:i] == w:
                        if i not in results_by_length or len(results_by_length[j]) + 1 < len(results_by_length[i]):
                            results_by_length[i] = results_by_length[j] + [w]

    return results_by_length[len(sentence)]

print(smallest_cut(sentence, words))

第一个for循环执行了n次迭代，第二个和第三个循环共执行了m次迭代，比较sentence[j:i] == w需要k次操作。

- Sopel

@Stef，我也将我的原始变量名称更改为您的名称，以便在解决方案中实现可比性/一致性，并考虑采用您的函数名称。我也应该承认这一点吗？ - no comment

1

@Stef，我现在已经阅读了你的两个解决方案。没有抄袭（不包括输入数据和函数签名）。你们的解决方案非常不同。你的[甚至不正确]（请查看“输出”下面）。 - no comment

2

@Stef 哦，有趣。我以为tio.run会在链接中编码输出。但显然它将其存储在他们的服务器上并显示最新运行的输出。如果它显示正确结果（5个单词），那么运行几次直到你得到错误结果（6个单词）。您依赖于集合的迭代顺序，由于字符串哈希随机化而从运行到运行发生变化。 - no comment

1

@Stef 或者他们可以通过 w = sentence[j:i] 和检查 if w in words_by_length[i-j]:（后者应该是一个集合）来提高复杂性。我认为那样的话应该是 O((n+m)k)。 - no comment

2

不是真的 k<=m（因为它们是由输入“定义”的，而不是由我们对其进行操作），但你是对的，现实切片的数量也受到 m 的限制。所以... O(nk min(m,k))? - no comment

显示剩余13条评论

1

在Python中的备忘录解决方案。 solve(i) 计算前i个字母的结果，因此答案是 solve(n)。对于每个前缀长度 i，尝试所有单词作为该前缀的可能后缀。

由于有n个可能的 i 参数，每个检查m个单词，并且每个单词检查最多需要k个字母检查，所以时间复杂度为O(nmk)。

在线试用！

from functools import lru_cache

words = 'a', 'ba', 'abab', 'b'
sentence = 'bababbaababa'

@lru_cache
def solve(i):
    return i and min((solve(i - len(word)) + 1
                      for word in words
                      if sentence.endswith(word, None, i)),
                     default=float('inf'))

print(solve(len(sentence)))

- no comment

1

一个时间复杂度为O(nmk)的解决方案。

from collections import defaultdict
from typing import Set

def smallest_cut(sentence: str, words: Set[str]) -> int:
    lookup = [1e10 for i in range(len(sentence)+1)]
    step_results = defaultdict(set)
    step_results[0].add(sentence)
    curr_step = 1
    while step_results[curr_step-1]: # this line and next line has in fact O(n) because of 'lookup'
        for curr_sentence in step_results[curr_step-1]: # look to comment above
            for word in words: # O(m)
                if curr_sentence.startswith(word): # O(k)
                    new_sentence = curr_sentence[len(word):]  # remove prefix
                    if new_sentence:
                        if lookup[len(new_sentence)] > curr_step:
                            lookup[len(new_sentence)] = curr_step  # cheat with lookup for O(1) complexity instead O(log(n))
                            step_results[curr_step].add(new_sentence)
                    else: # we found shotrest encoding
                        return curr_step
        curr_step += 1

smallest_cut('bababbaababa', {'ba', 'a', 'abab', 'b'})

- quester

1

请问您能否编辑您的答案并加上解释？当我阅读您的代码时，我有几个问题。最重要的问题是：您的代码与我的基本不同在哪里？在我看来，您只是将我的代码中的“dict”替换为“defaultdict”，将“set”替换为“typing.Set”，并将“for _ in range(max_iterations):”替换为“while True:”。您如何得出您的代码是O(n m k)的结论？ - Stef

请查看新版本，有一个可以改进的地方是使用 trie 削减前缀。此外，while 和另一个行的复杂度为 O(n)，因为 steps_results 将保存最大 n 个句子，因为 lookup 的技巧确保了它。 - quester

尝试在您的解决方案上运行此代码

print(smallest_cut('bababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbbababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbaabababababbbababbaabababababbaabababababbaabababababbaabababababbaababababab', {'a', 'b', 'abab', 'ba'}))

216 - quester

@Stef 很有趣，我们在争论/打架谁是对的，而 OP 却坐在后面看着哈哈大笑 :) - quester

我询问的不只是争论 - 很明显，如果我的代码是n^2 m k而不是n m k，那么它肯定存在某些低效的地方。感谢您提供的测试用例，我会仔细研究它。 - Stef

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Kevin Montambault · Accepted Answer

如果我误解了，请纠正我，但是听起来你的解决方案的时间复杂度比O(nnkm)要高。

生成包含D中所有元素的所有可能划分列表需要O(nn)的时间。然后，对于每个划分中的每个子集（最多有n个子集），您需要验证是否存在与表匹配的项，这将需要每个子集的O(km)的时间。在n * n个可能的分区之后（实际上不是n * n，而是n的贝尔数），匹配每个最多包含n个集合的复杂度为O(nnnmk)。

话虽如此，我认为可以通过尝试构建一棵树来更快地完成它。

从D的开头开始，您可以尝试匹配表中的所有字符串。表中的每个字符串表示一个分支，如果表字符串与D的第一个位置不匹配，则该分支终止。如果匹配，则该过程在刚刚匹配的表字符串的末尾重新开始。

使用此方法的诀窍是，您只需要保留产生唯一长度的匹配序列。以广度优先的方式重复此过程意味着，如果已经发现了达到特定长度的一组表字符串，则再次达到该长度的任何表字符串集都保证包含来自表的相同数量或更多的字符串，但仍匹配D的相同子串。例如，表字符串{abab，ab}将都生成长度为4的字符串“abab”，但其中一个将在第一次迭代时匹配，而另一个将在第二次迭代时匹配。在此示例中，包含'ab'+'ab'的编码始终比包含'abab'的编码差，因此我们丢弃了'ab'+'ab'的编码。可以使用简单的O(1)查找表来检查这一点，并且任何已经看到的长度也可以终止其分支。因此，最多可以发现n个唯一长度，每个长度只能发现一次，这意味着最多可能有n个分支。

仍然必须检查表中的字符串，这仍然是O(mk)。加入分支后，我相信这会产生O(nmk)的复杂度，因为最多创建n个分支。

例如，在字符串'aaaaaa'上的表{a，aa}将每个迭代包含2个分支，共3次迭代。每个分支需要O(mk)的时间进行检查，因此此示例的时间复杂度为O(nmk)。

编辑：由于对我的提议实现存在一些怀疑，我已经在C中实现了它并进行了一些分析。


#include <stdlib.h>
#include <string.h>
#include <stdio.h>

const char* alphabet = "abcdef";
const int alphabetLen = strlen(alphabet);

void printTable(char** table, int k){
    printf("Table: {%s", table[0]);
    for(int i=1; i<k; i++){
        printf(", %s", table[i]);
    }
    printf("}\n");
}

char** generateTable(int k, int m){

    char** table = (char**)malloc(sizeof(char*)*k);
    for(int i=0; i<k; i++){
        int tableStringLen = (rand() % (m-1)) + 1;
        table[i] = (char*)malloc(tableStringLen+1);
        for(int j=0; j<tableStringLen; j++){
            table[i][j] = alphabet[rand() % alphabetLen];
        }
        table[i][tableStringLen] = 0;
    }

    return table;
}

// Unfortunately, the length of the string is partially determined by the size of m
// it's done this way to avoid generating a random string and ensuring the table can encode it
char* generateString(char** table, int k, int m){
    int minLength = rand()%200 + m*2;
    char* string = (char*)malloc(minLength + m);

    int j = 0;
    for(int i=0; j<minLength; i++){
        int tableStringIndex = rand() % k;
        strcpy(string+j, table[tableStringIndex]);
        j += strlen(table[tableStringIndex]);
    }
    string[j] = 0;

    return string;
}

void printSolution(char* string, int* branchLengths, int n){
    if(!n){ return; }
    printSolution(string, branchLengths, branchLengths[n]);
    printf("%.*s ", n-branchLengths[n], string+branchLengths[n]);
}

int* solve(char* string, char** table, int n, int m, int k, int* comparisons){
    int* currentBranchEnds = (int*)calloc(sizeof(int), n); // keeps track of all of the current ends
    int currentBranchCount = 1; // keeps track of how many active branch ends there are

    int* newBranchEnds = (int*)calloc(sizeof(int), n); // keeps track of the new branches on each iteration
    int newBranchCount = 0;

    int* branchLengths = (int*)calloc(sizeof(int), (n+1)); // used to reconstruct the solution in the end

    // used for O(1) length lookups
    int* tableStringLengths = (int*)malloc(sizeof(int) * k);
    for(int i=0; i<k; i++){ tableStringLengths[i] = strlen(table[i]); }

    *comparisons = 0;

    // continue searching while the entire string hasn't been encoded
    while(!branchLengths[n]){

        // for every active branch
        for(int i=0; i<currentBranchCount; i++){


            // try all table strings
            for(int j=0; j<k; j++){
                int newBranchEnd = currentBranchEnds[i] + tableStringLengths[j];
                
                // if the new length (branch end) already exists OR
                // if the new branch would be too long for the string, discard the branch
                *comparisons += 1;
                if(newBranchEnd > n || branchLengths[newBranchEnd]){ continue; }

                // check to see if the table string matches the target string at this position
                // could be done with strcmp, but is done in a loop here to be explicit about complexity
                char match = 1;
                for(int l=0; table[j][l]; l++){
                    *comparisons += 1;
                    if(string[currentBranchEnds[i] + l] != table[j][l]){
                        match = 0;
                        break;
                    }
                }

                // if it matches, we can create a new branch at this position
                *comparisons += 1;
                if(match){
                    branchLengths[newBranchEnd] = currentBranchEnds[i];
                    newBranchEnds[newBranchCount] = newBranchEnd;
                    newBranchCount += 1;
                }
            }
        }

        // swap the branch ends arrays to save copying
        int* tmp = currentBranchEnds;
        currentBranchEnds = newBranchEnds;
        newBranchEnds = tmp;

        currentBranchCount = newBranchCount;
        newBranchCount = 0;
    }

    free(currentBranchEnds);
    free(newBranchEnds);
    free(tableStringLengths);

    return branchLengths;
}

int main(){
    int k = rand() % 30 + 2;
    int m = rand() % 15 + 2;

    char** table = generateTable(k, m);
    printTable(table, k);

    char* string = generateString(table, k, m);
    int n = strlen(string);
    printf("String: %s\n", string);

    int comparisons;
    int* solution = solve(string, table, n, m, k, &comparisons);
    printf("Solution: ");
    printSolution(string, solution, n);
    printf("\n");
    printf("Comparisons: %d\n", comparisons);

    for(int i=0; i<k; i++){ free(table[i]); }
    free(table);
    free(solution);
    free(string);
}

通过对 n、m 和 k 的随机生成的值运行算法500000次，生成了该分析结果。在代码中，每遇到一个 if 语句就会增加“比较”的数量，这个数量会随着变量n、m和k的值而改变，平均每个 n、m 和 k 值的比较次数被绘制成图表。

很明显，比较次数（计算出来的）与 n、m 和 k 的大小之间存在线性关系。这表明复杂度为 O(nmk)。