如何将给定的文本从字典中拆分成单词？

Question

如何将给定的文本从字典中拆分成单词？

stringalgorithmlanguage-agnosticdata-structures

16

这是一道面试题。假设您有一个字符串text和一个dictionary（一组字符串）。如何将text分解为子字符串，使得每个子字符串都在dictionary中找到。

例如，您可以使用/usr/share/dict/words将"thisisatext"分解为["this", "is", "a", "text"]。

我认为回溯法可以解决这个问题（伪Java代码）：

void solve(String s, Set<String> dict, List<String> solution) {
   if (s.length == 0)
      return
   for each prefix of s found in dict
      solve(s without prefix, dict, solution + prefix)
}
List<String> solution = new List<String>()
solve(text, dict, solution)

这个算法有意义吗？您会优化在字典中搜索前缀的步骤吗？您会推荐哪些数据结构？

- Michael

1

请纠正我，但是您的解决方案不是多项式的。使用trie和DP可以在O（n ^ 2）的时间内解决这个问题（实际上是O（k），其中k是字典中最长单词的长度）。如果您需要答案，请告诉我。 - ElKamina

@ElKamina 谢谢。我想听听动态规划的解决方案。 - Michael

4个回答

5

这个解决方案假设字典中存在Trie数据结构。此外，对于Trie中的每个节点，假设具有以下功能：

node.IsWord()：如果到该节点的路径是一个单词，则返回true
node.IsChild(char x)：如果存在标签为x的子节点，则返回true
node.GetChild(char x)：返回标签为x的子节点

Function annotate( String str, int start, int end, int root[], TrieNode node):
i = start
while i<=end:
    if node.IsChild ( str[i]):
        node = node.GetChild( str[i] )
        if node.IsWord():
            root[i+1] = start
        i+=1
    else:
        break;

end = len(str)-1
root = [-1 for i in range(len(str)+1)]
for start= 0:end:
    if start = 0 or root[start]>=0:
        annotate(str, start, end, root, trieRoot)

index  0  1  2  3  4  5  6  7  8  9  10  11
str:   t  h  i  s  i  s  a  t  e  x  t
root: -1 -1 -1 -1  0 -1  4  6 -1  6 -1   7

我将让您负责通过反向遍历根节点列出组成字符串的单词部分。

时间复杂度为O(nk)，其中n是字符串长度，k是字典中最长单词的长度。

PS：我假设字典中包含以下单词：this，is，a，text，ate。

- ElKamina

1

根节点不需要是一个列表的数组吗？否则，您将会失去在同一位置汇聚的多个字符串路径。 - Timothy Jones

@TimothyJones 我认为海报想要一个解决方案，而不是所有解决方案。你是对的，通过拥有一个列表，您可以打印出形成字符串的所有单词组合。 - ElKamina

1

也许将时间复杂度加入到你的答案中会更有价值 - 我认为它是O(n.k)，其中n是文本的大小，k是字典中最长的单词。(再加上读取最终路径所需的时间 - 在非多路径版本中，我认为它是O(m)，其中m是结果中单词的数量)。 - Timothy Jones

如果“str”包含未定义的单词，请删除此行：“if start = 0 or root[start]>=0:”
只有当“root[i+1] == -1”时，才将“root[i+1] = start”用于返回最长匹配。

- Lenik

我认为张贴者想要一个解决方案，而不是所有解决方案。我对张贴者的问题的理解：将字符串分解为子字符串，这些子字符串将覆盖整个原始字符串（没有剩余字母）。例如：beerevery-在这种情况下，您的解决方案可能首先具有root [9] = 4的值，然后稍后覆盖root [9] = 5; beer | every是正确的拆分; 但是beere | very是错误的拆分; 我认为您上面的算法在当前状态下将失败（即无法重构完整的、正确的字符串拆分）... - ChaimKut

显示剩余4条评论

4

方法1 - Trie看起来是一个很好的选择。生成英语词典单词的trie树。这种trie构建只需要一次成本。在trie树建立后，可以通过逐个字母比较轻松地比较string。如果在trie树上遇到叶子节点，可以假设找到了一个单词，并将其添加到列表中，然后继续遍历。在到达string末尾之前进行遍历。输出列表。

搜索时间复杂度为-O(word_length)。

空间复杂度为-O(charsize * word_length * no_words)。您的词典大小。

方法2-我听说过Suffix Trees，虽然从没用过，但在这里可能很有用。

方法3-更加繁琐且不佳的替代方案。您已经提出了这个建议。

你可以尝试相反的方式。运行dict检查是否存在子字符串匹配。这里假设dict中的键是英语词典/usr/share/dict/words的words。因此，伪代码如下所示：

(list) splitIntoWords(String str, dict d)
{
    words = []
    for (word in d)
    {
        if word in str
            words.append(word);
    }
    return words;
}

复杂度 - 遍历整个字典的时间复杂度为O(n)，查找子字符串的时间复杂度为O(1)。

空间 - 最坏情况下，如果len(words) == len(dict)，则空间复杂度为O(n)。

正如其他人指出的那样，这确实需要回溯。

- Srikar Appalaraju

4

你仍需处理回溯（backtracking），对吧？如果你的字典中包含“the”和“these”，那么输入“thesebugs”和“thesets”会导致问题。 - Adrian McCarthy

1

这似乎只能找到出现在字符串中的单词。问题中还有一个额外的条件 - 单词必须覆盖整个字符串而不重叠。 - Rafał Dowgird

2

您可以使用动态规划和哈希来解决这个问题。

计算字典中每个单词的哈希值。使用您最喜欢的哈希函数。我会使用类似于(a1 * B ^ (n - 1) + a2 * B ^ (n - 2) + ... + an * B ^ 0) % P的东西，其中a1a2...an是一个字符串，n是字符串的长度，B是多项式的基数，P是一个大的质数。如果您有字符串a1a2...an的哈希值，您可以在常数时间内计算出字符串a1a2...ana(n+1)的哈希值：(hashValue(a1a2...an) * B + a(n+1)) % P。

这部分的复杂度为O(N * M)，其中N是字典中单词的数量，M是字典中最长单词的长度。

然后，使用像这样的DP函数：

   bool vis[LENGHT_OF_STRING];
   bool go(char str[], int length, int position)
   {
      int i;

      // You found a set of words that can solve your task.
      if (position == length) {
          return true;
      }

      // You already have visited this position. You haven't had luck before, and obviously you won't have luck this time.
      if (vis[position]) {
         return false;
      }
      // Mark this position as visited.
      vis[position] = true;

      // A possible improvement is to stop this loop when the length of substring(position, i) is greater than the length of the longest word in the dictionary.
      for (i = position; position < length; i++) {
         // Calculate the hash value of the substring str(position, i);
         if (hashValue is in dict) {
            // You can partition the substring str(i + 1, length) in a set of words in the dictionary.
            if (go(i + 1)) {
               // Use the corresponding word for hashValue in the given position and return true because you found a partition for the substring str(position, length).
               return true;
            }
         }
      }

      return false;
   }

该算法的复杂度为O(N * M)，其中N是字符串的长度，M是字典中最长单词的长度，或者取决于您是否编码了改进，可能为O(N ^ 2)。

因此，算法的总复杂度将是：O(N1 * M) + O(N2 * M)（或O(N2 ^ 2)，其中N1是字典中的单词数，M是字典中最长单词的长度，N2是字符串的长度）。

如果您想不出一个好的哈希函数（没有任何冲突），另一种可能的解决方案是使用Tries或Patricia trie（如果正常trie的大小非常大）（我无法发布这些主题的链接，因为我的声望不足以发布超过2个链接）。但是，如果您使用此方法，则算法的复杂度将为O(N * M) * O(在trie中查找单词所需的时间)，其中N是字符串的长度，M是字典中最长单词的长度。

希望能对您有所帮助，对我的糟糕英语表示歉意。

- David Eleazar Gómez Cemeño

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Rob Neuhaus · Accepted Answer

这个问题的解决方案在这篇博客文章中详细介绍。

基本思路是对您编写的函数进行备忘录，并且您将拥有一个O（n ^ 2）时间，O（n）空间算法。