在一个字符串中高效地进行多个替换

Question

在一个字符串中高效地进行多个替换

7

人们曾经讨论过如何基于字典对字符串进行多次替换（例如，请参见这个问题）。似乎有一组基于string.replace和一组基于正则表达式的选项，还有几种其他选项。

但是我对不同方法在字典大小方面的效率感兴趣，我发现这对效率有非常重要的影响。

my_subs = {'Hello world': 'apple', 'is': 'ship'}
string = 'Hello world! This is nice.'
new_string = my_efficient_method(string, my_subs)

明确一点，本问题不涉及如何进行这些替换，而是关于在何种条件下哪种方法更有效，并且适用哪些注意事项。特别是当有许多（>100k）长度较长（10-20k个字符）的字符串和庞大的字典（>80k对）时，在这些情况下，基于正则表达式的首选方法表现非常差。

- Pablo

如果子串有重叠，答案应该怎么做？例如 "abc"，{"ab":"x","bc":"y"}？如果替换输出可能是另一个替换输入，答案应该怎么做？例如 "a"，{"a":"b","b":"a"}？ - tsh

你已经给出了替换字典的大小，但没有测试字符串。根据输入字符串的大小，最快的算法可能会有所不同。 - tsh

好的观点。文档长度各不相同，但在我的情况下，它们大约是“几页”的数量级。我不知道为什么我没有在这里包含这些信息。关于另一个问题：在我的特定示例中，那些情况实际上不会发生（当考虑“整个单词”时），我想根据您对这些需求的具体需求指出不同的最佳算法可能更公平。 - Pablo

当以“整个单词”思考时，您只需要将"abc", {"ab":"x","bc":"y"}更改为"there is a", {"there is":"there was","is a":"isn't a"}。否则，一个简单的split -> lookup -> join方法将具有更好的性能。 - tsh

2个回答

1

通过将替换表编译成有限状态机，您可以在单个字符扫描中完成替换。如果您要在许多不同的字符串上应用相同的替换表，则使用预编译的FSM会提高性能。

与其他方法相比：

- FSM类似于Trie：非叶节点的替换值也是预先计算的。因此，您无需回溯Trie以节省更多时间。 - FSM类似于预编译的正则表达式：因为正则表达式只是FSM的友好形式。 - FSM也类似于replace方法：使用KMP算法搜索子字符串实际上就是一个FSM。我们只是将多个FSM构建到一个单一的FSM中。

compile的时间复杂度为O(mk^2)，其中m是替换表的条目数，k是单个替换规则的长度。 replace的时间复杂度为O(n+n')，其中n是输入的长度，n'是输出的长度。

需要澄清的是：在下面的代码中，当多个规则匹配相同的字符时，应用最早开始的规则。如果存在平局，则应用最长的规则。

import typing

mismatch = ''

ConvertTree = typing.Dict[str, typing.Tuple[str, 'ConvertTree']]
def compile(rulePairs:typing.Iterable[typing.Tuple[str, str]]) -> ConvertTree:
    # Build Trie
    root: ConvertTree = {}
    for src, dst in rulePairs:
        current = root
        for ch in src:
            if ch not in current:
                node: ConvertTree = {}
                node[mismatch] = None
                current[ch] = ('', node)
            current = current[ch][1]
        if not current.get(mismatch, None):
            current[mismatch] = (dst, root)
        else:
            old_dst = current[mismatch][0]
            if dst != old_dst:
                raise KeyError(f"Found conflict rules:\n  * {src} -> {old_dst}\n  * {src} -> {dst}")

    # Fill non-leaf nodes
    node_with_suffix: typing.List[typing.Tuple[ConvertTree, str, str]] = [(root, '', '')]
    for node, output, suffix in node_with_suffix:
        node_output, node_suffix = output, suffix;
        if node.get(mismatch, None):
            leaf = node[mismatch]
            node_output, node_suffix = leaf[0], ''
        for key in node:
            if key == mismatch: continue
            val = node[key]
            next_node = val[1]
            if next_node is root: continue
            if len(next_node) == 1:
                node[key] = next_node[mismatch]
            else:
                node_with_suffix.append((next_node, node_output, node_suffix + key))
        if not node_suffix: continue
        node_next = root
        for ch in node_suffix:
            while node_output != '' and ch not in node_next and node_next is not root:
                ref_output, node_next = node_next[mismatch]
                node_output += ref_output
            if not node_output:
                node_output += ch
            elif ch in node_next:
                ref_output, node_next = node_next[ch]
                node_output += ref_output
                break
            elif node_next is root:
                node_output += ch
                break
        node[mismatch] = (node_output, node_next)

    return root

def replace(input_text: str, root: ConvertTree) -> str:
    current = root
    output_arr = []
    for ch in input_text:
        while True:
            try:
                output, current = current[ch]
                output_arr.append(output)
            except:
                if current is root:
                    output_arr.append(ch)
                else:
                    output, current = current[mismatch]
                    output_arr.append(output)
                    continue
            break
    while current is not root:
        output, current = current['']
        output_arr.append(output)
    return ''.join(output_arr)

my_subs = [('Hello world', 'apple'), ('is', 'ship')]
string = 'Hello world! This is nice.'
new_string = replace(string, compile(my_subs))
print(new_string) # "apple! Thship ship nice."

如果您发现了我的代码中有任何错误，请留言告诉我。如果有任何问题，我会尽力修复。

我在小说阅读器上使用了同样的算法，使用字符串替换来转换简体和繁体中文。相关的Python代码也可以在我的GitHub仓库（中文）中找到。

- tsh

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Pablo · Accepted Answer

如前所述，有不同的方法，每种方法都有不同的优点。我使用三种不同的情况进行比较。

短字典（847个替换对）
中等字典（2528个替换对）
长字典（80430个替换对）

对于字典1和2（较短的字典），我在循环中重复每种方法50次，以获得更一致的时间。 对于较长的字典，一个文档的单次通过需要足够长的时间（可悲的是）。我使用Python 3.8在service tio上测试了1和2。长字典在我的笔记本电脑上使用Python 3.6进行了测试。只有方法之间的相对性能才是相关的，因此细节并不重要。

我的字符串长度在28k和29k之间。

所有给出的时间以秒为单位。

更新：Flashtext

一位同事发现了Flashtext，这是一个专门用于此类问题的Python库。它允许通过查询进行搜索，并应用替换。它比其他替代方案快两个数量级。在实验3中，我目前最好的时间是1.8秒。而Flashtext只需要0.015秒。

正则表达式

有很多变体，但最好的往往非常类似于这个：

import re
rep = dict((re.escape(k), v) for k, v in my_dict.items())
pattern = re.compile("|".join(rep.keys()))
new_string = pattern.sub(lambda m: rep[re.escape(m.group(0))], string)

执行时间为：

1.63
5.03
7.7

替换

这种方法只是在循环中简单地应用了 string.replace。（稍后我会谈论其中的问题。）

for original, replacement in self.my_dict.items():
    string = string.replace(original, replacement)

这个解决方案提出了一种使用reduce的变体，它可以迭代地应用Lambda表达式。最好通过官方文档中的示例来理解。该表达式

reduce(lambda x, y: x+y, [1, 2, 3, 4, 5])

等于 ((((1+2)+3)+4)+5)

import functools
new_string = functools.reduce(lambda a, k: a.replace(*k), 
                              my_dict.items(), string)

Python 3.8允许使用赋值表达式，例如这种方法。在其核心中，这也依赖于string.replace。

[string := string.replace(f' {a} ', f' {b} ') for a, b in my_dict.items()]

执行时间（括号内为使用reduce和赋值表达式的结果）：

1.37（1.39）（1.50）
4.10（4.12）（4.07）
1.9（1.8）（机器中没有Python 3.8）

递归 Lambda

这个提议涉及使用递归 Lambda。

mrep = lambda s, d: s if not d else mrep(s.replace(*d.popitem()), d)
new_string = mrep(string, my_dict)

执行时间为：

0.07
RecursionError
RecursionError

实用备注

请参见上面的更新：Flashtext比其他替代方案快得多。

从执行时间可以看出，递归方法显然是最快的，但它只适用于小型字典。在Python中不建议增加递归深度，因此这种方法对于更长的字典完全被舍弃。

正则表达式可提供更多控制替换的选项。例如，您可以在元素前或后使用\b来确保目标子字符串的那一侧没有单词字符（以防止{'a': '1'}被应用于'apple'）。代价是性能急剧下降，比其他选项慢近四倍。

赋值表达式、reduce 和简单的循环 replace 提供了类似的性能（无法使用较长的字典测试赋值表达式）。考虑到可读性，string.replace 似乎是最好的选择。与正则表达式相比，问题在于替换是按顺序进行的，而不是一次性完成的。因此，对于字符串 'a'，{'a': 'b', 'b': 'c'} 返回 'c'。Python 中现在有序字典（但您可能希望继续使用 OrderedDict），因此您可以仔细设置替换的顺序以避免出现问题。当然，对于 80k 个替换，您不能依靠这一点。

我目前正在使用带有 replace 的循环，并进行一些预处理以最小化麻烦。我在标点符号两侧添加空格（包含标点符号的项也在字典中），然后可以搜索由空格包围的子字符串，并插入带有空格的替换。当您的目标是多个单词时，这也适用：

string = 'This is: an island'
my_dict = {'is': 'is not', 'an island': 'a museum'}

使用replace和正则表达式，我得到了string = ' This is : an island '，这样我的替换循环就可以进行了。

for original, replacement in self.my_dict.items():
    string = string.replace(f' {original} ', f' {replacement} ')

返回'这不是博物馆'，如预期。请注意，“This”和“island”中的“is”未更改。可以使用正则表达式将标点符号修复回来，但我不需要这一步骤。