Replace strings with placeholders and restore them after a function has run.

Question

6

Given a string and a list of substrings that should be replaced by placeholders, e.g.:

import re
from copy import copy 

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

The first goal is to replace the substrings from phrases in original_text with indexed placeholders, e.g.:

text = copy(original_text)
backplacement = {}
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

[out]:

Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen

Then there will be some functions that manipulate the text with the placeholders, e.g.:

cleaned_text = func('Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen')
print(cleaned_text)

which outputs:

MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2

The last step is to do the replacement in reverse and put the original phrases back, i.e.:

' '.join([backplacement[tok] if tok in backplacement else tok for tok in cleaned_text.split()])

[out]:

"'s_morgen ik 's-Hertogenbosch depository_financial_institution"

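When the list of substrings in phrases is huge, the first replacement and the final back-replacement take a very long time.

Is there a way to do the replacement / back-replacement with a regex?

Using the regex substitution re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text) is not very helpful, especially if the phrases contain substrings that do not match full words.

E.g.: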

phrases = ["org", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

we get an awkward output:

Something, 's mMWEPHRASE0en, ik MWEPHRASE1 im das MWEPHRASE2 gehen

I've tried using '\b{}\b'.format(phrase), but that doesn't work for phrases that contain punctuation, e.g.:

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"\b{}\b".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

[out]:

Something, 's morgen, ik 's-Hertogenbosch im das MWEPHRASE2 gehen

Is there some way to denote the word boundaries of the phrases in the re.sub regex pattern?

- alvas

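In your desired output, every string that does not appear in phrases is removed, except for ik. Why is that? - Ajax1234

You're doing this the hard way. "Then there will be some functions that manipulate the text with the placeholders." So you need a function that operates on the text after the placeholders have been added. That function has to split on whitespace or something else. Now you have an array in which you can manipulate every element except the placeholders, then join it back into a string, and finally replace the placeholders with the actual words. Correct? - user557597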

2

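In a single pass, I would use a regex to match all the words into a 2D array (or list). The first dimension is the string part, the second is a flag. The flag is 0 when the match is a non-phrase part of the string and 1 when it is a phrase. You can then iterate over the array and leave the elements flagged 1 alone. Add, delete, and rearrange elements as needed, then join them back together. The regex is simple: ((?:(?!phrase1|phrase2|phrase3)[\S\s])+)|(phrase1|phrase2|phrase3), where capture group 1 is a non-phrase part of the string and capture group 2 is a phrase. - user557597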
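The flag-list idea in the comment above can be sketched roughly as follows. This is only an illustration under a few assumptions that are not in the comment: the helper name split_flagged is made up, the phrases are escaped with re.escape, and longer phrases are tried first.

```python
import re

def split_flagged(text, phrases):
    """Split text into (segment, is_phrase) pairs in a single pass."""
    # Longest phrases first, so a longer phrase wins over one of its prefixes.
    alternation = "|".join(
        re.escape(p) for p in sorted(phrases, key=len, reverse=True)
    )
    # Group 1: a run of characters at which no phrase starts; group 2: a phrase.
    pattern = re.compile(r"((?:(?!{0})[\S\s])+)|({0})".format(alternation))
    return [(m.group(0), m.group(2) is not None) for m in pattern.finditer(text)]

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

segments = split_flagged(text, phrases)
# Only touch the non-phrase segments (here: upper-case them), then rejoin.
result = "".join(seg if is_phrase else seg.upper() for seg, is_phrase in segments)
print(result)
```

Because the phrases are never replaced by placeholders here, no back-replacement step is needed; whether that fits depends on what the cleaning function expects as input.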

This looks like an alternative: https://github.com/vi3k6i5/flashtext - alvas

1

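Regarding word boundaries, you have to use r"(?<!\w){}(?!\w)".format(phrase). Since some of your keywords start with a non-word character, you cannot use \b. Could you share more of the logic you need to implement? It looks like you may need to pass a callback/lambda as the second argument to re.sub so that each match is replaced only once. - Wiktor Stribiżew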

Have you tried my approach? Or do you want to switch to FlashText now? - Wiktor Stribiżew
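A minimal sketch of the lookaround-plus-callback suggestion, assuming all phrases are combined into one escaped alternation so the text is scanned once. The dictionary names are hypothetical, "org" is added to the phrase list only to show that it no longer matches inside "morgen", and the cleaned string is a stand-in for the output of func:

```python
import re

phrases = ["org", "'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

# Placeholder <-> phrase mappings, as in the question.
phrase_to_placeholder = {p: "MWEPHRASE{}".format(i) for i, p in enumerate(phrases)}
placeholder_to_phrase = {v: k.replace(" ", "_") for k, v in phrase_to_placeholder.items()}

# One alternation of escaped phrases, longest first, guarded by lookarounds
# instead of \b so that phrases may start or end with a non-word character.
alternation = "|".join(re.escape(p) for p in sorted(phrases, key=len, reverse=True))
forward = re.compile(r"(?<!\w)(?:{})(?!\w)".format(alternation))
backward = re.compile(r"MWEPHRASE\d+")

# The callback looks up each match in the dictionary, so every match is replaced once.
text = forward.sub(lambda m: phrase_to_placeholder[m.group(0)], original_text)
print(text)

cleaned = "MWEPHRASE1 ik MWEPHRASE2 MWEPHRASE3"  # stand-in for func(text)
# Unknown placeholders are left untouched via dict.get.
restored = backward.sub(lambda m: placeholder_to_phrase.get(m.group(0), m.group(0)), cleaned)
print(restored)
```

Both directions are a single regex pass with a dictionary lookup per match, rather than one re.sub call per phrase.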

4 Answers

2

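I think there are two keys to using regular expressions for this task:

Use custom boundaries, capture them, and substitute them back along with the phrase.
Use a function to process the substitution matches, in both directions.

Below is an implementation that uses this approach. I've tweaked your text slightly to repeat one of the phrases.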

import re
from copy import copy 

original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen 's morgen"
text = copy(original_text)

#
# The phrases of interest
#
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]

#
# Create the mapping dictionaries
#
phrase_to_mwe = {}
mwe_to_phrase = {}

#
# Build the mappings
#
for i, phrase in enumerate(phrases):

    mwephrase                = "MWEPHRASE{}".format(i)
    mwe_to_phrase[mwephrase] = phrase.replace(' ', '_')
    phrase_to_mwe[phrase]    = mwephrase

#
# Regex match handlers
#
def handle_forward(match):

    b1     = match.group(1)
    phrase = match.group(2)
    b2     = match.group(3)

    return b1 + phrase_to_mwe[phrase] + b2


def handle_backward(match):

    return mwe_to_phrase[match.group(1)]

#
# The forward regex will look like:
#
#    (^|[ ])('s morgen|'s-Hertogenbosch|depository financial institution)([, ]|$)
# 
# which captures three components:
#
#    (1) Front boundary
#    (2) Phrase
#    (3) Back boundary
#
# Anchors allow matching at the beginning and end of the text. Additional boundary characters can be
# added as necessary, e.g. to allow semicolons after a phrase, we could update the back boundary to:
#
#    ([,; ]|$)
#
regex_forward  = re.compile(r'(^|[ ])(' + '|'.join(phrases) + r')([, ]|$)')
regex_backward = re.compile(r'(MWEPHRASE\d+)')

#
# Pretend we cleaned the text in the middle
#
cleaned = 'MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2 MWEPHRASE0'

#
# Do the translations
#
text1 = regex_forward .sub(handle_forward,  text)
text2 = regex_backward.sub(handle_backward, cleaned)

print('original: {}'.format(original_text))
print('text1   : {}'.format(text1))
print('text2   : {}'.format(text2))

Running this produces:

original: Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen 's morgen
text1   : Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen MWEPHRASE0
text2   : 's_morgen ik 's-Hertogenbosch depository_financial_institution 's_morgen

- cryptoplex

1

Here is a strategy you could use:

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

# need this module for the reduce function
import functools as fn

#convert phrases into a dictionary of numbered placeholders (tokens)
tokens = { kw:"MWEPHRASE%s"%i for i,kw in enumerate(phrases) }

#replace embedded phrases with their respective token
tokenized = fn.reduce(lambda s,kw: tokens[kw].join(s.split(kw)), phrases, original_text)

# Apply the text-cleaning logic to the tokenized text.
# This assumes the placeholders are left untouched
# (although it's OK to move them around).
# cleanUpfunction stands for your own cleaning function; it is not defined here.
cleaned_text = cleanUpfunction(tokenized)

# reverse the token dictionary (to map numbered placeholders back to the original phrases)
unTokens = {v:k for k,v in tokens.items() }

# rebuild the phrases by replacing each token (placeholder) with its original text
final_text = fn.reduce(lambda s,tok: unTokens[tok].join(s.split(tok)), unTokens, cleaned_text)

- Alain T.
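For reference, a runnable sketch of the same split/join strategy. The clean_up function is a hypothetical stand-in for the real cleaning step, str.replace is used as an equivalent of token.join(s.split(kw)), and the placeholders are restored longest name first, an added precaution so that MWEPHRASE1 is not replaced inside MWEPHRASE10 when there are more than ten phrases:

```python
import functools as fn

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

def clean_up(text):
    """Hypothetical stand-in for the real cleaning step: keep 'ik' and the placeholders."""
    stripped = (tok.strip(",") for tok in text.split())
    return " ".join(t for t in stripped if t == "ik" or t.startswith("MWEPHRASE"))

tokens = {kw: "MWEPHRASE%s" % i for i, kw in enumerate(phrases)}
un_tokens = {v: k for k, v in tokens.items()}

# Forward pass: one str.replace per phrase.
tokenized = fn.reduce(lambda s, kw: s.replace(kw, tokens[kw]), phrases, original_text)
cleaned = clean_up(tokenized)

# Backward pass: longest placeholder names first to avoid prefix collisions.
final_text = fn.reduce(
    lambda s, tok: s.replace(tok, un_tokens[tok]),
    sorted(un_tokens, key=len, reverse=True),
    cleaned,
)
print(final_text)
```

Note that this still performs one pass over the text per phrase, so it may not help much with the very large phrase lists mentioned in the question.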

1

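What you need is called "multi-string search" or "multi-pattern search". The most common solutions are the Aho-Corasick and Rabin-Karp algorithms. If you want to implement it yourself, go with Rabin-Karp, since it is easier to grasp. Otherwise, you will find some libraries. Here is a solution using the library https://pypi.python.org/pypi/py_aho_corasick.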

Let:

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

And, for testing purposes:

def clean(text):
    """A simple stub"""
    assert text == 'Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen'
    return "MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2"

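Now you have to define two automata, one for the outward trip and one for the return trip. An automaton is defined by a list of (key, value) pairs: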

import py_aho_corasick

fore_automaton = py_aho_corasick.Automaton([(phrase,"MWEPHRASE{}".format(i)) for i, phrase in enumerate(phrases)])
back_automaton = py_aho_corasick.Automaton([("MWEPHRASE{}".format(i), phrase.replace(' ','_')) for i, phrase in enumerate(phrases)])

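The automaton will scan the text and return a list of matches. A match is a triple (position, key, value). With a little work on the matches, you can replace the keys with the values: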

def process(automaton, text):
    """Returns a new text, with keys of the automaton replaced by values"""
    matches = automaton.get_keywords_found(text.lower()) # text.lower() because the automaton of py_aho_corasick uses lowercase for keys
    bk_value_eks = [(i,v,i+len(k)) for i,k,v in matches] # (begin of key, value, end of key)
    chunks = [bk_value_ek1[1]+text[bk_value_ek1[2]:bk_value_ek2[0]] for bk_value_ek1,bk_value_ek2 in zip([(-1,"",0)]+bk_value_eks, bk_value_eks+[(len(text),"",-1)]) if bk_value_ek1[2] <= bk_value_ek2[0]] # see below
    return "".join(chunks)

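A brief explanation of chunks = [bk_value_ek1[1]+text[bk_value_ek1[2]:bk_value_ek2[0]] for bk_value_ek1,bk_value_ek2 in zip([(-1,"",0)]+bk_value_eks, bk_value_eks+[(len(text),"",-1)]) if bk_value_ek1[2] <= bk_value_ek2[0]]. I use zip in the usual way for pairing adjacent elements: zip(arr, arr[1:]) yields (arr[0], arr[1]), (arr[1], arr[2]), ... so that every match is considered together with its successor. Here I added two sentinels to handle the beginning and the end of the matches.

In the normal case, I just output the value (= bk_value_ek1[1]) and the text between the end of the key and the start of the next key (text[bk_value_ek1[2]:bk_value_ek2[0]]).
The begin sentinel has an empty value and its key ends at position 0, so the first chunk will be "" + text[0:begin of key1], i.e. the text before the first key.
Similarly, the end sentinel has an empty value and its key starts at the end of the text, so the last chunk will be: the value of the last match + text[end of the last key:len(text)].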

What happens when keys overlap? Take an example: text="abcdef", phrases={"bcd":"1", "cde":"2"}. You have two matches: (1, "bcd", "1") and (2, "cde", "2"). Let's see: bk_value_eks = [(1, "1", 4), (2, "2", 5)]. So without if bk_value_ek1[2] <= bk_value_ek2[0], the text would be replaced by text[:1]+"1"+text[4:2]+"2"+text[5:], i.e. "a"+"1"+""+"2"+"f" = "a12f" instead of "a1ef" (ignoring the second match)...
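Tracing the zip-based filter by hand on this example appears to give "a2f" rather than "a1ef": the pair (first match, second match) is skipped, which drops the value of the first match. A minimal alternative sketch that keeps the leftmost of any overlapping matches uses an explicit cursor. The function name replace_matches is hypothetical, and the matches are assumed to be (position, key, value) triples sorted by position, as described above:

```python
def replace_matches(text, matches):
    """Replace keys by values, keeping the leftmost of any overlapping matches.

    `matches` is a list of (position, key, value) triples sorted by position.
    """
    chunks = []
    cursor = 0  # end of the last key that was replaced
    for pos, key, value in matches:
        if pos < cursor:
            continue  # overlaps the previous replacement: skip this match
        chunks.append(text[cursor:pos])  # untouched text before the key
        chunks.append(value)
        cursor = pos + len(key)
    chunks.append(text[cursor:])  # untouched text after the last key
    return "".join(chunks)

print(replace_matches("abcdef", [(1, "bcd", "1"), (2, "cde", "2")]))  # a1ef
```

The same function works for both directions, since it only depends on the list of matches and not on how they were found.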

Now, look at the result:

print(process(back_automaton, clean(process(fore_automaton, original_text))))
# "'s_morgen ik 's-Hertogenbosch depository_financial_institution"

You don't need to define a new process function for the return trip; just pass it back_automaton and it will do the job.

- jferard

- Dmitry Arkhipenko · Accepted Answer

You can use split instead of re.sub!

def do_something_with_str(string):
    # do something with string here.
    # for example let's wrap the string with "@" symbol if it's not empty
    return f"@{string}" if string else string


def get_replaced_list(string, words):
    result = [(string, True), ]

    # we take each word we want to replace
    for w in words:

        new_result = []

        # Getting each word in old result
        for r in result:

            # Now we split every string in results using our word.
            split_list = list((x, True) for x in r[0].split(w)) if r[1] else list([r, ])

            # If we replace successfully - add all the strings
            if len(split_list) > 1:

                # This one would be for [text, replaced, text, replaced...]
                sub_result = []
                ws = [(w, False), ] * (len(split_list) - 1)
                for x, replaced in zip(split_list, ws):
                    sub_result.append(x)
                    sub_result.append(replaced)
                sub_result.append(split_list[-1])

                # Add to new result
                new_result.extend(sub_result)

            # If not - just add it to results
            else:
                new_result.extend(split_list)
        result = new_result
    return result


if __name__ == '__main__':
    initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
    words_to_replace = ('a', 'c')
    replaced_list = get_replaced_list(initial_string, words_to_replace)
    modified_list = [(do_something_with_str(x[0]), True) if x[1] else x for x in replaced_list]
    final_string = ''.join([x[0] for x in modified_list])

Here are the values of the variables in the example above:

initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
words_to_replace = ('a', 'c')
replaced_list = [('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True)]
modified_list = [('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True)]
final_string = 'ac@bbc@bbcac@bbc@bbcac@bbc@bbca'

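As you can see, the lists contain tuples. Each tuple holds two values: a string and a boolean that indicates whether it is text or a replaced value (True when it is text). Once you have the replaced list, you can modify it as in the example, checking whether an item is a text value (if x[1] == True). Hope this helps!

P.S. String formatting like f"some string here {some_variable_here}" requires Python 3.6 or later.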