How to find and replace the nth occurrence of a word in a sentence using Python regular expressions?

19

Using Python regular expressions only, how can I find and replace the nth occurrence of a word in a sentence? For example:

str = 'cat goose  mouse horse pig cat cow'
new_str = re.sub(r'cat', r'Bull', str)
new_str = re.sub(r'cat', r'Bull', str, 1)
new_str = re.sub(r'cat', r'Bull', str, 2)

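I have a sentence in which the word "cat" occurs twice. I want to change the second "cat" to "Bull" and leave the first "cat" untouched. My final sentence should be: "cat goose mouse horse pig Bull cow". In the code above I tried three times and could not get the result I want.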

- juggernaut

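I think the best approach is to split the string, count the occurrences of "cat", and return a modified list with the nth one replaced. It may be slower, but that probably does not matter, and it is certainly more readable than a complicated regex. - Noufal Ibrahim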
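The split-and-count idea from this comment could look roughly like the sketch below. The name `replace_nth_word` is hypothetical, and the sketch assumes words are separated by single space characters and that only whole-word matches count.

```python
def replace_nth_word(sentence, word, replacement, n):
    """Replace the n-th (1-based) whole-word occurrence of `word`."""
    # Splitting on a single space keeps empty strings for repeated spaces,
    # so ' '.join() restores the original spacing exactly.
    tokens = sentence.split(' ')
    seen = 0
    for i, token in enumerate(tokens):
        if token == word:
            seen += 1
            if seen == n:
                tokens[i] = replacement
                break
    return ' '.join(tokens)

print(replace_nth_word('cat goose  mouse horse pig cat cow', 'cat', 'Bull', 2))
```

If there are fewer than n occurrences, the sentence comes back unchanged.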

9 Answers

8

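I use a simple function that lists all occurrences, picks the position of the nth one, and uses it to split the original string into two substrings. It then replaces the first occurrence in the second substring and joins the substrings back into a new string.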

import re

def replacenth(string, sub, wanted, n):
    where = [m.start() for m in re.finditer(sub, string)][n-1]
    before = string[:where]
    after = string[where:]
    newString = before + after.replace(sub, wanted, 1)
    print(newString)

For these variables:

string = 'ababababababababab'
sub = 'ab'
wanted = 'CD'
n = 5

Output:

ababababCDabababab

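Notes:

The where variable is built from a list of match positions, from which you pick the nth one. List indices start at 0, not 1, which is why the index is n-1 while the n variable is the ordinal of the substring you want. My example finds the fifth occurrence. If you indexed with n directly, you would have to pass n = 4 to find the fifth position. Which convention to use usually depends on the function that generates n.

This should be the simplest approach, but it is not the regex-only solution you originally asked for.

Additional sources and some links:

The where construction: How to find all occurrences of a substring?

String splitting: https://www.daniweb.com/programming/software-development/threads/452362/replace-nth-occurrence-of-any-sub-string-in-a-string

Similar question: Find the nth occurrence of substring in a string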

- aleskva

Thanks! I think you need to reassign it, like this: after=after.replace(sub, wanted, 1). I don't think it changes in place. (Also add a colon after the function definition.) - campo

4

Here is a way to do it without a regex:

def replaceNth(s, source, target, n):
    inds = [i for i in range(len(s) - len(source)+1) if s[i:i+len(source)]==source]
    if len(inds) < n:
        return  # or maybe raise an error
    s = list(s)  # can't assign to string slices. So, let's listify
    s[inds[n-1]:inds[n-1]+len(source)] = target  # do n-1 because we start from the first occurrence of the string, not the 0-th
    return ''.join(s)

Usage:

In [278]: s
Out[278]: 'cat goose  mouse horse pig cat cow'

In [279]: replaceNth(s, 'cat', 'Bull', 2)
Out[279]: 'cat goose  mouse horse pig Bull cow'

In [280]: print(replaceNth(s, 'cat', 'Bull', 3))
None

- inspectorG4dget

This is the only answer that worked for my case. - WalksB

2

I would define a function that works for any regex:

import re

def replace_ith_instance(string, pattern, new_str, i = None, pattern_flags = 0):
    # If i is None - replacing last occurrence
    match_obj = re.finditer(pattern, string, flags = pattern_flags)
    matches = [item for item in match_obj]
    if i is None:
        i = len(matches)
    if len(matches) == 0 or len(matches) < i:
        return string
    match = matches[i - 1]
    match_start_index = match.start()
    match_len = len(match.group())

    return '{0}{1}{2}'.format(string[0:match_start_index], new_str, string[match_start_index + match_len:])

A working example:

str = 'cat goose  mouse horse pig cat cow'
ns = replace_ith_instance(str, 'cat', 'Bull', 2)
print(ns)

Output:

cat goose  mouse horse pig Bull cow

Another example:

str2 = 'abc abc def abc abc'
ns = replace_ith_instance(str2, r'abc\s*abc', '666')
print(ns)

Output:

abc abc def 666

- SomethingSomething

1

To replace the nth needle with word:

s.replace(needle,'$$$',n-1).replace(needle,word,1).replace('$$$',needle)

- chvsanchez
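The placeholder trick above assumes that '$$$' never occurs in s. A possible variant that needs no placeholder walks the string with str.find instead. This is only a sketch: `replace_nth_find` is a hypothetical name, and it treats needle as literal text, not as a regex.

```python
def replace_nth_find(s, needle, word, n):
    """Replace the n-th (1-based) literal occurrence of needle in s."""
    if n < 1:
        raise ValueError("n must be at least 1")
    pos = -1
    for _ in range(n):
        # Resume the search one character past the previous hit
        # (this means overlapping occurrences are counted).
        pos = s.find(needle, pos + 1)
        if pos == -1:
            return s  # fewer than n occurrences: nothing to replace
    return s[:pos] + word + s[pos + len(needle):]

print(replace_nth_find('cat goose  mouse horse pig cat cow', 'cat', 'Bull', 2))
```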

This question (from 2014) explicitly asks for a Python regex and already has an accepted answer; this does not improve on that answer. - Jake

1

Only because none of the current answers met my needs, here is one based on aleskva's answer:

import re

def replacenth(string, pattern, replacement, n):
    assert n != 0
    matches = list(re.finditer(pattern, string))
    if len(matches) < abs(n) :
        return string
    m = matches[ n-1 if n > 0 else len(matches) + n] 
    return string[0:m.start()] + replacement + string[m.end():]

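It accepts negative match numbers (n = -1 replaces the last match) and any regex pattern, and it is efficient. If there are fewer than n matches, the original string is returned.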

- leonbloy

1

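Ideal! I was about to post a similar function before I noticed your answer. The only thing I would change is to follow the re module's conventions, for example: def sub_nth(pattern, repl, string, n): - Bryan Roach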
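A version with the re-style argument order suggested in this comment might look like the following sketch. The flags parameter and the use of m.expand() for backreferences are additions assumed here; they are not part of the answer above.

```python
import re

def sub_nth(pattern, repl, string, n, flags=0):
    """Replace only the n-th match; negative n counts from the end."""
    if n == 0:
        raise ValueError("n must be non-zero")
    matches = list(re.finditer(pattern, string, flags=flags))
    if len(matches) < abs(n):
        return string  # fewer than |n| matches: leave the input unchanged
    m = matches[n - 1 if n > 0 else n]
    # m.expand() resolves backreferences such as \1 in a string template
    return string[:m.start()] + m.expand(repl) + string[m.end():]

print(sub_nth(r'cat', 'Bull', 'cat goose  mouse horse pig cat cow', 2))
print(sub_nth(r'(\w)at', r'\1ull', 'cat bat rat', -1))
```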

0

You can match two occurrences of "cat", keep everything before the second occurrence (\1), and append "Bull":

new_str = re.sub(r'(cat.*?)cat', r'\1Bull', str, 1)

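We make only one replacement to avoid also replacing the fourth, sixth, etc. occurrence of "cat" (when there are at least four), as Avinash Raj's comment pointed out.

If you want to replace the nth occurrence rather than the second, use the following: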

n = 2
new_str = re.sub('((?:cat.*?){%d})cat' % (n - 1), r'\1Bull', str, 1)

By the way, you should not use str as a variable name, because it shadows Python's built-in str type.

- Pierre
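One detail worth checking when generalizing to n: a capturing group repeated with a quantifier, such as (cat.*?){2}, retains only the text of its last iteration, so \1 would drop the earlier text when n is 3 or more. The sketch below wraps the repetition in an outer group. `replace_nth_literal` is a hypothetical name, and the word is passed through re.escape on the assumption that it is literal text.

```python
import re

def replace_nth_literal(text, word, replacement, n):
    # Outer group captures everything before the n-th occurrence;
    # the inner non-capturing group is repeated n - 1 times.
    pattern = r'((?:%s.*?){%d})%s' % (re.escape(word), n - 1, re.escape(word))
    # A function replacement avoids having to escape backslashes in `replacement`.
    return re.sub(pattern, lambda m: m.group(1) + replacement, text, count=1)

text = 'cat a cat b cat c'
# Repeated capturing group: \1 holds only 'cat b ', so 'cat a ' is lost.
print(re.sub(r'(cat.*?){2}cat', r'\1Bull', text, count=1))
# Outer group keeps the whole prefix.
print(replace_nth_literal(text, 'cat', 'Bull', 3))
```

Because . does not match newlines by default, this sketch only finds occurrences on the same line as the earlier ones.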

1

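Note that your solution fails if the input is cat cat cat goose mouse cat. - Avinash Raj

Then why do you use "str" as a variable name? - Avinash Raj

@Avinash Raj: I used the variable from the question as it was, without changing it. - Pierre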

0

Create a repl function to pass to re.sub(). The trick is to make it a class so that you can keep track of the call count.

class ReplWrapper(object):
    def __init__(self, replacement, occurrence):
        self.count = 0
        self.replacement = replacement
        self.occurrence = occurrence
    def repl(self, match):
        self.count += 1
        if self.occurrence == 0 or self.occurrence == self.count:
            return match.expand(self.replacement)
        # every other match is returned unchanged
        return match.group(0)

Then use it like this:

myrepl = ReplWrapper(r'Bull', 0) # replaces all instances in a string
new_str = re.sub(r'cat', myrepl.repl, str)

myrepl = ReplWrapper(r'Bull', 1) # replaces 1st instance in a string
new_str = re.sub(r'cat', myrepl.repl, str)

myrepl = ReplWrapper(r'Bull', 2) # replaces 2nd instance in a string
new_str = re.sub(r'cat', myrepl.repl, str)

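I am sure there is a cleverer way that avoids the class, but this seemed straightforward enough. Also, make sure to return match.expand(): returning the replacement value directly is technically incorrect if someone decides to use \1-style templates.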

- woot
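One possible way to avoid the class is a closure that carries its own counter. In this sketch `make_nth_repl` is a hypothetical name, and occurrence 0 is kept to mean "replace all", as in the class above.

```python
import itertools
import re

def make_nth_repl(replacement, occurrence):
    """Build a repl callable that rewrites only the chosen match (0 = all)."""
    counter = itertools.count(1)  # yields 1, 2, 3, ... once per match
    def repl(match):
        current = next(counter)
        if occurrence == 0 or occurrence == current:
            return match.expand(replacement)
        return match.group(0)  # leave every other match untouched
    return repl

s = 'cat goose  mouse horse pig cat cow'
print(re.sub(r'cat', make_nth_repl(r'Bull', 2), s))
```

The counter lives inside the returned function, so a fresh callable should be created for each re.sub() call.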

0

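I approached this by generating a "grouped" version of the desired catch pattern relative to the whole string, and then applying the substitution directly to that instance.

The parent function is regex_n_sub, and it takes the same inputs as the re.sub() method.

The catch pattern is passed, together with the instance number, to get_nsubcatch_catch_pattern(). Inside, a list comprehension generates n copies of the pattern '.*?' (match any character, zero or more repetitions, non-greedy). This pattern represents the text between occurrences of the catch pattern.

Next, the input catch pattern is placed between each of these "space patterns", and the result is wrapped in parentheses to form the first group.

The second group is just the catch pattern wrapped in parentheses, so when the two groups are combined they form a pattern for "all text up to the nth occurrence of the catch pattern". This new_catch_pattern has two groups built in, so the second group, which contains the nth occurrence of the catch pattern, can be replaced.

The replace pattern is passed to get_nsubcatch_replace_pattern() and combined with the prefix r'\g<1>' to form the pattern \g<1> + replace_pattern. The \g<1> part refers to group 1 of the catch pattern and re-inserts that group's text unchanged; the replace pattern's text then takes the place of the second group.

The code below is verbose only to make the flow easier to follow; it can be reduced as needed.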

--

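The example below should run stand-alone, and corrects the 4th instance of "I" to "me" in:

"When I go to the park and I am alone I think the ducks laugh at I but I'm not sure."

with:

"When I go to the park and I am alone I think the ducks laugh at me but I'm not sure."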

import regex as re

def regex_n_sub(catch_pattern, replace_pattern, input_string, n, flags=0):
    new_catch_pattern, new_replace_pattern = generate_n_sub_patterns(catch_pattern, replace_pattern, n)
    return_string = re.sub(new_catch_pattern, new_replace_pattern, input_string, count=1, flags=flags)
    return return_string

def generate_n_sub_patterns(catch_pattern, replace_pattern, n):
    new_catch_pattern = get_nsubcatch_catch_pattern(catch_pattern, n)
    new_replace_pattern = get_nsubcatch_replace_pattern(replace_pattern, n)
    return new_catch_pattern, new_replace_pattern

def get_nsubcatch_catch_pattern(catch_pattern, n):
    space_string = '.*?'
    space_list = [space_string for i in range(n)]
    first_group = catch_pattern.join(space_list)
    first_group = first_group.join('()')
    second_group = catch_pattern.join('()')
    new_catch_pattern = first_group + second_group
    return new_catch_pattern

def get_nsubcatch_replace_pattern(replace_pattern, n):
    new_replace_pattern = r'\g<1>' + replace_pattern
    return new_replace_pattern


### use test ###
catch_pattern = 'I'
replace_pattern = 'me'
test_string = "When I go to the park and I am alone I think the ducks laugh at I but I'm not sure."

regex_n_sub(catch_pattern, replace_pattern, test_string, 4)

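This code can be copied directly into a workflow; the regex_n_sub() call returns the string with the replacement applied.

Let me know if the implementation fails!

Thanks!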

- jameshollisandrew

- Avinash Raj · Accepted Answer

Use a negative lookahead like below.

>>> s = "cat goose  mouse horse pig cat cow"
>>> re.sub(r'^((?:(?!cat).)*cat(?:(?!cat).)*)cat', r'\1Bull', s)
'cat goose  mouse horse pig Bull cow'

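^ asserts that we are at the start of the string.
(?:(?!cat).)* matches any character, zero or more times, as long as it does not begin the substring cat.
cat matches the first cat substring.
(?:(?!cat).)* matches any character, zero or more times, as long as it does not begin the substring cat.
Now enclose all of these patterns in a capturing group, as in ((?:(?!cat).)*cat(?:(?!cat).)*), so that the captured characters can be referred to later.
cat then matches the second cat string.

OR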

>>> s = "cat goose  mouse horse pig cat cow"
>>> re.sub(r'^(.*?(cat.*?){1})cat', r'\1Bull', s)
'cat goose  mouse horse pig Bull cow'

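Change the number inside {} to replace the first, second, or nth occurrence of the string cat. The number is one less than the occurrence to replace (0 for the first, 1 for the second).

To replace the third occurrence of the string cat, put 2 inside the curly braces.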

>>> re.sub(r'^(.*?(cat.*?){2})cat', r'\1Bull', "cat goose  mouse horse pig cat foo cat cow")
'cat goose  mouse horse pig cat foo Bull cow'
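The lookahead pattern can also be assembled for an arbitrary n. The following is a sketch under stated assumptions: `replace_nth_tempered` is a hypothetical name, the word is treated as literal text via re.escape, and a function replacement is used so the replacement text needs no escaping.

```python
import re

def replace_nth_tempered(text, word, replacement, n):
    w = re.escape(word)
    # Any run of characters in which `word` does not start.
    not_word = r'(?:(?!%s).)*' % w
    # Prefix: n - 1 occurrences of `word`, each followed by word-free text.
    prefix = r'^(%s(?:%s%s){%d})' % (not_word, w, not_word, n - 1)
    return re.sub(prefix + w, lambda m: m.group(1) + replacement, text, count=1)

print(replace_nth_tempered('cat goose  mouse horse pig cat cow', 'cat', 'Bull', 2))
print(replace_nth_tempered('cat goose  mouse horse pig cat foo cat cow', 'cat', 'Bull', 3))
```

When the text holds fewer than n occurrences, the pattern does not match and the text is returned unchanged.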
