Replace all consecutive repeated letters, ignoring specific words

Question

5

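I have seen many suggestions to use re (regular expressions) or .join in Python to remove consecutive repeated letters from a sentence, but I want to make an exception for specific words.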

For example, I want to turn the sentence "sentence = 'hello, join this meeting heere using thiis lllink'" into "hello, join this meeting here using this link", given a list of words to keep and exclude from the repeated-letter check: "keepWord = ['Hello', 'meeting']".

Here are the two scripts I found useful:

Using .join:

import itertools

sentence = ''.join(c[0] for c in itertools.groupby(sentence))

Using regex:

import re

sentence = re.compile(r'(.)\1{1,}').sub(r'\1', sentence)
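
For reference, here is a small check of what the two snippets above do when applied to the whole example sentence with no exceptions. It is only a sketch; the variable names by_groupby and by_regex are made up for illustration. Both approaches also collapse the double letters in "hello" and "meeting", which is the behaviour the exception list is meant to prevent.

```python
import itertools
import re

sentence = 'hello, join this meeting heere using thiis lllink'

# groupby yields one (key, group) pair per run of identical characters
by_groupby = ''.join(key for key, _ in itertools.groupby(sentence))

# (.)\1+ matches any character followed by one or more copies of itself
by_regex = re.sub(r'(.)\1+', r'\1', sentence)

print(by_groupby)
print(by_regex)
# both print: helo, join this meting here using this link
```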

I have a solution, but I think there should be a more compact and efficient one. My current solution is:

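import itertools

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello', 'meeting']

new_words = []

for word in sentence.split():
    if word not in keepWord:
        # Collapse each run of identical characters to a single character
        new_words.append(''.join(c[0] for c in itertools.groupby(word)))
    else:
        # Words in keepWord are left untouched
        new_words.append(word)

# Note: 'hello,' (with the comma attached) is not in keepWord, so it still becomes 'helo,'
new_sentence = ' '.join(new_words)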

Any suggestions?

- Aisha

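What do you expect in the case of Hellllo? - Chris

Well, I did not handle that case in my suggestion; it could be solved by ignoring the first occurrence of the letters under the else branch. - Aisha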

2 Answers

1

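Although not especially compact, here is a fairly simple example using regular expressions: the function subst replaces repeated characters with a single one, and re.sub calls it for each word found. Your example keepWord list (where first mentioned) has Hello in title case while the text has hello in lower case, so I assume you want a case-insensitive comparison against the list. The code therefore works the same whether your sentence contains Hello or hello.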

import re

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['Hello','meeting']

keepWord_s = set(word.lower() for word in keepWord)

def subst(match):
    word = match.group(0)
    return word if word.lower() in keepWord_s else re.sub(r'(.)\1+', r'\1', word)

print(re.sub(r'\b.+?\b', subst, sentence))

Gives:

hello, join this meeting here using this link
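
As a rough sketch of the same idea wrapped in a reusable function (the name collapse_repeats and its parameters are hypothetical, not part of the answer), the version below matches runs of word characters with \w+ instead of \b.+?\b, so the text between words is passed through untouched. Note that a stretched keep word such as Hellllo is not recognised as Hello and still collapses to Helo.

```python
import re

def collapse_repeats(sentence, keep_words):
    """Collapse repeated characters in every word not listed in keep_words."""
    # Lower-case set for case-insensitive membership tests
    keep = {w.lower() for w in keep_words}

    def subst(match):
        word = match.group(0)
        if word.lower() in keep:
            return word  # protected word: leave as written
        return re.sub(r'(.)\1+', r'\1', word)

    # \w+ visits each run of word characters; punctuation and spaces are skipped
    return re.sub(r'\w+', subst, sentence)

print(collapse_repeats('Hello, join this meeting heere using thiis lllink',
                       ['hello', 'meeting']))
# Hello, join this meeting here using this link
```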

- alani

- Wiktor Stribiżew · Accepted Answer

You can match the whole words from the keepWord list and replace sequences of two or more identical letters only in all other contexts:

import re
sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']
new_sentence = re.sub(fr"\b(?:{'|'.join(keepWord)})\b|([^\W\d_])\1+", lambda x: x.group(1) or x.group(), sentence)
print(new_sentence)
# => hello, join this meeting here using this link

See the Python demo.

The regex will look like this:

\b(?:hello|meeting)\b|([^\W\d_])\1+

See the regex demo. If Group 1 matched, its value is returned; otherwise the full match (the word to keep) is put back unchanged.

Pattern details

\b(?:hello|meeting)\b - hello or meeting, enclosed in word boundaries
| - or
([^\W\d_]) - Group 1: any Unicode letter
\1+ - one or more backreferences to the Group 1 value
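
The pattern breakdown above can be turned into a small helper. This is a minimal sketch, not the answer's own code: the function name collapse_except is hypothetical, and it makes two assumptions beyond the original, namely that keep words should be escaped with re.escape in case they contain regex metacharacters, and that matching should ignore case so that Hello and hello are both protected. With re.IGNORECASE the backreference is also compared case-insensitively, so a mixed-case run such as LLl collapses to its first letter.

```python
import re

def collapse_except(sentence, keep_words):
    """Collapse runs of a repeated letter, except inside whole keep words."""
    letters = r"([^\W\d_])\1+"  # Group 1: a letter, then one or more repeats
    if keep_words:
        # Escape each keep word so it is matched literally
        alternation = '|'.join(re.escape(w) for w in keep_words)
        pattern = rf"\b(?:{alternation})\b|{letters}"
    else:
        pattern = letters
    # If Group 1 matched, keep a single letter; otherwise keep the whole word
    return re.sub(pattern, lambda m: m.group(1) or m.group(),
                  sentence, flags=re.IGNORECASE)

print(collapse_except('Hello, join this meeting heere using thiis lllink',
                      ['hello', 'meeting']))
# Hello, join this meeting here using this link
```

Because the keep-word alternative comes first in the pattern, it wins whenever a protected word starts at the current position; the repeated-letter alternative only applies elsewhere.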