How to remove consecutive duplicate words from a string in Python

Question

9

I have a string from which I need to remove adjacent duplicate words.

mystring = "my friend's new new new new and old old cats are running running in the street"

My output should look like this.

myoutput = "my friend's new and old cats are running in the street"

I am currently doing this with the following Python code.

 mylist = []
 for i, w in enumerate(mystring.split()):
     for n, l in enumerate(mystring.split()):
             if l != w and i == n-1:
                     mylist.append(w)
 mylist.append(mystring.split()[-1])
 myoutput = " ".join(mylist)

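However, my code runs in O(n²) time, which is very inefficient because my dataset is huge. Is there a more efficient way to do this in Python? I'm happy to provide more details if needed.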

- EmJ

7 Answers

5

Use itertools.groupby:

import itertools

>>> ' '.join(k for k, _ in itertools.groupby(mystring.split()))
"my friend's new and old cats are running in the street"

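mystring.split() splits mystring into a list of words.
itertools.groupby groups runs of consecutive equal words in a single pass; k is the word shared by each run.
The generator expression keeps only the group keys.
Finally, the keys are joined with spaces.

The running time is linear in the length of the input string.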
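
To make the grouping step more concrete, here is a small sketch that shows what groupby yields before the keys are joined. The helper names run_lengths and dedupe are made up for illustration and are not part of the answer itself.

```python
from itertools import groupby

def run_lengths(text):
    """Return (word, run_length) pairs for runs of identical adjacent words."""
    # groupby yields (key, group_iterator) pairs; each group iterator is
    # consumed immediately here in order to count the words in the run.
    return [(word, sum(1 for _ in group)) for word, group in groupby(text.split())]

def dedupe(text):
    """Keep one word per run of identical adjacent words."""
    # Keeping only the keys drops the repeats inside each run.
    return " ".join(word for word, _ in groupby(text.split()))

sample = "new new new new and old old cats"
print(run_lengths(sample))  # [('new', 4), ('and', 1), ('old', 2), ('cats', 1)]
print(dedupe(sample))       # new and old cats
```

Note that groupby only merges adjacent equal items, so a word that reappears later in the string after other words is kept.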

- Ami Tavory

2

Try this:

mystring = "my friend's new new new new and old old cats are running running in the street"

words = mystring.split()

answer = [each_pair[0] for each_pair in zip(words, words[1:]) if each_pair[0] != each_pair[1]] + [words[-1]]

print(' '.join(answer))

Output:

my friend's new and old cats are running in the street

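This program iterates over tuples of consecutive words and appends the first word of each tuple to the answer whenever the two words in the tuple differ. The last word is then appended separately, because it never appears as the first element of a pair.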
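
One edge case worth guarding: words[-1] raises IndexError when the string is empty or contains only whitespace. A minimal sketch of the same pairwise idea wrapped in a function, where the name dedupe_pairs is hypothetical:

```python
def dedupe_pairs(text):
    """Remove adjacent duplicate words by comparing each word with its successor."""
    words = text.split()
    if not words:
        # Nothing to compare, and words[-1] would raise IndexError.
        return ""
    # Keep a word only when the word that follows it is different.
    kept = [a for a, b in zip(words, words[1:]) if a != b]
    # The last word is never the first element of a pair, so add it here.
    kept.append(words[-1])
    return " ".join(kept)

print(dedupe_pairs("my friend's new new new new and old old cats"))
```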

- tkhurana96

2

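And now for something different. This solution uses generators right up to the point where the result string is reassembled, to save as much memory as possible in case the original string is very large.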

import re

def remove_duplicates_helper(s):
    words = (x.group(0) for x in re.finditer(r"[^\s]+", s))
    current = None
    for word in words:
        if word != current:
            yield word
            current = word

def remove_duplicates(s):
    return ' '.join(remove_duplicates_helper(s))

mystring = "my friend's new new new new and old old cats are running running in the street"
print(remove_duplicates(mystring))

Output:

my friend's new and old cats are running in the street
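
As a rough illustration of the laziness, the generator can be consumed partially without scanning the whole input. In the sketch below, iter_unique_runs is a renamed copy of the helper so that the snippet is self-contained, and r"\S+" is used as an equivalent spelling of r"[^\s]+".

```python
import re
from itertools import islice

def iter_unique_runs(s):
    """Lazily yield each word that differs from the word right before it."""
    previous = None
    for match in re.finditer(r"\S+", s):
        word = match.group(0)
        if word != previous:
            yield word
            previous = word

# A long input; only the beginning is scanned to produce three words.
text = "new new and old old cats " * 1000
first_three = list(islice(iter_unique_runs(text), 3))
print(first_three)  # ['new', 'and', 'old']
```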

- Booboo

1

Please have a look at my code:

def strip2single(textarr):
    if len(textarr)==0:
        return ""
    result=textarr[0]
    for i in range(1,len(textarr)):
        if textarr[i]!=textarr[i-1]:
            result=result+' '+textarr[i]
    return(result)


mystring = "my friend's new new new new and old old cats are running running in the street"

y=strip2single(mystring.split())
print(y)

- Krishna Rao

1

There is an O(n) solution to this problem.

mystring = "my friend's new new new new and old old cats are running running in the street"

Split the string into words:

words = mystring.split()

Skip a word if it is the same as the previous word:

myoutput = ' '.join([x for i,x in enumerate(words) if i==0 or x!=words[i-1]])

- s.singh

1

The enumeration is performed twice. Changing the code along these lines would make it more efficient.

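 mylist = []
 # Split and enumerate once. list() is needed because an enumerate object
 # is a one-shot iterator and cannot be shared by two nested loops.
 l1 = list(enumerate(mystring.split()))

 # Still O(n²): only the repeated split/enumerate work is avoided.
 for i, w in l1:
     for n, l in l1:
             if l != w and i == n-1:
                     mylist.append(w)
 mylist.append(l1[-1][1])
 myoutput = " ".join(mylist)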
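
One thing to keep in mind with this approach: enumerate returns a one-shot iterator, so it has to be materialized (for example with list()) before it is reused in nested loops. The sketch below, with made-up variable names, shows what happens when a bare enumerate object is shared, followed by a single-pass variant over the materialized pairs.

```python
words = "new new and old".split()

# Sharing one enumerate object: the inner loop drains it, so the outer
# loop runs only once and most (i, n) combinations are never visited.
shared = enumerate(words)
visited = []
for i, w in shared:
    for n, l in shared:
        visited.append((i, n))
print(visited)  # [(0, 1), (0, 2), (0, 3)]

# Materializing the pairs once allows repeated or indexed access.
pairs = list(enumerate(words))
# Keep a word if it is the last one or differs from the word after it.
result = [w for i, w in pairs if i == len(pairs) - 1 or pairs[i + 1][1] != w]
print(" ".join(result))  # new and old
```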

- Archulan Rajakumaran

- RomanPerekhrest · Accepted Answer

Short regex magic:

import re

mystring = "my friend's new new new new and old old cats are running running in the street"
res = re.sub(r'\b(\w+\s*)\1{1,}', '\\1', mystring)
print(res)

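Regex pattern details:

\b - word boundary
(\w+\s*) - one or more word characters \w+ followed by any number of whitespace characters \s*, enclosed in a capturing group (...)
\1{1,} - a backreference to the first captured group, occurring one or more times {1,}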

The output:

my friend's new and old cats are running in the street
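
A caveat, as far as I can tell from tracing the pattern: because \s* may match nothing, it also collapses a repeat inside a single word ("haha" becomes "ha"), and it can miss a repeat at the very end of the string ("the the" stays unchanged, since the last word has no trailing space). The sketch below is one possible stricter variant, not the pattern from the answer. It treats a word as any run of non-whitespace characters, and the names _REPEATED_WORD and dedupe_regex are made up for illustration.

```python
import re

# (?<!\S)  the word must start at the string start or after whitespace
# (\S+)    capture one whole word (any run of non-whitespace characters)
# (?:\s+\1(?!\S))+  one or more repeats of that exact word, each ending
#                   at whitespace or at the end of the string
_REPEATED_WORD = re.compile(r"(?<!\S)(\S+)(?:\s+\1(?!\S))+")

def dedupe_regex(text):
    """Collapse runs of identical whitespace-separated words into one word."""
    return _REPEATED_WORD.sub(r"\1", text)

mystring = "my friend's new new new new and old old cats are running running in the street"
print(dedupe_regex(mystring))
```

Unlike the split-and-join solutions, this keeps the original whitespace between the words that remain.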