Print the words between two specific words in a given string

Question

3

If one specific word is not followed by the other specific word, skip it. Here is my string:

x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'

I need to print and count all the words between john and dead, death, or died. If a john is not followed by any of died, dead, or death before the next john, skip it and start counting again from the next john.

My code:

import re  # needed for re.sub and re.findall

x = re.sub(r'[^\w]', ' ', x)  # replace every dot, comma, and other symbol with a space

for i in re.findall(r'(?<=john)' + '(.*?)' + '(?=dead|died|death)', x):
    print(i)
    print(len(i.split()))

My output:

 got shot 
2
 with his          john got killed or 
6
 with his wife 
3

The output I want:

got shot
2
got killed or
3
with his wife
3

I can't tell what I'm doing wrong. This is just a sample input; I have to check 20,000 inputs at a time.

- Ganesh_

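It's not very clear what you mean. Since "with his john got killed or" comes after the word john, doesn't it count as 6? - Marlon Abeykoon

@MarlonAbeykoon In "john with his .... ? , john got killed or died", the first john is not followed by dead, died, or death, so start again from the second john. The output I want is "got killed or", not "with his john got killed or". - Ganesh_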

2 Answers

2

You can use this regex, which relies on a negative lookahead:

>>> for i in re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', x):
...     print(i.strip())
...     print(len(i.split()))
...

got shot
2
got killed or
3
with his wife
3

Instead of your .*?, this regex uses (?:(?!john).)*?. That expression lazily matches zero or more of any character, but only as long as the match does not contain john.
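To make the difference concrete, here is a minimal sketch that runs both patterns on a shortened version of the question's string (the shortened sample and the variable names are my own, chosen for illustration):

```python
import re

# Shortened sample from the question, punctuation already replaced by spaces.
text = "john with his john got killed or died in 1990"

# Plain lazy match: starts after the first john and runs across the second one.
plain = re.findall(r"(?<=john)(.*?)(?=dead|died|death)", text)

# Tempered match: a character is consumed only if "john" does not start there,
# so the attempt that begins after the first john fails and the second one wins.
tempered = re.findall(r"(?<=john)(?:(?!john).)*?(?=dead|died|death)", text)

print([m.strip() for m in plain])     # ['with his john got killed or']
print([m.strip() for m in tempered])  # ['got killed or']
```

The lookahead (?!john) is checked before every single character, which is what stops a match from swallowing a later john.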

I also suggest using word boundaries so that only whole words match:

re.findall(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)', x)
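As a rough illustration of why the boundaries matter, the sketch below uses a made-up sentence (not from the question) in which "johnny" and "deadline" contain the keywords as substrings:

```python
import re

# Hypothetical sample: the keywords appear inside longer words.
text = "johnny met a deadline. john was found dead"

loose = r"(?<=john)(?:(?!john).)*?(?=dead|died|death)"
strict = r"(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)"

loose_hits = re.findall(loose, text)
strict_hits = re.findall(strict, text)

print([m.strip() for m in loose_hits])   # ['ny met a', 'was found']
print([m.strip() for m in strict_hits])  # ['was found']
```

Without \b, the pattern treats the "john" in "johnny" as a start and the "dead" in "deadline" as an end, which produces a spurious match.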

- anubhava

1

More elegant than my solution; go with this one. - jbndlr

- jbndlr · Accepted Answer

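I assume you want to start over whenever another john appears in the string before dead|died|death.

In that case, you can split the string on the word john and match within each of the resulting parts: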

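import re

x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
# Drop punctuation, then collapse runs of whitespace into single spaces
x = re.sub(r'\W+', ' ', re.sub(r'[^\w ]', '', x)).strip()
for e in x.split('john'):
    # Lazily capture everything up to the first ending word in this part
    m = re.match(r'(.+?)(dead|died|death)', e)
    if m:
        print(m.group(1))
        print(len(m.group(1).split()))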

This produces:

 got shot 
2
 got killed or 
3
 with his wife 
3

Also note that after the substitutions I suggested (and before splitting and matching), the string looks like this:

john got shot dead john with his john got killed or died in 1990 john with his wife dead or died

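That is, it contains no runs of multiple spaces. You could handle that later when splitting on whitespace, but I find this cleaner and more concise.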
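Since the question mentions checking many inputs at once, the split approach could be wrapped in a function along these lines. This is only a sketch: the function name words_between, the precompiled patterns, and the two-line input list are assumptions of mine. It also differs from the snippet above in two ways, by using word boundaries and by skipping the text before the first john.

```python
import re

# Compile once; the same patterns are reused for every input string.
NON_WORD = re.compile(r"\W+")
START = re.compile(r"\bjohn\b")
SEGMENT = re.compile(r"(.+?)\b(?:dead|died|death)\b")

def words_between(text):
    """Return one word list per john that is followed by dead/died/death."""
    cleaned = NON_WORD.sub(" ", text).strip()
    results = []
    # Assumption: the part before the first john has no start word, so skip it.
    for part in START.split(cleaned)[1:]:
        m = SEGMENT.match(part)
        if m:
            results.append(m.group(1).split())
    return results

inputs = [
    "john got shot dead. john with his .... ? , john got killed or died in 1990.",
    "john with his wife dead or died",
]
for line in inputs:
    for words in words_between(line):
        print(" ".join(words), len(words))
```

Returning lists of words rather than raw substrings means the caller gets both the text and the count without having to strip whitespace again.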