Print the words between two specific words in a given string

3

If a particular word is not followed by one of the other particular words, skip it. Here is my string:

x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'

I need to print and count all the words between john and dead, death, or died. If a john is not followed by any of died, dead, or death, skip it and start counting again from the next john.
My code:
import re

x = re.sub(r'[^\w]', ' ', x)  # replace dots, commas, and other special symbols with spaces

for i in re.findall(r'(?<=john)' + '(.*?)' + '(?=dead|died|death)', x):
    print(i)
    print(len(i.split()))

My output:

 got shot 
2
 with his          john got killed or 
6
 with his wife 
3

The output I want:
got shot
2
got killed or
3
with his wife
3

I don't know what mistake I made. This is just a sample input; I have to check 20,000 inputs at a time.

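It isn't very clear what you mean. Since "with his john got killed or" comes after the word john, shouldn't it count as 6 words? - Marlon Abeykoon
@MarlonAbeykoon In "john with his .... ? , john got killed or died", the first john is not followed by dead, died, or death, so start from the second john. The output I want is "got killed or", not "with his john got killed or". - Ganesh_
2 Answers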

2

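I guess you want to start over whenever another john appears in the string before dead|died|death.

In that case, you can split the string on the word john and match within each of the resulting parts:

import re

x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
# Drop punctuation, then collapse runs of whitespace into single spaces
x = re.sub(r'\W+', ' ', re.sub(r'[^\w ]', '', x)).strip()
for e in x.split('john'):
    # Lazily capture everything up to the first stop word in this part
    m = re.match(r'(.+?)(dead|died|death)', e)
    if m:
        print(m.group(1))
        print(len(m.group(1).split()))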

This produces:
 got shot 
2
 got killed or 
3
 with his wife 
3

Also note that after the substitutions I propose (before splitting and matching), the string looks like this:
john got shot dead john with his john got killed or died in 1990 john with his wife dead or died

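That is, it contains no runs of multiple spaces. You could handle this later by splitting on whitespace, but I think this way is cleaner and more concise.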

Nice solution, but it doesn't handle the part before the first john correctly. Adding a [1:] slice fixes it :) - Rafael Albert
1
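Right, if the sentence starts with "... dead john" (i.e. the text before the first john contains one of the three stop words), that part would also be treated as a match. I'll fix that. - jbndlr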
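A minimal sketch of the split approach with the [1:] slice from the comment above applied. The function name `words_after_john` is made up for illustration; the regexes are the ones from the answer, written as raw strings.

```python
import re


def words_after_john(text):
    """Return (phrase, word_count) pairs (hypothetical helper name)."""
    # Drop punctuation, then collapse whitespace runs into single spaces.
    cleaned = re.sub(r'\W+', ' ', re.sub(r'[^\w ]', '', text)).strip()
    results = []
    # [1:] skips whatever precedes the first john, so that part can never match.
    for part in cleaned.split('john')[1:]:
        m = re.match(r'(.+?)(dead|died|death)', part)
        if m:
            phrase = m.group(1).strip()
            results.append((phrase, len(phrase.split())))
    return results


x = ('john got shot dead. john with his .... ? , john got killed or died '
     'in 1990. john with his wife dead or died')
for phrase, count in words_after_john(x):
    print(phrase)
    print(count)
```

With the slice in place, an input such as 'he is dead john got shot dead' yields only the phrase after john; without it, the leading 'he is ' would be reported as well.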

2
You can use this negative lookahead regex:
>>> for i in re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', x):
...     print(i.strip())
...     print(len(i.split()))
...

got shot
2
got killed or
3
with his wife
3

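This regex replaces your .*? with (?:(?!john).)*?. That expression lazily matches zero or more characters, but consumes a character only if john does not start at that position, so the match can never contain john.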
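To make the difference concrete, here is a small comparison on a shortened input (the sample string and the variable names `plain` and `tempered` are chosen for illustration only). A capture group is added to both patterns so that findall returns the matched text.

```python
import re

sample = 'john with his john got killed or died'

# Plain lazy match: starts at the first john and runs across the second one.
plain = re.findall(r'(?<=john)(.*?)(?=dead|died|death)', sample)

# Lookahead-guarded match: cannot cross another john, so the attempt at the
# first john fails and the match starts at the second john instead.
tempered = re.findall(r'(?<=john)((?:(?!john).)*?)(?=dead|died|death)', sample)

print(plain)     # [' with his john got killed or ']
print(tempered)  # [' got killed or ']
```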

I would also suggest using word boundaries so that only whole words match:

re.findall(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)', x)
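A sketch of how the word-boundary version might be wrapped for many inputs. The names `PATTERN` and `count_words` are hypothetical, and compiling the pattern once is a tidiness choice for the 20,000-input loop, not a measured optimization.

```python
import re

# Word-boundary version of the pattern, compiled once and reused.
PATTERN = re.compile(
    r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)'
)


def count_words(text):
    """Return (phrase, word_count) for each john ... stop-word span."""
    return [(m.strip(), len(m.split())) for m in PATTERN.findall(text)]


# 'deadline' is not a whole stop word, so matching continues to 'died'.
print(count_words('john met a deadline and died'))
# 'johnson' is not the whole word john, so only the second sentence matches.
print(count_words('johnson died; john got shot dead'))
```

Without the boundaries, the first input would stop at the "dead" inside "deadline", and the second would treat "johnson" as a starting point.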



1
More elegant than my solution; go with this one. - jbndlr
