Capturing noun possessives and prefixes with Python regular expressions

Question


python regex

3

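I am trying to write a Python regular expression that captures the various forms of "archipelago" that appear in a corpus.

Here is a test string:

This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'

I want to capture the following from the string: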

archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes

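Attempt 1

Using the regex (archipelag.*?)\b and testing it on Pythex, I capture part of all six forms. But there are problems:

archipelago's is only captured as archipelago. I want to get the possessive.
meta-archipelagic is only captured as archipelagic. I want to be able to capture hyphenated prefixes.
protoarchipelagic is only captured as archipelagic. I want to be able to capture non-hyphenated prefixes.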
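
The three problems above can be reproduced with a minimal sketch that runs the Attempt 1 pattern through re.findall (the variable names are illustrative, not from the question). The lazy .*? stops at the first word boundary, which falls just before the apostrophe, and nothing in the pattern looks to the left of "archipelag", so prefixes are dropped.

```python
import re

corpus = ("This is my sentence about islands, archipelagos, and archipelagic spaces. "
          "I want to make sure that the archipelago's cat is not forgotten. "
          "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
          "who tend to spell the plural 'archipelagoes.'")

# Lazy match from "archipelag" up to the nearest word boundary.
attempt1 = re.findall(r"(archipelag.*?)\b", corpus)
print(attempt1)
# ['archipelagos', 'archipelagic', 'archipelago',
#  'archipelagic', 'archipelagic', 'archipelagoes']
```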

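Attempt 2

If I instead try the regex (archipelag.*?)\s (see Pythex), the possessive archipelago's is now captured, but so is the comma that follows the first instance (e.g., archipelagos,). It also fails to capture the final 'archipelagoes.' at all.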
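
A small sketch of the Attempt 2 behaviour, again using re.findall with illustrative variable names. Requiring \s after the match pulls trailing punctuation into the group, and the last form is lost because no whitespace follows it at the end of the string.

```python
import re

corpus = ("This is my sentence about islands, archipelagos, and archipelagic spaces. "
          "I want to make sure that the archipelago's cat is not forgotten. "
          "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
          "who tend to spell the plural 'archipelagoes.'")

# Lazy match from "archipelag" up to the next whitespace character.
attempt2 = re.findall(r"(archipelag.*?)\s", corpus)
print(attempt2)
# ['archipelagos,', 'archipelagic', "archipelago's",
#  'archipelagic', 'archipelagic']
```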

- Brian Croxall

3 Answers

1

Write the regular expression more specifically. This one should help:

\b([a-zA-Z-]*archipelag[a-zA-Z']+)\b

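Explanation:

\b asserts a position at a word boundary
[a-zA-Z-]* matches zero or more letters or hyphens
[a-zA-Z']+ matches one or more letters or apostrophes

You can check it here.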
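
As a quick check, here is a sketch that applies this pattern to the test string from the question with re.findall (the variable names are illustrative). The explicit character classes on both sides of "archipelag" are what let the match include prefixes and the possessive while excluding commas and periods.

```python
import re

corpus = ("This is my sentence about islands, archipelagos, and archipelagic spaces. "
          "I want to make sure that the archipelago's cat is not forgotten. "
          "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
          "who tend to spell the plural 'archipelagoes.'")

# Optional prefix of letters/hyphens, the stem, then letters/apostrophes.
pattern = re.compile(r"\b([a-zA-Z-]*archipelag[a-zA-Z']+)\b")
forms = pattern.findall(corpus)
print(forms)
# ['archipelagos', 'archipelagic', "archipelago's",
#  'meta-archipelagic', 'protoarchipelagic', 'archipelagoes']
```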

- ailin

1

I tried this, and it works:

[a-zA-Z-]*arch[a-zA-Z']*

- alessandrocb

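This works for the test string, but I think it would also match "architecture", "arches", and other similar words. My understanding is that it should be ([a-zA-Z-]*?archipel[a-zA-Z']*). - Brian Croxall

@BrianCroxall Yes, that's right. ([a-zA-Z-]*?archipel[a-zA-Z']*) would be a better answer. - alessandrocb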
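
The concern raised in the comments can be illustrated with a short sketch. The sample sentence below is made up for the demonstration and is not part of the original corpus.

```python
import re

# Hypothetical sample containing unrelated words that begin with "arch".
sample = "architecture and arches near the archipelago"

# The shorter stem "arch" also matches unrelated words.
loose = re.findall(r"[a-zA-Z-]*arch[a-zA-Z']*", sample)
print(loose)   # ['architecture', 'arches', 'archipelago']

# The longer stem "archipel" keeps only the intended word.
tight = re.findall(r"([a-zA-Z-]*?archipel[a-zA-Z']*)", sample)
print(tight)   # ['archipelago']
```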

Source: Stack Overflow.

- Patrick Haugh · Accepted Answer

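The regex ((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?) works for this. If you have other requirements, you may need to modify it further.

Note the use of non-capturing groups (?:) to group subexpressions so that ? can be applied to the whole group, matching it zero or one time.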

import re

pat = re.compile(r"((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?)")

corpus = "This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'"

for match in pat.findall(corpus):
    print(match)

This prints:

archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes

See it on regex101.
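
Why non-capturing groups matter here can be seen in a small sketch, using a simplified pattern and a made-up sample string rather than the answer's full regex. When a pattern contains a single capturing group, re.findall returns that group's text instead of the whole match, so an optional prefix written as a capturing group would hide the rest of the word. The answer's pattern avoids this by making the inner groups non-capturing and wrapping the whole expression in one outer capturing group.

```python
import re

# Hypothetical sample text for the demonstration.
text = "the meta-archipelagic historians"

# Capturing group: findall returns only what the group matched.
capturing = re.findall(r"(\w+-)?archipelagic", text)
print(capturing)       # ['meta-']

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r"(?:\w+-)?archipelagic", text)
print(non_capturing)   # ['meta-archipelagic']
```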