Python regex: tokenizing English contractions

4

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example, the tokenization of "shouldn't" would be ["should", "n't"].

The nltk module does not appear to be up to the task, however:

"I wouldn't've done that."

tokenizes to:

['I', "wouldn't", "'ve", 'done', 'that', '.']

whereas the desired tokenization of "wouldn't've" is: ['would', "n't", "'ve"]

After examining common English contractions, I am trying to write a regex to do the job, but I am having trouble figuring out how to match an "'ve" only once. For example, the following tokens can all terminate a contraction:

n't, 've, 'd, 'll, 's, 'm, 're

and the token "'ve" can also follow other contractions, such as:

'd've, n't've, and (conceivably) 'll've

At the moment, I am trying to wrangle this regex:

\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b

However, this pattern also matches the badly formed:

"wouldn't've've"

The problem seems to be that the third apostrophe qualifies as a word boundary, so that the final "'ve" token matches the whole regex on its own.

I have been unable to think of a way to differentiate a word boundary from an apostrophe, and, failing that, I am open to advice for alternative strategies.

Also, I am curious whether there is any way to include the word-boundary special character in a character class. According to the Python documentation, \b in a character class matches a backspace, and there does not appear to be any way around this.
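A quick illustration of that documentation point (a minimal check, not from the original post): inside a character class, `\b` is the backspace character `\x08`, not a word boundary:

```python
import re

# Inside a character class, \b means the backspace character (\x08),
# not a word boundary, so it only matches a literal backspace.
print(re.findall(r"[\b]", "a\x08b"))
# -> ['\x08']
```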

EDIT:

Here is the output:

>>> pattern = re.compile(r"\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b")
>>> matches = pattern.findall("She'll wish she hadn't've done that.")
>>> print(matches)
[("'ll", '', ''), ("n't", "'ve", ''), ('', '', "'ve")]

I am unable to understand the third match. In particular, I had figured that if the third apostrophe matches the leading \b, then I don't see what, if anything, would be left to match the character class [a-zA-Z]+.
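A minimal sketch of what is going on (assuming the pattern above): the top-level `|` splits the regex into two independent alternatives, so the `\b[a-zA-Z]+` prefix belongs only to the first one. The second alternative, `('s|'m|'re|'ve)\b`, can therefore match a bare "'ve" with no word characters in front of it at all:

```python
import re

pattern = re.compile(r"\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b")

# The second alternative matches a bare "'ve" on its own
# (the first alternative fails because "I" is not followed by 'd/'ll/n't):
print(pattern.findall("I've"))
# -> [('', '', "'ve")]

# Which is also why the badly formed string is consumed as two matches,
# the trailing "'ve" being picked up by the second alternative alone:
print(pattern.findall("wouldn't've've"))
# -> [("n't", "'ve", ''), ('', '', "'ve")]
```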

5 Answers

3
(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])

EDIT: \2 is the match, \3 is the first group, \4 the second, and \5 the third.
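A hedged usage sketch (not from the answer itself; the sample sentences are the ones from the comments below). Python's `re` supports the `(?(1)...)` conditional, and group 2 carries the whole matched word plus its contraction suffix(es):

```python
import re

# Group 2 is the entire word including its contraction suffix(es).
pattern = re.compile(
    r"""(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])"""
)

print([m.group(2) for m in pattern.finditer("She'll wish she hadn't've done that.")])
# -> ["She'll", "hadn't've"]

# The badly formed chain is rejected outright, because the trailing
# (?!['"\w]) forbids another apostrophe after the matched suffix:
print([m.group(2) for m in pattern.finditer("She'll wish she hadn't've've done that.")])
# -> ["She'll"]
```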


Thanks. However, this gets confused on "She'll wish she hadn't've've done that." and also sometimes returns many extraneous groups. - Schemer
Can you provide some examples so we know what to test? I edited my code so it works for some examples of mine as well as yours. Demo: https://regex101.com/r/iV4cX6/1 - AMDcze
Your lookarounds led me to this regex: \b(?<!')[a-zA-Z]+('s|'m|'re|'ve)|(?:('ll|'d|n't)('ve)?)(?!')\b, which solves the problem for now. The apostrophe gets matched as a word boundary, but so do the beginning and end of 've. Also, I might have died and gone to hell before I noticed the mismatched parentheses. Thanks! - Schemer
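For reference, a minimal sketch (not from the thread) of that final regex in action with `re.findall`; it yields the contraction suffixes as groups, and the badly formed chain no longer matches anywhere:

```python
import re

pattern = re.compile(r"\b(?<!')[a-zA-Z]+('s|'m|'re|'ve)|(?:('ll|'d|n't)('ve)?)(?!')\b")

# The suffixes are captured as groups 2 and 3:
print(pattern.findall("I wouldn't've done that."))
# -> [('', "n't", "'ve")]

# The trailing (?!') makes the badly formed chain fail entirely:
print(pattern.findall("wouldn't've've"))
# -> []
```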

3
You can split on the following combined regex:
import re
patterns_list = [r'\s',r'(n\'t)',r'\'m',r'(\'ll)',r'(\'ve)',r'(\'s)',r'(\'re)',r'(\'d)']
pattern=re.compile('|'.join(patterns_list))
s="I wouldn't've done that."

print([i for i in pattern.split(s) if i])

Result:

['I', 'would', "n't", "'ve", 'done', 'that.']

1
Thanks. But this also matches the badly formed "wouldn't've've", which I would like to ignore. - Schemer

2
>>> import nltk
>>> nltk.word_tokenize("I wouldn't've done that.")
['I', "wouldn't", "'ve", 'done', 'that', '.']

So:
>>> from itertools import chain
>>> [nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]
[['I'], ['would', "n't"], ["'ve"], ['done'], ['that'], ['.']]
>>> list(chain(*[nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]))
['I', 'would', "n't", "'ve", 'done', 'that', '.']

1
You can tokenize the text with this regex:

(?:(?!.')\w)+|\w?'\w+|[^\s\w]

Usage:

>>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
['I', 'would', "n't", "'ve", 'done', 'that', '.']

1
Thanks. But this pattern does not rule out the badly formed "wouldn't've've". - Schemer

0

Here is a simple example:

# Sample input added so the snippet runs standalone:
text = "I wouldn't've done that."
text = ' ' + text.lower() + ' '
text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
    .replace("'s ", ' is ').replace("'m ", ' am ') \
    .replace("'ll ", ' will ').replace("'d ", ' would ') \
    .replace("'re ", ' are ').replace("'ve ", ' have ')
# Chained contractions only partially expand, since "n't " requires a
# trailing space:
print(text)  # prints: " i wouldn't have done that. "
