Regular expression for splitting words in Python

Question

25

I am trying to design a regular expression that separates all the actual words from a given text:

Input example:

"John's mom went there, but he wasn't there. So she said: 'Where are you'"

Expected output:

["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]

I came up with a regex like this:

"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"

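After splitting with it in Python, the result contains None items and spaces.

How do I get rid of the None items? And why are the spaces not matched?

Edit:

Splitting on spaces gives items like ["there."].

Splitting on non-letters gives items like ["John","s"].

Splitting on non-letters except ' gives items like ["'Where","you'"].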

- Betamoo

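Why does it have to be split rather than findall? - Chris Wesseling

It is much simpler here to define what you want to match: findall with r"[a-zA-Z]+(?:'[a-z])?" does the job. So I am really curious why you would want a split. - Chris Wesseling

Another update that fixes a bug. It can now capture single letters that begin or end with an apostrophe. - FallenAngel

@ChrisWesseling Yes, I think that would be much easier, thanks! - Betamoo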
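
The findall approach suggested in the comments can be sketched as follows. This is a minimal illustration that uses the pattern quoted in the comment and the input from the question; the variable names are chosen only for this example.

```python
import re

# Input sentence from the question.
text = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# Pattern quoted in the comment: a run of letters, optionally followed by
# an apostrophe and one lowercase letter (covers endings such as 's and 't).
word_pattern = re.compile(r"[a-zA-Z]+(?:'[a-z])?")

# The group is non-capturing, so findall returns the whole matches,
# and no None or empty items can appear in the result.
words = word_pattern.findall(text)
print(words)
```

Because the pattern describes what a word looks like rather than what separates words, the surrounding quote marks and punctuation are simply never matched.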

4 Answers

9

Your regex has too many capturing groups; make them non-capturing:

(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)

Demo:

>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
>>> re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', '']

This returns only one empty element, at the end of the list.
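
To see why the capturing groups matter, here is a small sketch that compares the two patterns. re.split includes the text of every capturing group in its result, and a group that did not take part in a match contributes None. Dropping the leftover empty string with a list comprehension is an extra step added for this illustration, not part of the answer above.

```python
import re

s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

# Pattern from the question (capturing groups) and from this answer
# (the same alternatives, rewritten with non-capturing groups).
capturing = r"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
non_capturing = r"(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)"

# With capturing groups, split interleaves group contents (or None for
# groups that did not participate) between the pieces of text.
with_groups = re.split(capturing, s)

# With non-capturing groups, only the pieces of text are returned.
without_groups = re.split(non_capturing, s)

# Drop empty strings, such as the one left after the final separator.
words = [w for w in without_groups if w]
print(words)
```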

- Martijn Pieters

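I considered this simplification before, but the problem is that something like "She said 'Go'" would result in [..., "'Go'"], which is not correct. - Betamoo

@Betamoo: Yes, adjusted; I now see what your current expression produces (a parenthesis was missing). - Martijn Pieters

@VishalSuthar: Sorry, but your edit was a bad one. "non-capturing" is an ordinary word and does not need to be rendered as code. - Martijn Pieters

I just ran into a completely different problem, but changing ? to ?: solved that too! Thanks. - Felipe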

2

This regex allows at most one trailing apostrophe, optionally followed by a single character:

([\w][\w]*'?\w?)

Demo:

>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
>>> re.compile(r"([\w][\w]*'?\w?)").findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', "a'"]

- rileyteige

0

I am new to Python, but I think I have figured it out:

import re
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
result = re.findall(r"(.+?)[\s'\",!]{1,}", s)
print(result)

Result: ['John', 's', 'mom', 'went', 'there', 'but', 'he', 'wasn', 't', 'there.', 'So', 'she', 'said:', 'Where', 'are', 'you']

- user3464029

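Your answer splits John's into John and s, and keeps the colon in said:. I appreciate your contribution, but this question already has answers that fully meet the requirements; consider helping with other questions that do not yet have a good answer. Still, thank you very much for contributing. - Rolv Apneseth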

Content provided by Stack Overflow.

- FallenAngel · Accepted Answer

You can use string functions instead of a regex:

to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, '')
s.split()

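However, in your example you do not want to remove the apostrophe in John's, but you do want to remove the one in you!!'. String operations fail at this point, and you need a finely tuned regex.

Edit: a simple regex may solve your problem: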
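
The limitation described above can be reproduced with a short sketch. The helper remove_chars is a hypothetical name introduced only for this illustration; it wraps the same replace loop shown earlier.

```python
def remove_chars(text, chars):
    """Return the words of text after deleting every character in chars."""
    for c in chars:
        text = text.replace(c, "")
    return text.split()

s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

# Keeping apostrophes preserves John's but leaves the quote marks attached.
keep_apostrophes = remove_chars(s, ".,:!")

# Removing apostrophes cleans up the quote marks but also damages John's.
drop_apostrophes = remove_chars(s, ".,:!'")

print(keep_apostrophes[-3:])
print(drop_apostrophes[0])
```

Neither character set gives the expected output, because a plain replace cannot tell an apostrophe inside a word from one used as a quotation mark.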

(\w[\w']*)

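It captures everything that starts with a word character (\w matches letters, digits, and the underscore) and keeps capturing as long as the next character is an apostrophe or a word character.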

(\w[\w']*\w)

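This second regex is for a very specific situation. The first regex can capture words like you', with the trailing apostrophe. The second one avoids that and captures an apostrophe only when it is inside the word (not at the beginning or the end). That raises another case, though: with the second regex you cannot capture the trailing apostrophe of a possessive name that ends in s, as in Moss' mom. You have to decide whether or not to capture the trailing apostrophe in such names.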
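
The trade-off between the two patterns can be seen in a small sketch. The sample sentence is made up for this illustration, and the outer capturing parentheses are dropped because findall returns the whole match when a pattern has no groups.

```python
import re

# Made-up sentence with a possessive ending in s and a quoted word.
text = "Moss' mom asked 'you'"

# First pattern: an apostrophe may appear anywhere after the first character,
# so trailing apostrophes are kept (wanted for Moss', unwanted for you').
allow_trailing = re.findall(r"\w[\w']*", text)

# Second pattern: the match must end in a word character,
# so trailing apostrophes are dropped in both cases.
inner_only = re.findall(r"\w[\w']*\w", text)

print(allow_trailing)  # ["Moss'", 'mom', 'asked', "you'"]
print(inner_only)      # ['Moss', 'mom', 'asked', 'you']
```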

Example:

import re
rgx = re.compile(r"([\w][\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']

Update 2: I found a bug in my regex! It cannot capture a single letter followed by an apostrophe, such as A'. Here is the fixed regex:

(\w[\w']*\w|\w)

rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
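
One point worth keeping in mind with the patterns above is that \w also matches digits and the underscore. If only ASCII letters should count as part of a word, a letters-only variant is one option. The second pattern below is an assumption made for illustration and is not part of the original answer; the sample text is made up as well.

```python
import re

# Made-up sample containing an underscore, digits, and a contraction.
text = "route_66 isn't 2 far"

# Pattern from the answer: \w accepts letters, digits, and underscores.
w_based = re.findall(r"\w[\w']*\w|\w", text)

# Hypothetical letters-only variant: starts with a letter and, if longer
# than one character, must also end with a letter.
letters_only = re.findall(r"[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?", text)

print(w_based)       # ['route_66', "isn't", '2', 'far']
print(letters_only)  # ['route', "isn't", 'far']
```

Which behavior is right depends on whether tokens such as route_66 or 2 should count as words in the text being processed.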