Regular expression to extract a number before/after a word

Question

5

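I have 10,000 descriptions and I want to use a regular expression to extract the number associated with the word "arrests" or "arrested".

For example: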

"police arrests 4 people"
"7 people were arrested".

The numbers range from 1 to 99.

I tried the following code:

gen arrest= regexm(description, "(^[1-9][0-9]$)[ ]*(arrests|arrested)")

I cannot simply extract every number, because the descriptions also mention numbers that have nothing to do with arrests.

- serpentina

3 Answers

2

The following works for me (based on @PoulBak's idea):

clear

input strL var1
"This is 1 long string saying that police arrests 4 people"
"3 news outlets today reported that 7 people were arrested"
"several witnesses saw 5 people arrested and other 3 killed"
end

generate var2 = ustrregexs(0) if ustrregexm(var1, "(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))")

list

   +-------------------------------------------------------------------------------------+
   |                                                       var1                     var2 |
   |-------------------------------------------------------------------------------------|
1. |  This is 1 long string saying that police arrests 4 people                arrests 4 |
2. |  3 news outlets today reported that 7 people were arrested   7 people were arrested |
3. | several witnesses saw 5 people arrested and other 3 killed        5 people arrested |
   +-------------------------------------------------------------------------------------+
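
Note that var2 above holds the whole match (ustrregexs(0)), not the number itself. As a rough illustration of pulling out only the number, here is a minimal Python sketch using the same pattern. The function name arrest_count is made up for this example, and the approach assumes that exactly one of the two capture groups is filled on any match. In Stata, reading subexpressions 1 and 2 with ustrregexs() should presumably give the same result.

```python
import re

# Same pattern as in the Stata call above; group 1 holds a number that
# precedes the keyword, group 2 a number that follows it.
PATTERN = re.compile(
    r"(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))"
    r"|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))"
)


def arrest_count(description):
    """Return the number tied to the keyword as an int, or None if no match."""
    match = PATTERN.search(description)
    if match is None:
        return None
    # Only one of the two groups participates in any given match.
    return int(match.group(1) or match.group(2))


descriptions = [
    "This is 1 long string saying that police arrests 4 people",
    "3 news outlets today reported that 7 people were arrested",
    "several witnesses saw 5 people arrested and other 3 killed",
]
counts = [arrest_count(d) for d in descriptions]
print(counts)  # [4, 7, 5]
```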

- user8682794

2

Maybe something like this?

(\d+)[^,.\d\n]+?(?=arrest|custody)|(?<=arrest|custody)[^,.\d\n]+?(\d+)

Regex101

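Note that this will not match numbers written out as words (e.g. "five people were arrested"). If you need that, you will have to incorporate it.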

Breaking down the pattern

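(\d+)[^,.\d\n]+?(?=arrest|custody) is the first alternative, for when the number comes before the keyword
- (\d+) the number to capture; + means one or more digits
- [^,.\d\n]+? matches anything except a comma, a period, a digit \d, or a newline \n. These exclusions prevent false positives that span different sentences (the number and the keyword must be in the same sentence). +? means one or more times (lazy)
- (?=arrest|custody) positive lookahead that checks for either keyword

(?<=arrest|custody)[^,.\d\n]+?(\d+) is the second alternative, for when the number comes after the keyword
- (?<=arrest|custody) positive lookbehind that checks whether the keyword precedes the number
- [^,.\d\n]+? same as above: anything except a comma, a period, a digit, or a newline, one or more times (lazy)
- (\d+) the number to capture; + means one or more digits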
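
To try the breakdown above outside Regex101, here is a minimal Python sketch. One adaptation is an assumption on my part: as far as I know, Python's built-in re module only accepts fixed-width lookbehinds, so (?<=arrest|custody) is split into one lookbehind per keyword. The function name number_near_keyword is invented for this example.

```python
import re

# Python's re module needs fixed-width lookbehinds, so the alternation
# inside the lookbehind is rewritten as two separate lookbehinds.
PATTERN = re.compile(
    r"(\d+)[^,.\d\n]+?(?=arrest|custody)"
    r"|(?:(?<=arrest)|(?<=custody))[^,.\d\n]+?(\d+)"
)


def number_near_keyword(text):
    """Return the captured number as an int, or None when nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # Group 1 is filled when the number precedes the keyword, group 2 otherwise.
    return int(match.group(1) or match.group(2))


print(number_near_keyword("police arrests 4 people"))      # 4
print(number_near_keyword("7 people were arrested"))       # 7
print(number_near_keyword("He is 30. Two were arrested"))  # None
```

The last call returns None because the period after "30" is excluded by the character class, which is the same-sentence safeguard described in the breakdown.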

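Other notes

If you want to support numbers written out as words, add them to the (\d+) capture groups.

If you have additional terms to look for beyond arrest or custody, add them to both lookaround groups.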
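
A rough Python sketch of both extensions follows. Everything specific in it is an assumption for illustration: the NUMBER_WORDS table (only one to five), the extra keyword "detain", the \b word boundaries (added so that "one" inside "someone" is not picked up), and the per-keyword lookbehinds that work around the fixed-width lookbehind rule in Python's re module.

```python
import re

# Hypothetical lookup table; extend it to cover the range you need.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
# "detain" is an illustrative extra term, not part of the original answer.
KEYWORDS = ["arrest", "custody", "detain"]

number = r"\b(\d+|" + "|".join(NUMBER_WORDS) + r")\b"
lookahead = "(?=" + "|".join(KEYWORDS) + ")"
# One fixed-width lookbehind per keyword.
lookbehind = "(?:" + "|".join("(?<=" + k + ")" for k in KEYWORDS) + ")"

PATTERN = re.compile(
    number + r"[^,.\d\n]+?" + lookahead
    + "|" + lookbehind + r"[^,.\d\n]+?" + number,
    re.IGNORECASE,
)


def count_near_keyword(text):
    """Return the number (digits or number word) as an int, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    token = match.group(1) or match.group(2)
    return int(token) if token.isdigit() else NUMBER_WORDS[token.lower()]


print(count_near_keyword("Five people were arrested"))       # 5
print(count_near_keyword("police detained three suspects"))  # 3
print(count_near_keyword("someone was arrested"))            # None
```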

- K.Dᴀᴠɪs

- Poul Bak · Accepted Answer

You can use this regex:

(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))

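It uses alternation to split the search into two cases, depending on whether the number comes before or after "arrests|arrested".

The first case is a non-capturing group that captures an optional digit 1-9 followed by a digit 0-9. It then matches 0-20 letters and spaces (the intervening words) until it reaches "arrests" or "arrested". This is ORed with the opposite case, where the number comes last.

A match succeeds only if the number is within 20 characters of "arrests|arrested" and nothing but letters and spaces lies in between.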
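
To see the 20-character window in action, here is a small Python sketch with the same pattern. The sample sentences and the helper name window_match are made up for this example. The last case is an observation from running the pattern, not something the answer states: nothing in the pattern stops it from matching the tail of a longer number.

```python
import re

PATTERN = re.compile(
    r"(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))"
    r"|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))"
)


def window_match(text):
    """Return the whole matched span, or None if the pattern does not match."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


# 13 letters/spaces between the number and the keyword: inside the window.
print(window_match("7 people were arrested"))
# 40 characters in between: outside the window, so no match.
print(window_match("7 people from the northern district were arrested"))
# A comma is not in [a-zA-Z ], so it also breaks the match.
print(window_match("7 people, were arrested"))
# Only the trailing "0" of "100" is picked up.
print(window_match("100 people arrested"))
```

If partial matches such as the last one matter for your data, one option (an assumption, not part of the answer) is to add a \b word boundary on each side of the number groups.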