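I have 10,000 descriptions, and I want to use a regular expression to extract the number associated with the phrase "arrested".
For example:
"police arrests 4 people"
"7 people were arrested".
The numbers range from 1 to 99. I tried the following code: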
gen arrest= regexm(description, "(^[1-9][0-9]$)[ ]*(arrests|arrested)")
I cannot simply extract every number, because the descriptions also mention numbers that have nothing to do with arrests.
(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))
clear
input strL var1
"This is 1 long string saying that police arrests 4 people"
"3 news outlets today reported that 7 people were arrested"
"several witnesses saw 5 people arrested and other 3 killed"
end
generate var2 = ustrregexs(0) if ustrregexm(var1, "(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))")
list
     +-------------------------------------------------------------------------------------+
     |                                                       var1                     var2 |
     |-------------------------------------------------------------------------------------|
  1. |  This is 1 long string saying that police arrests 4 people                arrests 4 |
  2. |  3 news outlets today reported that 7 people were arrested   7 people were arrested |
  3. | several witnesses saw 5 people arrested and other 3 killed        5 people arrested |
     +-------------------------------------------------------------------------------------+
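The listing above stores the whole match (ustrregexs(0)), for example "arrests 4", rather than the number alone. As a rough cross-check outside Stata, here is a minimal Python sketch that applies the same pattern and returns only the number from whichever capture group took part in the match. The function name arrest_count is hypothetical, and Python's re module is not the engine Stata uses, so treat this as an illustration of the pattern rather than of Stata's behavior.

```python
import re

# Same pattern as above: group 1 is a number before the keyword,
# group 2 is a number after it.
PATTERN = re.compile(
    r"(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))"
    r"|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))"
)

def arrest_count(text):
    """Return the number tied to 'arrests'/'arrested', or None if absent."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Only one of the two groups participates in any given match.
    return int(m.group(1) or m.group(2))

rows = [
    "This is 1 long string saying that police arrests 4 people",
    "3 news outlets today reported that 7 people were arrested",
    "several witnesses saw 5 people arrested and other 3 killed",
]
for row in rows:
    print(arrest_count(row))
```

The {0,20} bound is what keeps the unrelated "1" and "3" at the start of the first two rows from being picked up: they sit more than 20 letters and spaces away from the keyword.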
(\d+)[^,.\d\n]+?(?=arrest|custody)|(?<=arrest|custody)[^,.\d\n]+?(\d+)
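Note that this will not match numbers written out as words (e.g. "five people were arrested"). If you need that, you will have to incorporate it yourself.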
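First option, for when the number comes before the keyword: (\d+)[^,.\d\n]+?(?=arrest|custody)
  (\d+) : the number to capture; + means one or more digits
  [^,.\d\n]+? : matches anything except a comma (,), a period (.), a digit (\d), or a newline (\n), one or more times, lazily (+?). Excluding these characters prevents false positives that span sentences: the number and the keyword must be in the same sentence.
  (?=arrest|custody) : positive lookahead that checks that one of the keywords follows

Second option, for when the number comes after the keyword: (?<=arrest|custody)[^,.\d\n]+?(\d+)
  (?<=arrest|custody) : positive lookbehind that checks that one of the keywords comes before the number
  [^,.\d\n]+? : matches anything except a comma (,), a period (.), a digit (\d), or a newline (\n), one or more times, lazily (+?). As above, this keeps the match within one sentence.
  (\d+) : the number to capture; + means one or more digits

If you want to match numbers written out as words, incorporate them into the (\d+) capture group.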
If you care about additional terms beyond arrest and custody, add them to both lookaround groups.
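To try the lookaround pattern quickly, here is a minimal Python sketch that builds it from a keyword list, so that extra terms end up in both lookarounds. The names build_pattern and keyword_number are hypothetical. One adaptation is needed: Python's re module only accepts fixed-width lookbehind, so (?<=arrest|custody) is rewritten as one lookbehind per keyword; other engines may accept the original form.

```python
import re

def build_pattern(keywords=("arrest", "custody")):
    """Compile the two-branch pattern for the given keywords."""
    escaped = [re.escape(k) for k in keywords]
    ahead = "(?=" + "|".join(escaped) + ")"
    # Python's re needs fixed-width lookbehind, so each keyword gets its
    # own lookbehind instead of a single (?<=arrest|custody).
    behind = "(?:" + "|".join("(?<=" + k + ")" for k in escaped) + ")"
    gap = r"[^,.\d\n]+?"  # stays within one sentence, never crosses a digit
    return re.compile(r"(\d+)" + gap + ahead + "|" + behind + gap + r"(\d+)")

def keyword_number(text, pattern=build_pattern()):
    """Return the number linked to a keyword, or None if there is none."""
    m = pattern.search(text)
    if m is None:
        return None
    # Group 1 is set by the first branch, group 2 by the second.
    return int(m.group(1) or m.group(2))

print(keyword_number("police arrests 4 people"))
print(keyword_number("He is 30. Two were arrested"))

# Adding a term extends both lookarounds at once.
wider = build_pattern(("arrest", "custody", "detain"))
print(keyword_number("officers detained 12 suspects", wider))
```

Because the keywords are matched as prefixes, "arrest" also covers "arrests" and "arrested". The second call finds no number because the period stops the match from reaching back across sentences.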