How do I extract the first number from each string in a vector in R?

4

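I am new to regular expressions in R. I have a vector, and I want to extract the first number that occurs in each of its strings.

I have a vector called "shootsummary" that looks like this: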

> head(shootsummary)
[1] Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police.                                         
[2] Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him.                           
[3] John Zawahri, 23, armed with a homemade assault rifle and high-capacity magazines, killed his brother and father at home and then headed to Santa Monica College, where he was eventually killed by police.      
[4] Dennis Clark III, 27, shot and killed his girlfriend in their shared apartment, and then shot two witnesses in the building's parking lot and a third victim in another apartment, before being killed by police.
[5] Kurt Myers, 64, shot six people in neighboring towns, killing two in a barbershop and two at a car care business, before being killed by officers in a shootout after a nearly 19-hour standoff.  

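The first number is the person's "age", and I want to extract the age from each string without mixing it up with the other numbers in the string.
I used: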
as.numeric(gsub("\\D", "", shootsummary))

which resulted in:
[1]  34128     42     23     27   6419  

I would like a result that contains only the age from each sentence, without the other numbers that appear after it:
[1]  34     42     23     27   64

Suppose one of the vector elements contains no number; what would you want returned? In my solution it returns NA. - akrun
7 Answers

3

stringi would be faster:

library(stringi)
stri_extract_first(shootsummary, regex="\\d+")
#[1] "34" "42" "23" "27" "64"
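
As a follow-up to the call above, here is a minimal sketch of how the result could be turned into numbers, and what happens when an element contains no digits. The sample vector `x` is made up for illustration and is not the asker's data; `stri_extract_first_regex()` is the explicit-engine form of the call shown above.

```r
library(stringi)

# Hypothetical sample data; the second element deliberately has no digits.
x <- c("Aaron Alexis, 34, opened fire, killing 12 people and wounding 8",
       "no digits in this one",
       "Kurt Myers, 64, shot six people")

# First run of digits per element, returned as character
# (NA where the pattern does not match).
first_num_chr <- stri_extract_first_regex(x, "\\d+")

# Convert the character result to numeric ages.
ages <- as.numeric(first_num_chr)
ages
#> [1] 34 NA 64
```

The result keeps the same length as the input, so it can be stored next to the original text without realigning anything.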

Thanks akrun, I installed 'stringi' and the code ran successfully. Thank you for the quick reply. - user3563667

2

One option is str_extract from stringr, combined with a conversion via as.numeric.

> library(stringr)
> as.numeric(str_extract(shootsummary, "[0-9]+"))
# [1] 34 42 23 27 64

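Update: In response to your question in the comments on this answer, here is a brief explanation. A full explanation of each function can be found in its help file.

  • str_extract returns the first match of the regular expression. It is vectorized over the character vector in its first argument.
  • The regular expression [0-9]+ matches any character in the range '0' to '9', one or more times.
  • as.numeric converts the resulting character vector to a numeric vector.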
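
The three points above can be followed one step at a time. This is only a sketch on made-up input (`x` is hypothetical), with `str_extract_all()` added for contrast to show what "first match" leaves out.

```r
library(stringr)

# Hypothetical input: each element contains more than one number.
x <- c("Dennis Clark III, 27, shot two witnesses in lot 3",
       "Room 7, floor 12")

# Step 1: str_extract() keeps only the first match in each element.
matched <- str_extract(x, "[0-9]+")
matched
#> [1] "27" "7"

# For contrast, str_extract_all() returns every match, as a list.
all_matches <- str_extract_all(x, "[0-9]+")

# Step 2: the matches are still character strings; convert them.
ages <- as.numeric(matched)
ages
#> [1] 27  7
```

Note that "III" in the first string is not matched, because the character class [0-9] covers digits only.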

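Thanks Richard, your code works too. Could you tell me how it works? I am a complete beginner with regular expressions and only know very basic, simple code. Thanks in advance. - user3563667
Thanks for showing this in stringr. I have been meaning to use stringr, and you helped me get started with it :) - user3563667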

2
You could try the following sub command:
> test
[1] "Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police."              
[2] "Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him."
> sub("^\\D*(\\d+).*$", "\\1", test)
[1] "34" "42"

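Pattern explanation:

  • ^ asserts that we are at the start of the line.
  • \D* matches zero or more non-digit characters.
  • (\d+) then captures the following one or more digits into group 1 (the first number).
  • .* matches any character zero or more times.
  • $ asserts that we are at the end of the line.
  • Finally, all the matched characters are replaced by the characters present in group 1.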
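
One thing worth knowing about the pattern above: when a string contains no digits at all, the pattern does not match and sub returns that string unchanged. A minimal sketch of one way to guard against this, assuming you would rather get NA in that case (the vector `x` is made up for illustration):

```r
# Hypothetical input; the second element has no digits.
x <- c("Pedro Vargas, 42, killed six people", "no digits here")

pattern <- "^\\D*(\\d+).*$"

# Without a match, sub() leaves the element as it was.
res <- sub(pattern, "\\1", x)
res
#> [1] "42"             "no digits here"

# Guard: keep the replacement only where the pattern actually matched.
res_safe <- ifelse(grepl(pattern, x), res, NA_character_)
ages <- as.numeric(res_safe)
ages
#> [1] 42 NA
```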

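Thanks Avinash, it worked like a charm. Since I am not familiar with regular expressions, could you help me understand more clearly what you did here? Thanks in advance. - user3563667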

1
How about this?
splitbycomma <- strsplit(shootsummary, ",")
as.numeric(  sapply(splitbycomma, "[", 2)  )
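
A small self-contained sketch of the same idea, on made-up input (`x` is hypothetical). It assumes the age is always the second comma-separated field, which holds for the sample strings in the question but would break if, say, a name itself contained a comma.

```r
# Hypothetical input in the same "Name, age, ..." layout as the question.
x <- c("Aaron Alexis, 34, a military veteran",
       "Kurt Myers, 64, shot six people")

# Split each string on literal commas; the result is a list of pieces.
parts <- strsplit(x, ",", fixed = TRUE)

# Take the second piece of each element (it still has a leading space).
second <- sapply(parts, `[`, 2)
second
#> [1] " 34" " 64"

# Trim the whitespace and convert to numeric.
ages <- as.numeric(trimws(second))
ages
#> [1] 34 64
```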

I guess I am doing something wrong here... the first line of code did not run for me. Let me tweak it and come back. Thanks, berry. - user3563667

1
You can use sub:
test <- ("xff 34 sfsdg 352 efsrg")

sub(".*?(\\d+).*", "\\1", test)
# [1] "34"

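How does the regex work?

The . matches any character. The quantifier * means any number of occurrences. The ? makes the preceding * lazy, so .*? matches as few characters as possible and stops at the first match of \\d (a digit). The quantifier + means one or more occurrences. The \\d+ in parentheses is the first match group. It may be followed by further characters (.*). The second argument (\\1) replaces the whole string with the first match group, i.e. the first number.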
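
To see what the ? contributes, here is a small comparison of the lazy and greedy versions of the pattern on the same test string. perl = TRUE is added in this sketch (it is not in the answer's code) so that both calls follow PCRE matching rules.

```r
test <- "xff 34 sfsdg 352 efsrg"

# Lazy: .*? consumes as little as possible, so the group gets the first number.
lazy <- sub(".*?(\\d+).*", "\\1", test, perl = TRUE)
lazy
#> [1] "34"

# Greedy: .* consumes as much as possible and backtracks only until
# \\d+ can match a single digit, which is the last digit in the string.
greedy <- sub(".*(\\d+).*", "\\1", test, perl = TRUE)
greedy
#> [1] "2"
```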


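Thanks Sven, your code is somewhat similar to Avinash's. I am curious how it works and would like to understand the concept... I am not yet familiar with regular expressions. Thanks in advance. - user3563667
@user3563667, I added an explanation. - Sven Hohenstein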

1

R's regmatches() function, combined with regexpr(), returns a vector containing the first regex match from each element:

regmatches(shootsummary, regexpr("\\d+", shootsummary, perl=TRUE))
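
One caveat with this approach: regmatches() used with regexpr() drops elements that have no match, so the result can be shorter than the input. A minimal sketch of one way to keep the original length, filling NA for non-matches (the vector `x` and the helper names `m`, `hit` and `first_num` are made up for illustration):

```r
# Hypothetical input; the second element has no digits.
x <- c("Aaron Alexis, 34, killing 12 people",
       "no digits here",
       "Kurt Myers, 64, shot six people")

# regexpr() gives the position of the first match, or -1 if there is none.
m <- regexpr("\\d+", x, perl = TRUE)

# regmatches() returns only the matched substrings (two here, not three).
found <- regmatches(x, m)

# Put the matches back in place, leaving NA where nothing matched.
hit <- as.integer(m) != -1L
first_num <- rep(NA_character_, length(x))
first_num[hit] <- found

ages <- as.numeric(first_num)
ages
#> [1] 34 NA 64
```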

Hi Tim, thanks. This seems like a simpler way to make use of R's packages and functions.. but it returns a string.. still, that's useful.. thanks - user3563667

0

You can do this with the str_first_number() function from the strex package or, if you need something more general, with str_nth_number().

pacman::p_load(strex)
shootsummary <- c("Aaron Alexis, 34, a military veteran and contractor ...",
                  "Pedro Vargas, 42, set fire to his apartment, killed six ...",
                  "John Zawahri, 23, armed with a homemade assault rifle ...",
                  "John Zawahri, 23, armed with a homemade assault rifle ...",
                  "Dennis Clark III, 27, shot and killed his girlfriend ...",
                  "Kurt Myers, 64, shot six people in neighboring ..."
)
str_first_number(shootsummary)
#> [1] 34 42 23 23 27 64
str_nth_number(shootsummary, n = 1)
#> [1] 34 42 23 23 27 64

Created on 2018-09-03 by the reprex package (v0.2.0).

