How to extract the first number from each string in a vector in R?

Question

4

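I am new to regular expressions in R. I have a vector, and I want to extract the first number that occurs in each of its strings.

I have a vector called "shootsummary" that looks like this: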

> head(shootsummary)
[1] Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police.                                         
[2] Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him.                           
[3] John Zawahri, 23, armed with a homemade assault rifle and high-capacity magazines, killed his brother and father at home and then headed to Santa Monica College, where he was eventually killed by police.      
[4] Dennis Clark III, 27, shot and killed his girlfriend in their shared apartment, and then shot two witnesses in the building's parking lot and a third victim in another apartment, before being killed by police.
[5] Kurt Myers, 64, shot six people in neighboring towns, killing two in a barbershop and two at a car care business, before being killed by officers in a shootout after a nearly 19-hour standoff.

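The first number in each string is the person's age. I want to extract the ages from these strings without picking up any of the other numbers they contain.

I used: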

as.numeric(gsub("\\D", "", shootsummary))

which gives:

[1]  34128     42     23     27   6419

I would like a result that contains only the age from each sentence, without the numbers that appear after it:

[1]  34     42     23     27   64

- user3563667

Suppose one of the vector elements contains no number: what do you want returned? My solution returns NA in that case. - akrun

7 Answers

2

One option is str_extract from the stringr package, followed by a conversion with as.numeric.

> library(stringr)
> as.numeric(str_extract(shootsummary, "[0-9]+"))
# [1] 34 42 23 27 64

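Update: to answer the question in the comments on this answer, here is a brief explanation. A full explanation of each function can be found in its help file.

str_extract returns the first match of a regular expression. It is vectorized over the character vector given as its first argument.
The regular expression [0-9]+ matches any character from '0' to '9', one or more times.
as.numeric converts the resulting character vector to a numeric vector.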

- Rich Scriven

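Thanks Richard, your code works too. Could you tell me how it works? I am a complete beginner with regular expressions and only know very basic, simple code. Thanks in advance. - user3563667

Thanks for showing this with stringr. I have always wanted to use stringr, and you have helped me get started :) - user3563667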

2

You could try the sub command below:

> test
[1] "Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police."              
[2] "Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him."
> sub("^\\D*(\\d+).*$", "\\1", test)
[1] "34" "42"

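Pattern explanation:

^ asserts that we are at the start of the line.
\D* matches zero or more non-digit characters.
(\d+) then captures the following one or more digits into group 1 (the first number).
.* matches any character zero or more times.
$ asserts that we are at the end of the line.
Finally, everything that was matched is replaced by the contents of group 1.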
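
One point the explanation leaves implicit is what happens when a string contains no digit at all: the pattern fails to match, so sub returns that string unchanged rather than NA. The sketch below illustrates this and one possible guard. The sample strings are shortened, hypothetical stand-ins for the real data, and the grepl/ifelse guard is an addition, not part of the answer above.

```r
# Hypothetical sample data; the third element has no digits at all.
test <- c("Aaron Alexis, 34, a military veteran, killing 12 people",
          "Pedro Vargas, 42, set fire to his apartment",
          "No age given here")

# Same pattern as above: replace the whole string with the first run of digits.
first_num <- sub("^\\D*(\\d+).*$", "\\1", test)
first_num
# The third element comes back unchanged because the pattern never matched.

# Guard (an assumption about the desired behavior): keep the captured digits
# only where the string contains a digit, otherwise use NA.
ages <- ifelse(grepl("\\d", test), first_num, NA_character_)
ages <- as.numeric(ages)
ages
```

With the guard in place, the result lines up with the input and a missing age shows up as NA instead of the original sentence.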

- Avinash Raj

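Thanks Avinash, it worked like magic. Since I am not familiar with regular expressions, could you help me understand more clearly what you did here? Thanks in advance. - user3563667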

1

How about this?

splitbycomma <- strsplit(shootsummary, ",")
as.numeric(  sapply(splitbycomma, "[", 2)  )
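
This approach relies on the age always being the second comma-separated field. A minimal sketch of what each step produces, and of the case where that assumption fails, might look like this. The two strings are hypothetical variations on the data in the question, and suppressWarnings is added only to keep the output quiet.

```r
# Hypothetical sample data; the second string has an extra comma before the age.
x <- c("Kurt Myers, 64, shot six people in neighboring towns",
       "Dennis Clark, III, 27, shot and killed his girlfriend")

fields <- strsplit(x, ",")          # list with one character vector per string
second <- sapply(fields, "[", 2)    # same as fields[[i]][2] for every i
second
# The fields keep their leading space; as.numeric() ignores surrounding whitespace.

# " III" cannot be parsed as a number, so it becomes NA (normally with a warning).
ages <- suppressWarnings(as.numeric(second))
ages
```

So the split works well when every string follows the "Name, age, ..." layout, and yields NA where it does not.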

- Berry Boessenkool

I guess I am doing something wrong here... the first line of code did not run for me. Let me tweak it and come back. Thanks berry. - user3563667

1

You can use sub:

test <- ("xff 34 sfsdg 352 efsrg")

sub(".*?(\\d+).*", "\\1", test)
# [1] "34"

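How does the regular expression work?

. matches any character. The quantifier * means any number of occurrences. The ? makes .* lazy, so it matches as few characters as possible and stops just before the first \\d (digit). The quantifier + means one or more occurrences. The \\d in parentheses is the first match group. It may be followed by further characters (.*). The second argument (\\1) replaces the whole string with the first match group, i.e. the first number.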
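
The effect of the ? is easiest to see by running the pattern with and without it. In this small sketch, perl = TRUE is an addition made here to pin the matching to the PCRE engine; the call above uses R's default engine.

```r
test <- "xff 34 sfsdg 352 efsrg"

# Lazy .*? gives up as soon as a digit can start the group: the first number.
lazy <- sub(".*?(\\d+).*", "\\1", test, perl = TRUE)
lazy

# Greedy .* swallows as much as it can and leaves the group a single digit,
# the last one in the string.
greedy <- sub(".*(\\d+).*", "\\1", test, perl = TRUE)
greedy
```

Dropping the ? therefore returns the final digit of the last number instead of the first number.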

- Sven Hohenstein

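Thanks Sven, your code is somewhat similar to Avinash's. I am curious how it works and would like to understand the concept... I am still new to regular expressions. Thanks in advance. - user3563667

@user3563667, I have added an explanation. - Sven Hohenstein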

1

R's regmatches() function, combined with regexpr(), returns a vector containing the first regex match from each element:

regmatches(shootsummary, regexpr("\\d+", shootsummary, perl=TRUE));
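
One thing to watch for with this approach: regmatches() drops elements that have no match, so the result can be shorter than the input, and it is a character vector. A possible way to keep the positions aligned and get numbers back is sketched below. The sample strings are hypothetical, and the NA pre-filling is an addition, not part of the answer above.

```r
# Hypothetical sample data; the second element has no digits.
x <- c("Kurt Myers, 64, shot six people",
       "no digits here",
       "John Zawahri, 23, armed with a homemade assault rifle")

m <- regexpr("\\d+", x, perl = TRUE)  # position of the first match, -1 if none
found <- regmatches(x, m)             # non-matching elements are dropped
found

# Pre-fill with NA, then write the matches back into the matching positions.
hit <- as.integer(m) != -1L
ages <- rep(NA_character_, length(x))
ages[hit] <- found
ages <- as.numeric(ages)
ages
```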

- Tim Pietzcker

Hi Tim, thanks. This seems to be a simpler way of using the power of R's packages and functions.. but it returns a string.. still, that is useful.. thanks - user3563667

0

You can use the str_first_number() function from the strex package for this, or str_nth_number() if you need something more general.

pacman::p_load(strex)
shootsummary <- c("Aaron Alexis, 34, a military veteran and contractor ...",
                  "Pedro Vargas, 42, set fire to his apartment, killed six ...",
                  "John Zawahri, 23, armed with a homemade assault rifle ...",
                  "John Zawahri, 23, armed with a homemade assault rifle ...",
                  "Dennis Clark III, 27, shot and killed his girlfriend ...",
                  "Kurt Myers, 64, shot six people in neighboring ..."
)
str_first_number(shootsummary)
#> [1] 34 42 23 23 27 64
str_nth_number(shootsummary, n = 1)
#> [1] 34 42 23 23 27 64

Created on 2018-09-03 by the reprex package (v0.2.0).

- Rory Nolan

- akrun · Accepted Answer

stringi would be faster:

library(stringi)
stri_extract_first(shootsummary, regex="\\d+")
#[1] "34" "42" "23" "27" "64"