Extract all regex matches into a new column in R

Question

3

My data has a column of open-text responses, similar to this example:

library(tidyverse)  # provides tribble(), %>%, mutate() and str_extract()

d <- tribble(
  ~x,
  "i am 10 and she is 50",
  "he is 32 and i am 22",
  "he may be 70 and she may be 99",
)

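I want to use a regular expression to extract every two-digit number from it into a new column called y. My code below successfully extracts the first match: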

d %>%
  mutate(y = str_extract(x, "([0-9]{2})"))

# A tibble: 3 x 2
  x                              y    
  <chr>                          <chr>
1 i am 10 and she is 50          10   
2 he is 32 and i am 22           32   
3 he may be 70 and she may be 99 70

However, is there a way to extract both two-digit numbers into the same column, separated by a standard delimiter such as a comma?

- Trent

This post should help: https://dev59.com/arXna4cB1Zd3GeqPFiA8. To clarify, do you *only* want to extract two-digit numbers? - camille

2 Answers

3

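We can use str_extract_all instead of str_extract. str_extract returns only the first match in each string, whereas the _all variant is global: it returns every match in a list column, which unnest_wider can then spread into two columns.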

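library(dplyr)
library(tidyr)
library(stringr)
d %>%
    # list column holding every two-digit match per row
    mutate(out = str_extract_all(x, "\\d{2}")) %>%
    # one column per match (out_1, out_2); recent tidyr releases raise an
    # error for unnamed elements unless names_sep is supplied
    unnest_wider(out, names_sep = "_") %>%
    rename(y = out_1, z = out_2) %>%
    # convert the digit strings to integers
    type.convert(as.is = TRUE)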
# A tibble: 3 x 3
# x                                  y     z
#  <chr>                          <int> <int>
#1 i am 10 and she is 50             10    50
#2 he is 32 and i am 22              32    22
#3 he may be 70 and she may be 99    70    99
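The difference between the two functions can be seen directly on a plain character vector. This is a small sketch, not part of the answer above; the vector `x` and the result names are made up for illustration.

```r
library(stringr)

x <- c("i am 10 and she is 50", "he is 32 and i am 22")

# str_extract() returns a character vector: the first match per string
first_match <- str_extract(x, "\\d{2}")

# str_extract_all() returns a list: one character vector of matches per string
all_matches <- str_extract_all(x, "\\d{2}")

# simplify = TRUE returns a character matrix (one row per string) instead
match_matrix <- str_extract_all(x, "\\d{2}", simplify = TRUE)
```

The list form is what makes the extra reshaping step (unnest_wider, or joining with a separator) necessary before the result fits in an ordinary column.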

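If we need a single string column with , as the delimiter, then after extracting into a list we can loop over the list with map and use toString (a wrapper for paste(., collapse=", ")) to join all the elements into one string.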

library(purrr)
d %>%
   mutate(y = str_extract_all(x, "\\b\\d{2}\\b") %>%
                 map_chr(toString))
# A tibble: 3 x 2
#  x                              y     
#  <chr>                          <chr> 
#1 i am 10 and she is 50          10, 50
#2 he is 32 and i am 22           32, 22
#3 he may be 70 and she may be 99 70, 99
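As a rough check of the two details in this snippet, the sketch below compares toString with an explicit paste call and shows what the \\b word boundaries change. The input strings are invented for illustration, including one row without any number.

```r
library(stringr)
library(purrr)

matches <- str_extract_all(
  c("i am 10 and she is 50", "no numbers here"),
  "\\b\\d{2}\\b"
)

# toString() and paste(collapse = ", ") produce the same strings;
# a row without matches becomes an empty string
joined_tostring <- map_chr(matches, toString)
joined_paste    <- map_chr(matches, ~ paste(.x, collapse = ", "))

# Without \\b, "\\d{2}" also matches two digits inside a longer number
loose  <- str_extract_all("she is 105", "\\d{2}")[[1]]
strict <- str_extract_all("she is 105", "\\b\\d{2}\\b")[[1]]
```

If only standalone two-digit numbers should be captured, the \\b version is the safer pattern; the plain "\\d{2}" pattern behaves the same only when the text contains no longer numbers.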

- akrun

I just tried your code and it says could not find function "unnest_wider". - Trent
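One possible cause, which the thread does not confirm, is an older tidyr installation: unnest_wider was introduced in tidyr 1.0.0. A minimal sketch for checking this, assuming tidyr is installed (the variable names are illustrative):

```r
# Assumption: the error comes from a tidyr release older than 1.0.0,
# the version that introduced unnest_wider()
tidyr_version    <- utils::packageVersion("tidyr")
new_enough       <- tidyr_version >= "1.0.0"

# Direct check: is the function exported by the installed tidyr?
has_unnest_wider <- "unnest_wider" %in% getNamespaceExports("tidyr")
```

If has_unnest_wider is FALSE, updating tidyr (for example with install.packages("tidyr")) should make the function available.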

- acylam · Accepted Answer

We can also use extract and unite from tidyr:

library(dplyr)
library(tidyr)

d %>%
  extract(x, c('y', 'z'), regex = "(\\d+)[^\\d]+(\\d+)", remove = FALSE)

Output:

# A tibble: 3 x 3
  x                              y     z    
  <chr>                          <chr> <chr>
1 i am 10 and she is 50          10    50   
2 he is 32 and i am 22           32    22   
3 he may be 70 and she may be 99 70    99

To return a single column:

d %>%
  extract(x, c('y', 'z'), regex = "(\\d+)[^\\d]+(\\d+)", remove = FALSE) %>%
  unite('y', y, z, sep = ', ')

Output:

# A tibble: 3 x 2
  x                              y     
  <chr>                          <chr> 
1 i am 10 and she is 50          10, 50
2 he is 32 and i am 22           32, 22
3 he may be 70 and she may be 99 70, 99
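Note that the extract pattern has exactly two capture groups, so it assumes two numbers per row. For comparison, here is a base R sketch that handles any number of matches without tidyverse packages; it is not taken from either answer, and the data frame simply mirrors the example data from the question.

```r
d <- data.frame(
  x = c("i am 10 and she is 50",
        "he is 32 and i am 22",
        "he may be 70 and she may be 99"),
  stringsAsFactors = FALSE
)

# gregexpr() locates every standalone two-digit number;
# regmatches() returns them as a list with one character vector per row
matches <- regmatches(d$x, gregexpr("\\b[0-9]{2}\\b", d$x, perl = TRUE))

# Join each row's matches into one comma-separated string
d$y <- vapply(matches, paste, character(1), collapse = ", ")
```

A row with no matches yields an empty string in y rather than NA.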