Extracting the date that follows a string in R

Question

3

I am trying to use tidyr's extract function to pull dates out of the Notes column. The data I am working with looks like this:

dates <- data.frame(col1 = c("customer", "customer2", "customer3"),
                    Notes = c("DOB: 12/10/62
START: 09/01/2019
END: 09/01/2020", "
S/DATE: 28/08/19
R/DATE: 27/08/20", "DOB: 13/01/1980
Start:04/12/2018"),
                    End_date = NA,
                    Start_Date = NA )

I tried to extract the date that follows the string "S/DATE" like this:

extract <- extract(
  dates,
  col = "Notes",
  into = "Start_date",
  regex = "(?<=(S\\/DATE:)).*"  # Using regex lookbehind
)

However, this only extracts the string "S/DATE:" and not the date that follows it. When I try the same pattern on regex101.com, it works as expected.

Thanks, Ibrahim

- Ibrahim
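
A likely explanation, offered as a sketch rather than a confirmed diagnosis: tidyr's extract fills the into columns from the regex's capture groups, and the only group in the pattern above wraps S/DATE: inside the lookbehind, so that label is what comes back. Assuming that reading is right, moving the parentheses onto the date itself should help. Here notes_df is a shortened, hypothetical stand-in for dates.

```r
library(tidyr)

# Hypothetical two-row stand-in for the `dates` data frame in the question
notes_df <- data.frame(
  col1  = c("customer", "customer2"),
  Notes = c("DOB: 12/10/62\nSTART: 09/01/2019",
            "\nS/DATE: 28/08/19\nR/DATE: 27/08/20"),
  stringsAsFactors = FALSE
)

# The single capture group now surrounds the date, not the label
result <- extract(
  notes_df,
  col = "Notes",
  into = "Start_date",
  regex = "S/DATE:\\s*(\\S+)",
  remove = FALSE  # keep the original Notes column
)

result$Start_date
```

Rows without an S/DATE label end up as NA in Start_date, so the START variant would still need its own pattern or an alternation.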

4 Answers

1

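One approach could be the following (assuming you want whichever of S/DATE or START is present to populate a new column named Start_date). If you do not need all of these values, the syntax is easy to modify.

Explanation:

In the innermost expression, the Notes column is split into a list on the delimiters ":", space, or "\n".
Empty strings are then removed from each list element.
sapply extracts the item that follows Start or S/Date from each cleaned list element and simplifies the list to a vector (where possible).
Finally, lubridate::dmy is applied in the outermost expression.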
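
The steps above can be traced on a single note. This is a minimal base R sketch; note, parts, tokens, and idx are illustrative names that do not appear in the answer.

```r
# One note taken from the question's data
note <- "DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020"

# Step 1: split on ":", space, or newline (this leaves empty strings behind)
parts <- strsplit(note, "[: \n]")[[1]]

# Step 2: drop the empty strings
tokens <- parts[parts != ""]

# Step 3: find the label and take the token right after it
idx <- which(toupper(tokens) %in% c("S/DATE", "START"))
start_raw <- tokens[idx + 1]

start_raw
```

Passing start_raw to lubridate::dmy would then give the parsed date, as in the final step of the explanation.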

# Split on ":", space, or newline; inside [] a "|" would be a literal pipe, so it is left out
sapply(strsplit(dates$Notes, "[: \n]"),
       function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))])

[1] "09/01/2019" "28/08/19"   "04/12/2018"

If you wrap the above in lubridate::dmy, the dates are also parsed into proper Date values.

library(lubridate)

dmy(sapply(strsplit(dates$Notes, "[: \n]"),
           function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))]))

[1] "2019-01-09" "2019-08-28" "2018-12-04"

This can also go into a dplyr pipeline to create the new column in dates at the same time.

library(dplyr)
library(lubridate)

dates %>% mutate(Start_Date = dmy(sapply(strsplit(Notes, "[: \n]"),
                                         function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))])))

       col1                                             Notes End_date Start_Date
1  customer DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020       NA 2019-01-09
2 customer2              \nS/DATE: 28/08/19\nR/DATE: 27/08/20       NA 2019-08-28
3 customer3                 DOB: 13/01/1980\nStart:04/12/2018       NA 2018-12-04

- AnilGoyal

0

Another approach is to split the text and work with smaller chunks.

Step by step, using one row of data

library(dplyr)  # provides the %>% pipe used throughout

# Split the text on newlines, yielding dates with labels
dates$Notes %>% head(1) %>% strsplit("\n")

[[1]]
[1] "DOB: 12/10/62"     "START: 09/01/2019" "END: 09/01/2020"

Digging one level deeper

# Split each name/value pair on colons
dates$Notes %>% head(1) %>% strsplit("\n") %>% 
    unlist() %>% strsplit(":\\s*")

[[1]]
[1] "DOB"      "12/10/62"

[[2]]
[1] "START"      "09/01/2019"

[[3]]
[1] "END"        "09/01/2020"

Extracting the individual values

# extract a vector of name labels
dates$Notes %>% head(1) %>% strsplit("\n") %>% 
    unlist() %>% strsplit(":\\s*") %>%
    sapply(function(x) x[1])

[1] "DOB"   "START" "END" 


# extract a vector of associated values 
dates$Notes %>% head(1) %>% strsplit("\n") %>% 
    unlist() %>% strsplit(":\\s*") %>%
    sapply(function(x) x[2])

[1] "12/10/62"   "09/01/2019" "09/01/2020"

With a bit of dplyr cleverness, you get a data frame

dates %>%
    group_by(col1) %>%
    # summarize can collapse many rows into one or expand one into many
    # (dplyr >= 1.1.0 deprecates the expanding case in favor of reframe())
    summarize(
        name = Notes %>% strsplit("\n") %>%
            unlist() %>% strsplit(":\\s*") %>% 
            sapply(function(x) x[1]),
        value = Notes %>% strsplit("\n") %>% 
            unlist() %>% strsplit(":\\s*") %>% 
            sapply(function(x) x[2])
    ) %>%
    ungroup()

As a result, all the values are separated and ready for further processing.

# A tibble: 8 x 3
  col1      name   value     
  <chr>     <chr>  <chr>     
1 customer  DOB    12/10/62  
2 customer  START  09/01/2019
3 customer  END    09/01/2020
4 customer2 NA     NA        
5 customer2 S/DATE 28/08/19  
6 customer2 R/DATE 27/08/20  
7 customer3 DOB    13/01/1980
8 customer3 Start  04/12/2018

- Damian

0

I would combine stringr and lubridate:

library(dplyr)
library(stringr)

dates %>% 
  mutate(
    Start_Date = 
      sub("\ns/date:", "\nstart:", tolower(Notes)) %>% 
      str_remove_all("(.*\nstart:)|(\n.*)") %>% 
      trimws() %>% 
      lubridate::dmy()
  )

#        col1                                             Notes End_date Start_Date
# 1  customer DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020       NA 2019-01-09
# 2 customer2              \nS/DATE: 28/08/19\nR/DATE: 27/08/20       NA 2019-08-28
# 3 customer3                 DOB: 13/01/1980\nStart:04/12/2018       NA 2018-12-04

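The answer is not very concise, but I find it intuitive and easy to follow step by step.

First, I convert all letters to lowercase with tolower and replace one "start" pattern with the other (sub). Then I remove everything before the start date and everything from the next line break onward (str_remove_all). Finally, I trim the whitespace (trimws) and convert the result to a date (lubridate::dmy).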
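
To make the intermediate values visible, the same steps can be run one at a time on a single note. This is only an illustrative sketch; note and step1 through step3 are names introduced here.

```r
library(stringr)

# The second note from the question's data
note <- "\nS/DATE: 28/08/19\nR/DATE: 27/08/20"

# Lowercase everything, then rename the "s/date:" label to "start:"
step1 <- sub("\ns/date:", "\nstart:", tolower(note))

# Remove everything up to and including "start:", and everything from the next newline on
step2 <- str_remove_all(step1, "(.*\nstart:)|(\n.*)")

# Trim the leading space left after the label
step3 <- trimws(step2)

# Parse as day/month/year
start_date <- lubridate::dmy(step3)
```

Here step1 is "\nstart: 28/08/19\nr/date: 27/08/20", step2 is " 28/08/19", and step3 is "28/08/19".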

- WilliamGram


- Tim Biegeleisen · Accepted Answer

You can use sub here as a base R option:

s_date <- ifelse(grepl("S/DATE", dates$Notes),
                 sub("^.*\\bS/DATE: (\\S+).*$", "\\1", dates$Notes), NA)
s_date

[1] NA         "28/08/19" NA

Note that the call to grepl above is necessary because, by default, sub returns the entire input string (here, the full Notes value) when S/DATE is not found in the text.
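
The effect of the guard is easy to see by comparing the two variants side by side. This is a small sketch on a shortened, hypothetical notes vector; unguarded and guarded are illustrative names.

```r
# Two shortened notes: the first has no S/DATE label, the second does
notes <- c("DOB: 12/10/62\nSTART: 09/01/2019",
           "\nS/DATE: 28/08/19\nR/DATE: 27/08/20")

pattern <- "^.*\\bS/DATE: (\\S+).*$"

# Without the guard, a non-matching input comes back unchanged
unguarded <- sub(pattern, "\\1", notes)

# With the guard, non-matching inputs become NA instead
guarded <- ifelse(grepl("S/DATE", notes), unguarded, NA)

unguarded[1] == notes[1]  # the first note was returned as-is
guarded
```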