将由多个数据框(n x 2 数据框)组成的列表整理为单个数据框(n x 3 列),整理后的数据更加清晰。

3
我从许多文件夹中读取小文本文件并将它们存储到列表中。因此,我有一个长度为n的列表,其中包含2个data.frames
以下是列表中第3个元素的示例(请参阅问题末尾的dput)。
ip_list[[3]]
$`dc:a6:32:2d:b6:c4`
# A tibble: 2 x 1
  X1                               
  <chr>                            
1 MAC address is: dc:a6:32:2d:b6:c4
2 IP is: 18.21.162.74              

$`dc:a6:32:2d:b6:c4_running`
# A tibble: 1 x 1
  datetime           
  <dttm>             
1 2020-03-13 19:11:07

我的目标是将列表转换为带有n台机器和3列(mac、ip、datetime)的数据框。我已经使用了可能有些繁琐的方式来完成这个任务:

n_machines <- length(ip_list)

# first element will be the mac and ip
df  <- lapply(1:n_machines,
 function (xx) as.data.frame(t(ip_list[[xx]][[1]]),
    stringsAsFactors = FALSE)) %>%
    bind_rows() %>%
    # now clean
    rename(mac = V1, ip = V2) %>% 
    mutate(mac = str_remove(mac, "MAC address is: "),
           ip = str_remove(ip, "IP is: "))

# second element will be running time

running_time <- lapply(1:n_machines,
 function (xx) as.data.frame(t(ip_list[[xx]][[2]]),
  stringsAsFactors = FALSE)) %>%
    bind_rows() %>%
    rename(datetime = V1)


# join stuff (order should be kept)
df <- bind_cols(df, running_time)

这将产生预期的结果:

df
                 mac            ip            datetime
1  dc:a6:32:21:59:2b  18.21.129.94                    
2  dc:a6:32:2d:8c:ca 18.21.171.210                    
3  dc:a6:32:2d:b6:c4  18.21.162.74 2020-03-13 19:11:07
4  dc:a6:32:2d:b8:62  18.21.178.96                                    

问题:有没有更好的方法?我感觉应该有一种方法来完成这个任务。特别是:

  • 依赖元素的顺序可能会隐藏问题(与其依赖1:n的顺序,我更希望有一种通过MAC地址合并的方法)
  • 整个lapply的过程虽然能够完成任务,但我感觉它很难调试(如果n_machines = 0,则肯定会出问题)
  • 使用as.data.frame(t(...))非常麻烦,而且我会丢失名称(之后必须重新分配),但我找不到pivot_wider或类似方法。

这里是一个小的dput示例:

dput(ip_list[1:4])
list(list(`dc:a6:32:21:59:2b` = structure(list(X1 = c("MAC address is: dc:a6:32:21:59:2b", 
"IP is: 18.21.129.94")), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -2L), spec = structure(list(
    cols = list(X1 = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 0), class = "col_spec")), `dc:a6:32:21:59:2b_running` = structure(list(
    datetime = ""), class = "data.frame", row.names = c(NA, -1L
))), list(`dc:a6:32:2d:8c:ca` = structure(list(X1 = c("MAC address is: dc:a6:32:2d:8c:ca", 
"IP is: 18.21.171.210")), class = c("spec_tbl_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -2L), spec = structure(list(
    cols = list(X1 = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 0), class = "col_spec")), `dc:a6:32:2d:8c:ca_running` = structure(list(
    datetime = ""), class = "data.frame", row.names = c(NA, -1L
))), list(`dc:a6:32:2d:b6:c4` = structure(list(X1 = c("MAC address is: dc:a6:32:2d:b6:c4", 
"IP is: 18.21.162.74")), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -2L), spec = structure(list(
    cols = list(X1 = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 0), class = "col_spec")), `dc:a6:32:2d:b6:c4_running` = structure(list(
    datetime = structure(1584126667.65542, class = c("POSIXct", 
    "POSIXt"), tzone = "UTC")), class = c("spec_tbl_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -1L), spec = structure(list(
    cols = list(datetime = structure(list(format = ""), class = c("collector_datetime", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 0), class = "col_spec"))), list(`dc:a6:32:2d:b8:62` = structure(list(
    X1 = c("MAC address is: dc:a6:32:2d:b8:62", "IP is: 18.21.178.96"
    )), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), spec = structure(list(cols = list(
    X1 = structure(list(), class = c("collector_character", "collector"
    ))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 0), class = "col_spec")), `dc:a6:32:2d:b8:62_running` = structure(list(
    datetime = ""), class = "data.frame", row.names = c(NA, -1L
))))

更新

这是一个与 dplyr 相关的问题,它在我升级到开发版(目前版本为 0.8.99.9002)时得到了解决。

两个回答都提供了预期的结果,并且我认为它们都是改进。我认为被接受的答案更易于阅读,但这是高度主观的。我的唯一担忧是,我的以前的 lapply 选项随着时间的推移可能会变得相当稳定,而 purrr/dplyr 则经常会被淘汰。

2个回答

2

一个涉及到 dplyr, tidyr, purrrstringr 的选项可能是:

map_dfr(.x = ip_list, ~ .x %>% 
         bind_rows() %>%
         mutate_all(as.character) %>%
         pivot_longer(everything(), values_drop_na = TRUE), .id = "ID") %>%
 mutate(name = if_else(name == "datetime", name, str_extract(value, "^MAC|^IP")),
        value = str_remove(value, ".*: ")) %>%
 pivot_wider(names_from = "name", values_from = "value") 

  ID    MAC               IP            datetime           
  <chr> <chr>             <chr>         <chr>              
1 1     dc:a6:32:21:59:2b 18.21.129.94  ""                 
2 2     dc:a6:32:2d:8c:ca 18.21.171.210 ""                 
3 3     dc:a6:32:2d:b6:c4 18.21.162.74  2020-03-13 19:11:07
4 4     dc:a6:32:2d:b8:62 18.21.178.96  ""        

如果您需要将日期时间转换为实际的日期和时间,可以使用 lubridate 库:

map_dfr(.x = ip_list, ~ .x %>% 
         bind_rows() %>%
         mutate_all(as.character) %>%
         pivot_longer(everything(), values_drop_na = TRUE), .id = "ID") %>%
 mutate(name = if_else(name == "datetime", name, str_extract(value, "^MAC|^IP")),
        value = str_remove(value, ".*: ")) %>%
 pivot_wider(names_from = "name", values_from = "value") %>%
 mutate(datetime = ymd_hms(datetime))

1
一种选择是使用map循环遍历“ip_list”,绑定列(bind_cols),将“datetime”列转换为DateTime类,因为某些元素仅具有空值,然后使用map_dfr将“X1”列(转换为“data.frame”)分成两部分,并使用pivot_wider将其重塑为“wide”格式。
library(dplyr)
library(purrr)
library(stringr)
library(lubridate)
map_dfr(ip_list,  ~ 
       .x %>%
           bind_cols %>%
           mutate(datetime = ymd_hms(datetime)), .id = 'grp')  %>% 
   separate(X1, into = c('grp1','val'), sep=" is: ") %>%
   mutate(grp1 = word(grp1, 1)) %>%
   pivot_wider(names_from = grp1, values_from = val) %>%
   select(mac = MAC, ip = IP, datetime) 
# A tibble: 4 x 3
#  mac               ip            datetime           
#  <chr>             <chr>         <dttm>             
#1 dc:a6:32:21:59:2b 18.21.129.94  NA                 
#2 dc:a6:32:2d:8c:ca 18.21.171.210 NA                 
#3 dc:a6:32:2d:b6:c4 18.21.162.74  2020-03-13 19:11:07
#4 dc:a6:32:2d:b8:62 18.21.178.96  NA            

我在使用 map_dfr() 命令时遇到了 Error: Argument 2 must be length 2, not 1 的错误提示,可能是因为我使用的 purrr 版本过旧(0.3.3)? - Matias Andina
@MatiasAndina 这个错误是基于你展示的相同例子吗?我也有相同的 purrr 版本。 - akrun
@MatiasAndina,您是否在使用map(ip_list, ~ .x %>% bind_cols %>% mutate(datetime = ymd_hms(datetime)))时遇到了错误? - akrun
@MatiasAndina 那么,这一定是 dplyr 的问题。例如,map(ip_list, bind_cols) 对我有效。 - akrun
当你执行bind_cols(ip_list [ [1]])时,是否有任何错误?我发现其中一个数据集是 data.frame,另一个是 tibble,但这不应该有任何问题。 - akrun

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接