如何将数据框转换为树形结构？

Question

如何将数据框转换为树形结构？

3

这是一个简单的分类法（标签和ID）：

test_data <- data.frame(
  cat_id = c(661, 197, 228, 650, 126, 912, 949, 428),
  cat_h1 = c(rep("Animals", 5), rep("Plants", 3)),
  cat_h2 = c(rep("Mammals", 3), rep("Birds", 2), c("Wheat", "Grass", "Other")),
  cat_h3 = c("Dogs", "Dogs", "Other", "Hawks", "Other", rep(NA, 3)),
  cat_h4 = c("Big", "Little", rep(NA, 6)))

解析后的结构应该与以下匹配：

list(
  Animals = list(Mammals = list(Dogs  = list(Big = 661, Little = 197), Other = 228),
                 Birds   = list(Hawks = 650, Other = 126)),
  Plants  = list(Wheat = 912, Grass = 949, Other = 428))

- dholstius

3个回答

3

也许不是最高效的方式，但也不算太难：创建数据：

test_data <- data.frame(
  cat_id = c(661, 197, 228, 650, 126, 912, 949, 428),
  cat_h1 = c(rep("Animals", 5), rep("Plants", 3)),
  cat_h2 = c(rep("Mammals", 3), rep("Birds", 2), c("Wheat", "Grass", "Other")),
  cat_h3 = c("Dogs", "Dogs", "Other", "Hawks", "Other", rep(NA, 3)),
  cat_h4 = c("Big", "Little", rep(NA, 6)))

循环遍历数据帧并构建列表/树：

tax <- list()  ## initialize
for (i in 1:nrow(test_data)) {
    ## convert data.frame row to character vector
    taxdat <- sapply(test_data[i,-1],as.character)
    taxstr <- character(0)  ## initialize taxon string
    ntax <- length(na.omit(taxdat))
    for (j in 1:ntax) {
        taxstr <- c(taxstr,taxdat[j])  ## build string
        if (is.null(tax[[taxstr]])) {
            tax[[taxstr]] <- list()  ## initialize if necessary
        }
    }
    tax[[taxstr]] <- test_data$cat_id[i]  ## assign value to tip
}

将结果与期望结果进行比较：

res <- list(
  Animals = list(Mammals = list(Dogs  = list(Big = 661, Little = 197),
                 Other = 228),
                 Birds   = list(Hawks = 650, Other = 126)),
  Plants  = list(Wheat = 912, Grass = 949, Other = 428))

all.equal(res,tax)  ## TRUE

- Ben Bolker

жҲ‘и§үеҫ—дҪҝз”ЁReduce()жҲ–иҖ…split()дёҖе®ҡжңүи§ЈеҶіж–№жЎҲпјҢдҪҶжҳҜжҲ‘зҺ°еңЁжғідёҚеҮәжқҘгҖӮ - thelatemail

@TheTime 感谢您指出 'data.tree' 包的指针。+1！ - dholstius

1

我会避免使用列表结构，而更喜欢整洁的数据。以下是减少数据冗余的方法。

library(dplyr)

h1_h2 = 
  test_data %>%
  select(cat_h1, cat_h2) %>%
  distinct %>%
  filter(cat_h2 %>% is.na %>% `!`)

h2_h3 =
  test_data %>%
  select(cat_h2, cat_h3) %>%
  distinct %>%
  filter(cat_h3 %>% is.na %>% `!`)

h3_h4 = 
  test_data %>%
  select(cat_h3, cat_h4) %>%
  distinct %>%
  filter(cat_h4 %>% is.na %>% `!`)

原文可以很容易地重建：

h1_h2 %>%
  left_join(h2_h3) %>%
  left_join(h3_h4)

编辑：这里有一种自动化整个过程的方法。

library(dplyr)
library(lazyeval)

adjacency = function(data) {
  adjacency_table = function(data, larger_name, smaller_name)
    lazy(data %>%
           select(larger_name, smaller_name) %>%
           distinct %>%
           filter(smaller_name %>% is.na %>% `!`) ) %>%
    interp(larger_name = larger_name %>% as.name, 
           smaller_name = smaller_name %>% as.name) %>%
    lazy_eval %>%
    setNames(c("larger", "smaller"))

  data_frame(smaller_name = data %>% names) %>%
    mutate(larger_name = smaller_name %>% lag) %>%
    slice(-1) %>%
    group_by(larger_name, smaller_name) %>%
    do(adjacency_table(data, .$larger_name, .$smaller_name) )
}

result = 
  test_data %>%
  select(-cat_id) %>%
  adjacency

- bramtayl

5

但这完全不是 OP 所问的。我能理解“这样做不太好，这样做更好”的想法，但这似乎十分离题... - Ben Bolker

@BenBolker 从技术上讲，这是不相关的话题，但实际上（幸运地？）预见了我的紧急需求，即将树重新表示为邻接列表形式（而不是原始的“列谱”形式）！ - dholstius

我可以看出这个代码块可以被泛化，包装在'lapply'中，然后连接到'bind_rows'。也许只需要再加一步就可以转换成'Reduce'了。但是，这里有一个潜在的问题，如果有两个或更多的节点具有相同的标签（但从根开始的路径不同），则可能会出现歧义/冲突问题，这一点在原始帖子中没有表现出来。 - dholstius

我已经使用新的自动化版本进行了编辑。是的，存在歧义的可能性。然而，如果确实存在两个或更多节点具有相同标签但不同路径的情况，那么原始表中实际上不存在冗余，可以保留原样。 - bramtayl

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Neal Fultz · Accepted Answer

如果您可以接受订单略微改变，这是一种按列处理的递归解决方案：

f <- function(x, d=cbind(x,NA)) {
    c( 
       # call f by branch
       if(ncol(d) > 3) local({
         x <- d[!is.na(d[[3]]),] 
         by( x[-2], droplevels(x[2]), f, x=NA, simplify=FALSE) 
       }), 
       # leaf nodes
       setNames(as.list(d[[1]]), d[[2]])[is.na(d[[3]])] 
    )
}

这将会产生这样的结果：

> str(f(test_data))
List of 2
 $ Animals:List of 2
  ..$ Birds  :List of 2
  .. ..$ Hawks: num 650
  .. ..$ Other: num 126
  ..$ Mammals:List of 2
  .. ..$ Dogs :List of 2
  .. .. ..$ Big   : num 661
  .. .. ..$ Little: num 197
  .. ..$ Other: num 228
 $ Plants :List of 3
  ..$ Wheat: num 912
  ..$ Grass: num 949
  ..$ Other: num 428