如何将数据框转换为树形结构?

3

这是一个简单的分类法(标签和ID):

test_data <- data.frame(
  cat_id = c(661, 197, 228, 650, 126, 912, 949, 428),
  cat_h1 = c(rep("Animals", 5), rep("Plants", 3)),
  cat_h2 = c(rep("Mammals", 3), rep("Birds", 2), c("Wheat", "Grass", "Other")),
  cat_h3 = c("Dogs", "Dogs", "Other", "Hawks", "Other", rep(NA, 3)),
  cat_h4 = c("Big", "Little", rep(NA, 6)))

解析后的结构应该与以下匹配:
list(
  Animals = list(Mammals = list(Dogs  = list(Big = 661, Little = 197), Other = 228),
                 Birds   = list(Hawks = 650, Other = 126)),
  Plants  = list(Wheat = 912, Grass = 949, Other = 428))
3个回答

7

如果您可以接受订单略微改变,这是一种按列处理的递归解决方案:

f <- function(x, d=cbind(x,NA)) {
    c( 
       # call f by branch
       if(ncol(d) > 3) local({
         x <- d[!is.na(d[[3]]),] 
         by( x[-2], droplevels(x[2]), f, x=NA, simplify=FALSE) 
       }), 
       # leaf nodes
       setNames(as.list(d[[1]]), d[[2]])[is.na(d[[3]])] 
    )
}

这将会产生这样的结果:
> str(f(test_data))
List of 2
 $ Animals:List of 2
  ..$ Birds  :List of 2
  .. ..$ Hawks: num 650
  .. ..$ Other: num 126
  ..$ Mammals:List of 2
  .. ..$ Dogs :List of 2
  .. .. ..$ Big   : num 661
  .. .. ..$ Little: num 197
  .. ..$ Other: num 228
 $ Plants :List of 3
  ..$ Wheat: num 912
  ..$ Grass: num 949
  ..$ Other: num 428

不错!我使用类似的“by/split”逻辑最接近的代码是 with(test_data,Map(split,split(cat_id,cat_h1),split(cat_h2,cat_h1))) ,但后来一切都崩溃了。 - thelatemail
顺序不重要!递归也可以。非常感谢! - dholstius

3
也许不是最高效的方式,但也不算太难: 创建数据:
test_data <- data.frame(
  cat_id = c(661, 197, 228, 650, 126, 912, 949, 428),
  cat_h1 = c(rep("Animals", 5), rep("Plants", 3)),
  cat_h2 = c(rep("Mammals", 3), rep("Birds", 2), c("Wheat", "Grass", "Other")),
  cat_h3 = c("Dogs", "Dogs", "Other", "Hawks", "Other", rep(NA, 3)),
  cat_h4 = c("Big", "Little", rep(NA, 6)))

循环遍历数据帧并构建列表/树:

tax <- list()  ## initialize
for (i in 1:nrow(test_data)) {
    ## convert data.frame row to character vector
    taxdat <- sapply(test_data[i,-1],as.character)
    taxstr <- character(0)  ## initialize taxon string
    ntax <- length(na.omit(taxdat))
    for (j in 1:ntax) {
        taxstr <- c(taxstr,taxdat[j])  ## build string
        if (is.null(tax[[taxstr]])) {
            tax[[taxstr]] <- list()  ## initialize if necessary
        }
    }
    tax[[taxstr]] <- test_data$cat_id[i]  ## assign value to tip
}

将结果与期望结果进行比较:

res <- list(
  Animals = list(Mammals = list(Dogs  = list(Big = 661, Little = 197),
                 Other = 228),
                 Birds   = list(Hawks = 650, Other = 126)),
  Plants  = list(Wheat = 912, Grass = 949, Other = 428))

all.equal(res,tax)  ## TRUE

жҲ‘и§үеҫ—дҪҝз”ЁReduce()жҲ–иҖ…split()дёҖе®ҡжңүи§ЈеҶіж–№жЎҲпјҢдҪҶжҳҜжҲ‘зҺ°еңЁжғідёҚеҮәжқҘгҖӮ - thelatemail
@TheTime 感谢您指出 'data.tree' 包的指针。+1! - dholstius

1
我会避免使用列表结构,而更喜欢整洁的数据。以下是减少数据冗余的方法。
library(dplyr)

h1_h2 = 
  test_data %>%
  select(cat_h1, cat_h2) %>%
  distinct %>%
  filter(cat_h2 %>% is.na %>% `!`)

h2_h3 =
  test_data %>%
  select(cat_h2, cat_h3) %>%
  distinct %>%
  filter(cat_h3 %>% is.na %>% `!`)

h3_h4 = 
  test_data %>%
  select(cat_h3, cat_h4) %>%
  distinct %>%
  filter(cat_h4 %>% is.na %>% `!`)

原文可以很容易地重建:
h1_h2 %>%
  left_join(h2_h3) %>%
  left_join(h3_h4)

编辑:这里有一种自动化整个过程的方法。
library(dplyr)
library(lazyeval)

adjacency = function(data) {
  adjacency_table = function(data, larger_name, smaller_name)
    lazy(data %>%
           select(larger_name, smaller_name) %>%
           distinct %>%
           filter(smaller_name %>% is.na %>% `!`) ) %>%
    interp(larger_name = larger_name %>% as.name, 
           smaller_name = smaller_name %>% as.name) %>%
    lazy_eval %>%
    setNames(c("larger", "smaller"))

  data_frame(smaller_name = data %>% names) %>%
    mutate(larger_name = smaller_name %>% lag) %>%
    slice(-1) %>%
    group_by(larger_name, smaller_name) %>%
    do(adjacency_table(data, .$larger_name, .$smaller_name) )
}

result = 
  test_data %>%
  select(-cat_id) %>%
  adjacency

5
但这完全不是 OP 所问的。我能理解“这样做不太好,这样做更好”的想法,但这似乎十分离题... - Ben Bolker
@BenBolker 从技术上讲,这是不相关的话题,但实际上(幸运地?)预见了我的紧急需求,即将树重新表示为邻接列表形式(而不是原始的“列谱”形式)! - dholstius
我可以看出这个代码块可以被泛化,包装在'lapply'中,然后连接到'bind_rows'。也许只需要再加一步就可以转换成'Reduce'了。但是,这里有一个潜在的问题,如果有两个或更多的节点具有相同的标签(但从根开始的路径不同),则可能会出现歧义/冲突问题,这一点在原始帖子中没有表现出来。 - dholstius
我已经使用新的自动化版本进行了编辑。是的,存在歧义的可能性。然而,如果确实存在两个或更多节点具有相同标签但不同路径的情况,那么原始表中实际上不存在冗余,可以保留原样。 - bramtayl

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接