How to prepare input data for a Sankey diagram in R?

5
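I am trying to produce a Sankey diagram, also known as a river plot, in R. I have seen the question about making Sankey diagrams in R, which lists many packages that can produce them. Since I already have input data and know the different tools/packages, I can produce such a diagram, but my question is: how do I prepare the input data?
Suppose we want to show how users migrate between states over 10 days, and we start from the following data set: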
data.frame(userID = 1:100,
                     day1_state = sample(letters[1:8], replace = TRUE, size = 100),
                     day2_state = sample(letters[1:8], replace = TRUE, size = 100),
                     day3_state = sample(letters[1:8], replace = TRUE, size = 100),
                     day4_state = sample(letters[1:8], replace = TRUE, size = 100),
                     day5_state = sample(letters[1:8], replace = TRUE, size = 100),
                     day6_state = sample(letters[1:8], replace = TRUE, size = 100),
                     day7_state = sample(letters[1:8], replace = TRUE, size = 100),
                     day8_state = sample(letters[1:8], replace = TRUE, size = 100),
                     day9_state = sample(letters[1:8], replace = TRUE, size = 100),
                     day10_state = sample(letters[1:8], replace = TRUE, size = 100)
                     ) -> dt

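Now, if I want to create a Sankey diagram with networkD3, how should I transform this dt data frame into the required input format,
so that we end up with input like the one used in this example?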
library(networkD3)
URL <- paste0(
        "https://cdn.rawgit.com/christophergandrud/networkD3/",
        "master/JSONdata/energy.json")
Energy <- jsonlite::fromJSON(URL)
# Plot
sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
             Target = "target", Value = "value", NodeID = "name",
             units = "TWh", fontSize = 12, nodeWidth = 30)
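For reference, the shape of the two inputs used in the example above can be sketched by hand. This is a minimal illustration with made-up node names and values (they are assumptions, not taken from the energy data): Nodes is a data frame with one row per node, and Links refers to those rows through zero-based indices.

```r
# Hypothetical toy data, only to illustrate the shape of the two inputs
nodes <- data.frame(name = c("a day1", "b day1", "a day2", "b day2"),
                    stringsAsFactors = FALSE)

# source/target are zero-based row indices into `nodes`
links <- data.frame(source = c(0, 0, 1),
                    target = c(2, 3, 3),
                    value  = c(5, 2, 7))

# Every index must point at an existing row of `nodes`
valid_ids <- seq_len(nrow(nodes)) - 1
stopifnot(all(c(links$source, links$target) %in% valid_ids))

# Draw the diagram only if networkD3 is installed
if (requireNamespace("networkD3", quietly = TRUE)) {
  p <- networkD3::sankeyNetwork(Links = links, Nodes = nodes,
                                Source = "source", Target = "target",
                                Value = "value", NodeID = "name")
}
```

So the task for dt comes down to counting transitions between consecutive days and mapping each "state + day" label to such an index.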

Edit

I found a script that prepares data for a different case and adapted it, so I think this question can probably be closed now:

https://github.com/mi2-warsaw/JakOniGlosowali/blob/master/sankey/sankey.R


1
Don't close it; consider posting an answer to your own question instead. It may help others. - PavoDive
1
OK, I have posted code with an example as an answer :) - Marcin
3 Answers

2

I found a script that prepares data for a different case and reworked it, so I think this question can probably be closed:

https://github.com/mi2-warsaw/JakOniGlosowali/blob/master/sankey/sankey.R

The following code then produces a Sankey diagram for the data frame from the question.

fixtable <- function(...) {
    tab <- table(...)
    if (substr(colnames(tab)[1],1,1) == "_" &&
                substr(rownames(tab)[1],1,1) == "_") {
        tab2 <- tab
        colnames(tab2) <- sapply(strsplit(colnames(tab2), split=" "), `[`, 1)
        rownames(tab2) <- sapply(strsplit(rownames(tab2), split=" "), `[`, 1)
        tab2[1,1] <- 0
        # mandate within the club
        for (par in names(which(tab2[1,] > 0))) {
            delta = min(tab2[par, 1], tab2[1, par])
            tab2[par, par] = tab2[par, par] + delta
            tab2[1, par] = tab2[1, par] - delta
            tab2[par, 1] = tab2[par, 1] - delta
        }
        # passes through the independents ("niez.")
        for (par in names(which(tab2[1,] > 0))) {
            tab2["niez.", par] = tab2["niez.", par] + tab2[1, par]
            tab2[1, par] = 0
        }
        for (par in names(which(tab2[,1] > 0))) {
            tab2[par, "niez."] = tab2[par, "niez."] + tab2[par, 1]
            tab2[par, 1] = 0
        }

        tab[] <- tab2[] 
    }
    tab
}


# Count transitions between each pair of consecutive days
# (the data frame from the question is called dt)
flow2 <- rbind(
    data.frame(fixtable(z = paste0(dt$day1_state, " day1"), do = paste0(dt$day2_state, " day2"))),
    data.frame(fixtable(z = paste0(dt$day2_state, " day2"), do = paste0(dt$day3_state, " day3"))),
    data.frame(fixtable(z = paste0(dt$day3_state, " day3"), do = paste0(dt$day4_state, " day4"))),
    data.frame(fixtable(z = paste0(dt$day4_state, " day4"), do = paste0(dt$day5_state, " day5"))),
    data.frame(fixtable(z = paste0(dt$day5_state, " day5"), do = paste0(dt$day6_state, " day6"))),
    data.frame(fixtable(z = paste0(dt$day6_state, " day6"), do = paste0(dt$day7_state, " day7"))),
    data.frame(fixtable(z = paste0(dt$day7_state, " day7"), do = paste0(dt$day8_state, " day8"))),
    data.frame(fixtable(z = paste0(dt$day8_state, " day8"), do = paste0(dt$day9_state, " day9"))),
    data.frame(fixtable(z = paste0(dt$day9_state, " day9"), do = paste0(dt$day10_state, " day10"))))

flow2 <- flow2[flow2[,3] > 0,]

nodes2 <- data.frame(name=unique(c(levels(factor(flow2[,1])), levels(factor(flow2[,2])))))
nam2 <- seq_along(nodes2[,1])-1
names(nam2) <- nodes2[,1]

links2 <- data.frame(source = nam2[as.character(flow2[,1])],
                                        target = nam2[as.character(flow2[,2])],
                                        value = flow2[,3])

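# Use the objects built above (links2 / nodes2)
# Note: networkD3 releases based on D3 v4 expect
# colourScale = JS("d3.scaleOrdinal(d3.schemeCategory20);") instead
sankeyNetwork(Links = links2, Nodes = nodes2,
                            Source = "source", Target = "target",
                            Value = "value", NodeID = "name",
                            fontFamily = "Arial", fontSize = 12, nodeWidth = 40,
                            colourScale = "d3.scale.category20()")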
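The special branch of fixtable only fires for labels that start with "_", which never happens for the question's data, so the same preparation can be sketched more compactly in base R. This is a minimal sketch, not the original script: set.seed, the loop that rebuilds dt, and the names state_cols, flow, nodes and links are assumptions made for illustration.

```r
set.seed(1)  # assumption: reproducible stand-in for the question's `dt`
dt <- data.frame(userID = 1:100)
for (d in 1:10) {
  dt[[paste0("day", d, "_state")]] <-
    sample(letters[1:8], size = 100, replace = TRUE)
}

state_cols <- paste0("day", 1:10, "_state")

# Count transitions between each pair of consecutive days
flow <- do.call(rbind, lapply(seq_len(length(state_cols) - 1), function(i) {
  tab <- table(from = paste(dt[[state_cols[i]]],     paste0("day", i)),
               to   = paste(dt[[state_cols[i + 1]]], paste0("day", i + 1)))
  as.data.frame(tab, responseName = "value", stringsAsFactors = FALSE)
}))
flow <- flow[flow$value > 0, ]  # drop transitions that never occur

# One node per "state day" label; links use zero-based node indices
nodes <- data.frame(name = unique(c(flow$from, flow$to)),
                    stringsAsFactors = FALSE)
links <- data.frame(source = match(flow$from, nodes$name) - 1,
                    target = match(flow$to,   nodes$name) - 1,
                    value  = flow$value)
```

The resulting links and nodes can be passed to sankeyNetwork() with Source = "source", Target = "target", Value = "value" and NodeID = "name", as in the call above.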

2

I asked a similar question a while ago. I want to post here how this can be done with the tidyverse.


library(ggplot2)
library(ggalluvial)
library(tidyr)
library(dplyr)
library(stringr)

# The actual data preparation happens here
dt_new <- dt %>%
  gather(day, state, -userID) %>%                  # Long format
  mutate(day = str_match(day, "[0-9]+")[, 1]) %>%  # Get the numbers
  mutate(day = as.integer(day),                    # Convert to proper data types
         state = as.factor(state))

This is what the data dt_new looks like.
   userID day state
1       1   1     d
2       2   1     d
3       3   1     g
4       4   1     a
5       5   1     a
6       6   1     d
7       7   1     d
8       8   1     b
9       9   1     d
10     10   1     e
...
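The tidyr documentation describes gather() as superseded by pivot_longer(), so the same long format can also be sketched as follows. The small dt built here is a stand-in for the question's data (an assumption for illustration), and names_transform needs a reasonably recent tidyr.

```r
library(tidyr)
library(dplyr)

set.seed(1)  # assumption: small stand-in for the question's `dt`
dt <- data.frame(userID = 1:5,
                 day1_state = sample(letters[1:3], 5, replace = TRUE),
                 day2_state = sample(letters[1:3], 5, replace = TRUE))

dt_new <- dt %>%
  pivot_longer(-userID,
               names_to = "day",
               names_pattern = "day(\\d+)_state",        # keep only the day number
               names_transform = list(day = as.integer), # convert it to integer
               values_to = "state") %>%
  mutate(state = as.factor(state)) %>%
  arrange(day, userID)  # same row order as the gather() version
```

The columns are userID, day and state, so the result can be used in place of the gather() version.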

Now plot the Sankey diagram:

  ggplot(dt_new,
       aes(x = day, stratum = state, alluvium = userID, fill = state, label = state)) +
  geom_stratum() +
  geom_text(stat = "stratum") +
  geom_flow()

This is the output: [image]

0
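This question is seven years old, but still worth an update.
Here is something I wrote that converts wide data into a Sankey-compatible format.
Anyone using it will need to make some edits, though: the sankey_pairs list inside the function has to be changed to the pairs of consecutive/linked columns in your own data.
The function produces the whole diagram, but anyone can pull out just the parts they need.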
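library(tidyr)      # expand()
library(dplyr)      # tibble(), mutate(), group_by(), summarise(), sym()
library(purrr)      # map_dfr()
library(networkD3)  # sankeyNetwork()

sankey_example <-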
  expand(
    tibble(),
    x = sample(c('a', 'b', 'c'), size = 10, replace = TRUE),
    y = sample(c('j', 'k', 'l'), size = 10, replace = TRUE),
    z = sample(c('q', 'r', 's'), size = 10, replace = TRUE)
  ) |> 
    mutate(value = round(runif(n = n(), 0, 100)))

create_sankey <- function(data_var){
  
  sankey_pairs <- 
    list(
      c("x",  "y"),
      c("y",  "z")
    )
  
  Sankey_transformed <- 
    map_dfr(sankey_pairs, function(col_var){
      data_var |>
        group_by(
          source = !!sym(col_var[[1]]),
          target = !!sym(col_var[[2]])
        ) |>
        summarise(value = sum(value, na.rm = TRUE), .groups = 'drop')
    })
  
  Sankey_nodes <- tibble(name = unique(c(Sankey_transformed$source, Sankey_transformed$target)))
  Sankey_transformed$IDsource <- match(Sankey_transformed$source, Sankey_nodes$name)-1
  Sankey_transformed$IDtarget <- match(Sankey_transformed$target, Sankey_nodes$name)-1
  
  out_sankeyNetwork <- 
    sankeyNetwork(
      Links     = Sankey_transformed, 
      Nodes     = Sankey_nodes,
      Source    = "IDsource", 
      Target    = "IDtarget",
      Value     = "value", 
      NodeID    = "name", 
      sinksRight= FALSE,
      fontSize  = 14
    )

  return(out_sankeyNetwork)
  
}


create_sankey(sankey_example)
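Rather than editing sankey_pairs by hand, the consecutive column pairs could be derived from an ordered vector of column names. A minimal sketch, where stage_cols is a hypothetical name introduced here for illustration:

```r
# Hypothetical column order; replace with your own stage columns
stage_cols <- c("x", "y", "z")

# Pair each column with the one that follows it
sankey_pairs <- unname(Map(c, head(stage_cols, -1), tail(stage_cols, -1)))

str(sankey_pairs)
```

For the data frame in the question, stage_cols <- paste0("day", 1:10, "_state") would give the nine day-to-day pairs.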

Awesome! Bookmarked! - stats_noob
