基于节点权重构建图的算法优化

7
我正在尝试改进一个基于一些节点属性计算得分的网络构建函数。该函数试图从图中找到最佳子网络,使节点属性的乘积最大化。
该函数从一个随机节点开始,并在第一个邻居中开始搜索。如果有一些邻居的节点得分满足阈值,则将邻居添加到第一个节点,并继续此过程,直到无法再添加(添加邻居不会产生所需的得分增量)。如果第一个邻居中没有可以产生得分增量的节点,则函数查看二级邻居。在这种情况下,连接节点的路径可能有几条,对于这种情况,选择的路径将是权重最高的最短路径(其中之一节点具有最高权重)。
我可以对代码进行并行化,但我不知道如何在这种类型的函数中实现它。
以下是该函数的代码:
build_network <-
function (G, seed, d= 2){
    net <- G
    d <- d

    score.fun<-function(g){
    Za <- sum(V(g)$weight*V(g)$RWRNodeweight)/sqrt(sum(V(g)$RWRNodeweight^2))
    k <- vcount(g)
    tmp <- genesets.length.null.stat[[as.character(k)]] # genesets.length.null.stat is a list with the median of Za and sd of Za calculated for 1000 replicates of networks of size k
    Sa <- (Za-tmp[1])/tmp[2]
    }

    best.fun<-function(in.nodes,out.nodes)  {
    score<-(-Inf); best<-character()
    for(node in out.nodes){
     subG.update<-induced.subgraph(net, c(in.nodes,node))
     if( score.fun(subG.update) > score ){
        score<-score.fun(subG.update)
        best<-node
        }
    }
    list("node"=best,"score"=score)
    }   

    subG <- induced.subgraph(net, seed)
    if (!is.connected(subG)) {          #the seed must be connected
        stop("Input seeds are disjoint")
    }

    while (TRUE) {
        in.nodes <- V(subG)$name
        node_num <- vcount(subG)
        subsum <- score.fun(subG)
        #subx <- V(subG)$name
        for (rad in 1:d) {
            tmp.neigh <- unlist(neighborhood(net, order = rad, nodes = V(subG)$name)) 
            pot.nodes <- V(net)[tmp.neigh]$name
            out.nodes <- setdiff(pot.nodes, in.nodes)
            if (length(out.nodes) == 0) break

            best_node<-best.fun(in.nodes, out.nodes) 
            new_score<-best_node$score
            best_node<-best_node$node 

            if (new_score > subsum + 0.01) {
                tmp <- unlist(lapply(best_node, function(x) node2treePath(net,V(subG)$name, x))) # node2treePath is a function to retrieve the shortest path with the highest node weights
                in.nodes <- c(tmp, V(subG)$name)
                subG <- induced.subgraph(net, in.nodes)
                break
            }
        }
        if (node_num == vcount(subG)) break
    }
    return(subG)
}

我正在尝试将此函数应用于大约10,000个节点的图。以下是运行该函数的代码近似:

### generate some example data
library(igraph)
my_graph <- erdos.renyi.game(10000, 0.0003)
V(my_graph)$name <- 1:vcount(my_graph)
V(my_graph)$weight <- rnorm(10000)
V(my_graph)$RWRNodeweight <- runif(10000, min=0, max=0.05)

### Run the function
sublist = list()
for (node in V(G)$name) {
    subnet <- build_network(G, node, d)
    sublist[[node]] <- subnet }

编辑:这是head(genesets.length.null.stat)dput

structure(list(`1` = c(1.01397367504035, 1.18858228819048), `2` = c(1.61970348041337, 1.30189433386605), `3` = c(2.11767222957028, 1.36222065695878), `4` = c(2.47710421934929, 1.36968129959296), `5` = c(2.776011866622, 1.36318885187196), `6` = c(3.16885126246671, 1.42577861995897)), .Names = c("1", "2", "3", "4", "5", "6"))

这是node2treePath函数:

node2treePath <- function (G, Tnodes, node){

tmp.path <- get.all.shortest.paths(G, node, Tnodes)$res
tmp.l <- unlist(lapply(tmp.path, length))
index <- which(tmp.l == min(tmp.l))

tmp.path = tmp.path[index]
tmp.sum <- unlist(lapply(tmp.path, function(x)return(sum(V(G)[x]$weight))))
index <- which(tmp.sum == max(tmp.sum))

selected.path = tmp.path[index]
collect <- unlist(lapply(selected.path, function(x)return(V(G)[x]$name)))

return(collect)
}

你能让这个例子可复现吗?例如,genesets.length.null.stat 不存在。 - Jack Wasey
genesets.length.null.stat 是一个包含 500 个元素的列表,每个元素都存储了两个值(均值和标准差)。 - user2380782
@user2380782,你能提供一个genesets.length.null.stat的例子吗?例如:dput(head(genesets.length.null.stat)),或者一些生成随机genesets.length.null.stat对象的代码? - josliber
@josilber,实际上genesets.length.null.stat只是在运行你之前的帖子https://dev59.com/9FwY5IYBdhLWcg3wCUA7中的`josilber.rcpp`函数后得到的列表,例如`genesets.length.null.stat <- lapply(1:500, function(x)josilber.rcpp(x,1000,my_graph),然后计算列表每个组件的平均值和标准差。我已经按照你的建议添加了dput`。非常感谢,如果你需要其他任何东西,请告诉我。 - user2380782
你没有定义 node2treePath。此外,best_fun 应该是 best.fun。即使是赏金问题,你仍然需要编写可重现的示例,以便人们可以实际运行代码来解决你所述的特定问题。你可以提供一个等效的虚拟函数或者真实的函数。 - Jack Wasey
@JackWasey,我已经编辑了帖子并添加了node2treePath函数,代码可以工作但显然不是最快的,感谢您的帮助,对于没有以更好的方式发布而感到抱歉。 - user2380782
3个回答

0

由于您的分数函数仅取决于节点属性而不是边缘,因此解决方案并不唯一;您可能希望寻找最佳树。如果您重构问题,使节点成为边缘,反之亦然,则可能只需使用例如Djikstra算法即可找到最佳解。这已经在igraph包中作为shortest.paths()实现。


0

我无法阅读R代码,但根据您的描述:如果分数阈值是恒定的,那么在O(|V|+|E|+|C|^2)时间内很容易完成,其中|C|是“好”组件的数量(这将很快解释)。

首先,删除所有得分低于阈值的节点。然后在这个新图中找到所有连接的组件(可以通过从每个尚未访问的节点开始DFS来在O(|V|+|E|)时间内完成),通过将组件中所有顶点权重相乘来计算它们的分数,并用其组件ID标记每个顶点。这已经告诉您“好”的组件--不需要任何二度连接的组件。

假设这样会产生|C|个组件。创建一个空哈希表H,其中键为组件ID对,值为(长度、重量)对。现在回到第一遍删除的每个顶点v:对于每个顶点,查看所有相邻顶点并记录到每个不同组件的最短边缘(可以使用长度-|C|数组来存储到目前为止所见到的每个组件的最短边缘)。在检查完v的所有相邻顶点后,统计落入k个不同组件的相邻顶点数k:如果k≥2,则v可能应该用于连接这k(k-1)/2对组件中的某些组件对。对于可能由v连接的每对不同组件i和j,在必要时更新H的权重和距离以反映这个2边缘连接:也就是说,如果i和j尚未连接,则记录v将它们连接起来;否则,如果它们已经由某个顶点u连接,仅在v可以更好地完成任务时才更新H(即,如果v使用的总长度更短且更大的权重比u更好)。这一步可以被认为是在原始修剪图的“组件图”中构建最小生成树。每个新的“组合”组件的分数可以通过乘以两个组成部分的分数来轻松计算。
最后,只需返回乘积最大的组件即可。

谢谢你的建议。我会仔细研究并尝试实现这些代码,但其实从来没有想过使用哈希。 - user2380782
不客气!如果有任何不清楚的地方,请告诉我。 - j_random_hacker

0

针对您想要实现的逻辑(我想您可能希望以与上述答案不兼容的方式进行更改),以下代码速度约快 十倍 30%。我使用了 Rprofprofr,并以微不足道的方式重新编码了一些缓慢的部分,例如不传递命名列表对,而是从您的某个函数中传递匿名对。具有数值名称的列表对于 genesets.length.null.stat 的值非常低效。我用两个数字向量替换了它。您还经常调用 'V' 函数,这是一个大的时间消耗者:如您所见,您可以调用它一次,然后根据需要查询结果。

# node2treePath is a function to retrieve the shortest path with the highest node weights
node2treePath_jw <- function(G, Tnodes, node){

  tmp.path <- get.all.shortest.paths(G, node, Tnodes)$res
  tmp.l <- vapply(tmp.path, length, integer(1))
  index <- which(tmp.l == min(tmp.l))

  tmp.path = tmp.path[index]
  Vg <- V(G)
  tmp.sum <- vapply(tmp.path, function(x) sum(Vg[x]$weight), numeric(1))
  index <- which(tmp.sum == max(tmp.sum))

  selected.path = tmp.path[index]
  sapply(selected.path, function(x) Vg[x]$name)
}

build_network_jw <- function(net, seed, d= 2){

  score.fun <- function(Vg, k){
    Za <- sum(Vg$weight * Vg$RWRNodeweight) / sqrt(sum(Vg$RWRNodeweight^2))
    (Za - genesets_jack_a[k]) / genesets_jack_b[k]
  }

  best.fun_jw <- function(in.nodes, out.nodes)  {
    score <- (-Inf)
    best <- character()
    for (node in out.nodes) {
      subG.update <- induced.subgraph(net, c(in.nodes,node))
      Vsgu <- V(subG.update)
      Vsgu_count <- vcount(subG.update)
      sf <- score.fun(Vsgu, Vsgu_count)
      if (sf > score) {
        score <- sf
        best <- node
      }
    }
    list(best, score)
  }

  subG <- induced.subgraph(net, seed)
  if (!is.connected(subG)) {          #the seed must be connected
    stop("Input seeds are disjoint")
  }

  while (TRUE) {
    VsubG <- V(subG)
    Vnet <- V(net)
    in.nodes <- VsubG$name
    node_num <- vcount(subG)
    subsum <- score.fun(VsubG, node_num)

    for (rad in 1:d) { # d = 2
      tmp.neigh <- unlist(neighborhood(net, order = rad, nodes = VsubG$name))
      pot.nodes <- Vnet[tmp.neigh]$name
      out.nodes <- setdiff(pot.nodes, in.nodes)
      if (length(out.nodes) == 0) break

      best_node <- best.fun_jw(in.nodes, out.nodes)
      new_score <- best_node[[2]]
      best_node <- best_node[[1]]

      if (new_score > subsum + 0.01) {
        tmp <- sapply(best_node, function(x) node2treePath_jw(net, VsubG$name, x))
        in.nodes <- c(tmp, VsubG$name)
        subG <- induced.subgraph(net, in.nodes)
        break
      }
    }
    if (node_num == vcount(subG)) break
  }
  subG
}

node2treePath <- function (G, Tnodes, node){

  tmp.path <- get.all.shortest.paths(G, node, Tnodes)$res
  tmp.l <- unlist(lapply(tmp.path, length))
  index <- which(tmp.l == min(tmp.l))

  tmp.path = tmp.path[index]
  tmp.sum <- unlist(lapply(tmp.path, function(x)return(sum(V(G)[x]$weight))))
  index <- which(tmp.sum == max(tmp.sum))

  selected.path = tmp.path[index]
  collect <- unlist(lapply(selected.path, function(x)return(V(G)[x]$name)))

  return(collect)
}


build_network <- function (net, seed, d= 2){

  #genesets.length.null.stat <- structure(list(`1` = c(1.01397367504035, 1.18858228819048), `2` = c(1.61970348041337, 1.30189433386605), `3` = c(2.11767222957028, 1.36222065695878), `4` = c(2.47710421934929, 1.36968129959296), `5` = c(2.776011866622, 1.36318885187196), `6` = c(3.16885126246671, 1.42577861995897)), .Names = c("1", "2", "3", "4", "5", "6"))
  genesets.length.null.stat <- lapply(1:500, function(x) c(runif(1)+x, runif(1)+x))
  names(genesets.length.null.stat) <- 1:500

  score.fun<-function(g){
    Za <- sum(V(g)$weight*V(g)$RWRNodeweight)/sqrt(sum(V(g)$RWRNodeweight^2))
    k <- vcount(g)
    tmp <- genesets.length.null.stat[[as.character(k)]] # genesets.length.null.stat is a list with the median of Za and sd of Za calculated for 1000 replicates of networks of size k
    Sa <- (Za-tmp[1])/tmp[2]
  }

  best.fun <- function(in.nodes,out.nodes)  {
    score<-(-Inf); best<-character()
    for (node in out.nodes){
      subG.update<-induced.subgraph(net, c(in.nodes,node))
      if (score.fun(subG.update) > score) {
        score<-score.fun(subG.update)
        best<-node
      }
    }
    list("node"=best,"score"=score)
  }

  subG <- induced.subgraph(net, seed)
  if (!is.connected(subG)) {          #the seed must be connected
    stop("Input seeds are disjoint")
  }

  while (TRUE) {
    in.nodes <- V(subG)$name
    node_num <- vcount(subG)
    subsum <- score.fun(subG)
    #subx <- V(subG)$name
    for (rad in 1:d) {
      tmp.neigh <- unlist(neighborhood(net, order = rad, nodes = V(subG)$name))
      pot.nodes <- V(net)[tmp.neigh]$name
      out.nodes <- setdiff(pot.nodes, in.nodes)
      if (length(out.nodes) == 0) break

      #message("length in.nodes = ", length(in.nodes))
      #message("length out.nodes = ", length(out.nodes))

      best_node<-best.fun(in.nodes, out.nodes)
      new_score<-best_node$score
      best_node<-best_node$node

      if (new_score > subsum + 0.01) {
        tmp <- unlist(lapply(best_node, function(x) node2treePath(net,V(subG)$name, x))) # node2treePath is a function to retrieve the shortest path with the highest node weights
        in.nodes <- c(tmp, V(subG)$name)
        subG <- induced.subgraph(net, in.nodes)
        break
      }
    }
    if (node_num == vcount(subG)) break
  }
  subG
}

library(igraph)
library(profr)



library(igraph)
library(profr)

#genesets.length.null.stat <- lapply(1:500, function(x) c(runif(1)+x, runif(1)+x))
#names(genesets.length.null.stat) <- 1:500

set.seed(1)
genesets_jack_a = runif(500) + 1:500
genesets_jack_b = runif(500) + 1:500

do_it_jw <- function(n = 1000){

  my_graph <- erdos.renyi.game(n, 0.0003)
  V(my_graph)$name <- 1:vcount(my_graph)
  V(my_graph)$weight <- rnorm(n)
  V(my_graph)$RWRNodeweight <- runif(n, min = 0, max = 0.05)

  ### Run the function
  sublist = list()
  Vmg <- V(my_graph)
  for (node in Vmg$name) {
    #message(node)
    subnet <- build_network_jw(my_graph, node, 2)
    sublist[[node]] <- subnet }
}

do_it <- function(n = 1000){

  my_graph <- erdos.renyi.game(n, 0.0003)
  V(my_graph)$name <- 1:vcount(my_graph)
  V(my_graph)$weight <- rnorm(n)
  V(my_graph)$RWRNodeweight <- runif(n, min = 0, max = 0.05)

  ### Run the function
  sublist = list()
  Vmg <- V(my_graph)
  for (node in Vmg$name) {
    #message(node)
    subnet <- build_network(my_graph, node, 2)
    sublist[[node]] <- subnet }
}

library(microbenchmark)
mb <- microbenchmark(do_it(1000), do_it_jw(1000), times = 5)
print(mb)

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接