How can I find low regions in a graph using Perl/R?

Question

perl r statistics graph

4

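I'm examining some biological data, which is basically a long list (a few million values) of integers, each giving the coverage depth at that position in the genome. Here is a plot of an example dataset: [image]

I would like to look for "valleys" in this data, that is, regions that are significantly lower than their surroundings.

Note that the size of the valleys I'm looking for is not fixed: it may range from 50 bases to a few thousand. Of course, defining what counts as a valley is part of what I'm struggling with, but the earlier example is a relatively easy case for me: [image]

What kinds of approaches would you recommend for finding these valleys? I mainly program in Perl and R.

Thanks!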

- David B

3

Cross-posted here: http://stats.stackexchange.com/questions/3052/how-to-look-for-valleys-in-a-graph (How to look for "valleys" in a graph?) - Shane

2

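Could you please say so yourself when you cross-post a question? It could save me the time of answering, since you may already have an answer on the other site. - Joris Meys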

3 Answers

1

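You can define a valley by different criteria:

Depth
Width
Volume (depth*width)

You may also find valleys inside big mountains; do you want those too?

For example, there is a valley here: 1 2 3 4 1000 1000 800 800 800 1000 1000 500 200 3

Try to spell out how you would choose the valleys for given data.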
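As a rough illustration of those three criteria, here is a minimal R sketch that measures width, depth and volume for the example vector above. The function name `describe_valleys` and the cutoff of 900 are assumptions made up for this example, and depth is taken relative to the lower of the two flanking values, which is only one possible definition.

```r
# Hypothetical helper: describe runs of values below `threshold` that are
# flanked on both sides by higher values.
describe_valleys <- function(x, threshold) {
  runs   <- rle(x < threshold)
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  # keep low runs only, and drop runs that touch either end of the data
  keep   <- runs$values & starts > 1 & ends < length(x)
  starts <- starts[keep]
  ends   <- ends[keep]
  rim    <- pmin(x[starts - 1], x[ends + 1])   # lower of the two flanks
  floor  <- vapply(seq_along(starts),
                   function(i) min(x[starts[i]:ends[i]]), numeric(1))
  volume <- vapply(seq_along(starts),
                   function(i) sum(rim[i] - x[starts[i]:ends[i]]), numeric(1))
  data.frame(start = starts, end = ends,
             width = ends - starts + 1,
             depth = rim - floor,
             volume = volume)
}

x <- c(1, 2, 3, 4, 1000, 1000, 800, 800, 800, 1000, 1000, 500, 200, 3)
v <- describe_valleys(x, threshold = 900)
print(v)
```

With this cutoff only the 800 800 800 stretch is reported; the low runs at either end are dropped because they have no higher value on one side. Here volume is the summed gap below the rim, which equals depth*width only for a flat-bottomed valley like this one.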

You may want to look at watershed

- Loïc Février

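Generally, by volume. The wider it is, the less depth I need to still consider it a valley. However, there are two kinds of depth to consider: absolute depth and relative depth (relative to the surroundings). I'm thinking of relative depth. - David B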

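Find points that have much higher but nearby neighbors (pick whatever criterion you want, e.g. neighbors at least 300 higher within a neighborhood of 100). Those points are your starting points: for each value h, compute the width/height/volume of the valley of height h that has this point as its minimum. Stop at a given h based on your width/height criteria. - Loïc Février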
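One way to read that suggestion, as a minimal sketch: starting from a candidate minimum at index `i`, grow the valley outwards while values stay below `x[i] + h`. The function name `valley_extent` is hypothetical, and the stopping rule is left to the caller.

```r
# Hypothetical helper: extent of the valley of height h whose minimum is x[i].
valley_extent <- function(x, i, h) {
  level <- x[i] + h
  left <- i
  while (left > 1 && x[left - 1] < level) left <- left - 1
  right <- i
  while (right < length(x) && x[right + 1] < level) right <- right + 1
  c(left = left, right = right, width = right - left + 1)
}

x <- c(1, 2, 3, 4, 1000, 1000, 800, 800, 800, 1000, 1000, 500, 200, 3)
shallow <- valley_extent(x, i = 8, h = 100)   # stays inside the 800s
deep    <- valley_extent(x, i = 8, h = 300)   # spills over the rim to both ends
print(rbind(shallow, deep))
```

Scanning h upwards and watching where the width jumps gives a natural point at which to stop.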

0

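You may want to try a peak-detection function to identify the regions of interest. The minimum width of the valleys you want can be specified with the span parameter.

It is best to smooth the data first, to remove noise peaks like the ones inside the "valley" at the right of the blue graph. A simple stats::filter should be enough.

The last step is to check the depth of the "valleys" found. That depends on your requirements. As a first approximation, you can compare the peak values with the median level of the data.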
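The three steps could be sketched as follows. This is not the peak-detection function referred to above; the local-minimum test is a home-made stand-in built on embed(), and the data, `k`, `span` and the 0.5 factor are all assumptions chosen for illustration.

```r
# Synthetic coverage: level ~100 with one 60-base valley at positions 201..260
set.seed(1)
x <- c(rep(100, 200), rep(20, 60), rep(100, 200)) + rnorm(460, sd = 5)

# 1. Smooth with a centred moving average (NA at both ends of the result)
k <- 11
smoothed <- as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))

# 2. Local minima: points that are the lowest value in a window of `span` points
span <- 51
half <- span %/% 2
win <- embed(smoothed, span)                   # one row per full window
is_min <- win[, half + 1] == apply(win, 1, min)
minima <- which(is_min) + half                 # row i is centred on position i + half

# 3. Depth check: keep minima well below the median level of the data
valleys <- minima[smoothed[minima] < 0.5 * median(x)]
print(valleys)
```

A larger `span` suppresses narrow dips, so it plays the role of a minimum valley width.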

- Andrei

Do you have an up-to-date link to the peak detection function? - Christopher Bottoms

- Joris Meys · Accepted Answer

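We do peak (and valley) detection using a running median and the median absolute deviation. You can specify how far a value must deviate from the running median to count as a peak.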
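A minimal sketch of that idea for valleys, using base R only: stats::runmed (an unweighted running median) stands in for the weighted version, and the window size, the factor 3 and the synthetic data are assumptions chosen for illustration.

```r
# Synthetic coverage: level ~100 with a 40-base valley at positions 301..340
set.seed(42)
depth <- c(rnorm(300, 100, 5), rnorm(40, 40, 5), rnorm(300, 100, 5))

k <- 101                             # window size, must be odd
center <- stats::runmed(depth, k)    # running median
spread <- stats::mad(diff(depth))    # robust scale, estimated from first differences

# flag positions lying more than 3 scale units below the running median
low <- depth < center - 3 * spread
print(range(which(low)))
```

The window has to be clearly wider than the widest valley you expect; otherwise the running median follows the valley down and nothing is flagged.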

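In a next step, we use a binomial model to check which regions contain more "extreme" values than expected. This model (basically a score test) yields "peak regions" rather than single peaks. Inverting it to get "valley regions" is trivial.

The running median is calculated with the weightedMedian function from the aroma.light package. We use the embed() function to create a list of "windows" and apply a kernel function over it.

Application of the weighted median: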

center <- apply(embed(tmp,wdw),1,weightedMedian,w=weights,na.rm=T)

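Here, tmp is the temporary data vector and wdw is the window size (it must be odd). tmp is constructed by adding (wdw-1)/2 NA values to each side of the data vector. The weights are built with a custom function. For the mad, we use the same procedure, but on diff(data) instead of the data itself.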
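To make the padding and windowing concrete, here is a tiny sketch on five values. It uses the plain median rather than weightedMedian so that it runs without extra packages; the data are made up.

```r
x <- c(5, 1, 4, 2, 8)
wdw <- 3                              # window size, must be odd

pad <- rep(NA, (wdw - 1) / 2)         # (wdw-1)/2 NA values for each side
tmp <- c(pad, x, pad)

windows <- embed(tmp, wdw)            # one row per original position
center <- apply(windows, 1, median, na.rm = TRUE)

print(windows)
print(center)
```

Each row of `windows` holds the neighbourhood of one position (in reverse order, which does not matter for a median), and the NA padding lets the edge positions use a shortened window.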

A running example:

require(aroma.light)
# make.weights : function to make weights on basis of a normal distribution
# n is window size !!!!!!
make.weights <- function(n,
      type=c("gaussian","epanechnikov","biweight","triweight","cosinus")){
    type <- match.arg(type)
    x <- seq(-1,1,length.out=n)
    out <-switch(type,
          gaussian=(1/sqrt(2*pi)*exp(-0.5*(3*x)^2)),
          epanechnikov=0.75*(1-x^2),
          biweight=15/16*(1-x^2)^2,
          triweight=35/32*(1-x^2)^3,
          cosinus=pi/4*cos(x*pi/2)
          )
    out <- out/sum(out)*n
    return(out)
}

# score.test : function to obtain a p-value based on the score test
# uses normal approximation, but is still quite correct when p0 is
# pretty small.
# This test is one-sided, and tests whether the observed proportion
# is bigger than the hypothesized proportion
score.test <- function(x,p0,w){
    n <- length(x)
    if(missing(w)) w<-rep(1,n)
    w <- w[!is.na(x)]
    x <- x[!is.na(x)]

    if(sum(w)!=n) w <- w/sum(w)*n

    phat <- sum(x*w)/n
    z <- (phat-p0)/sqrt(p0*(1-p0)/n)
    p <- 1-pnorm(z)
    return(p)
}

# embed.na is a modification of embed, adding NA strings
# to the beginning and end of x. window size= 2n+1
embed.na <- function(x,n){
    extra <- rep(NA,n)
    x <- c(extra,x,extra)
    out <- embed(x,2*n+1)
    return(out)
}

# running.score : function to calculate the weighted p-value for the chance of being in
# a run of peaks. This chance is based on the weighted proportion of the neighbourhood
# the null hypothesis is calculated by taking the weighted proportion
# of detected peaks in the whole dataset.
# This lessens the need for adjusting parameters and makes the
# method more automatic.
# for a correct calculation, the weights have to sum up to n

running.score <- function(sel,n=20,w,p0){
    if(missing(w)) w<- rep(1,2*n+1)
    if(missing(p0))p0 <- sum(sel,na.rm=T)/length(sel[!is.na(sel)])   # null hypothesis
    out <- apply(embed.na(sel,n),1,score.test,p0=p0,w=w)
    return(out)
}

# running.med : function to calculate the running median and mad
# for a dataset. Window size = 2n+1
running.med <- function(x,w,n,cte=1.4826){
    wdw <- 2*n+1
    if(missing(w)) w <- rep(1,wdw)

    center <- apply(embed.na(x,n),1,weightedMedian,w=w,na.rm=T)
    mad <- median(abs(x-center))*cte
    return(list(med=center,mad=mad))
}

##############################################
#
# Create series
set.seed(100)
n = 1000
series <- diffinv(rnorm(20000),lag=1)

peaks <- apply(embed.na(series,n),1,function(x) x[n+1] < quantile(x,probs=0.05,na.rm=T))

pweight <- make.weights(0.2*n+1)
p.val <- running.score(peaks,n=n/10,w=pweight)

plot(series,type="l")
points((1:length(series))[p.val<0.05],series[p.val<0.05],col="red")
points((1:length(series))[peaks],series[peaks],col="blue")

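The sample code above was written to find regions with large fluctuations rather than valleys. I adapted it a bit, but it is not optimal. On top of that, for series with more than 20000 values you need a lot of memory; I can't run it on my computer any more.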
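The memory use comes from embed(), which materialises one full window per position. A possible workaround, sketched here under the assumption that speed matters less than memory, is to walk over the positions and look at one window at a time. The name `rolling_low` is hypothetical.

```r
# Hypothetical helper: TRUE where x[i] lies below the `prob` quantile of its
# own window of up to 2n+1 values; only one window is held in memory at a time.
rolling_low <- function(x, n, prob = 0.05) {
  len <- length(x)
  vapply(seq_len(len), function(i) {
    win <- x[max(1, i - n):min(len, i + n)]    # window is truncated at the edges
    x[i] < quantile(win, probs = prob, names = FALSE)
  }, logical(1))
}

set.seed(100)
s <- diffinv(rnorm(500))
low <- rolling_low(s, n = 20)
print(sum(low))
```

Truncating the window at the edges gives the same result as padding with NA and using na.rm=TRUE, so this should flag the same positions as the embed-based version.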

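Alternatively, you could use approximations of the numerical first and second derivatives to define a valley. In your case, this might work even better. A practical way to calculate the derivatives and the minima/maxima of the first derivative: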

#first derivative
f.deriv <- diff(lowess(series,f=n/length(series),delta=1)$y)
#second derivative
f.sec.deriv <- diff(f.deriv)
#minima and maxima defined by where f.sec.deriv changes sign :
minmax <- cumsum(rle(sign(f.sec.deriv))$lengths)

op <- par(mfrow=c(2,1))
plot(series,type="l")
plot(f.deriv,type="l")
points((1:length(f.deriv))[minmax],f.deriv[minmax],col="red")
par(op)
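Note that the red points above mark extrema of the first derivative, i.e. the steepest parts of the slopes. If you want the valley floors themselves, a small variation (a sketch on made-up data, with an arbitrarily chosen smoothing span f) is to look for places where the first derivative changes sign from negative to positive:

```r
# Synthetic series: two full sine waves plus noise, with troughs near
# positions 150 and 350
set.seed(100)
series <- 50 * sin(seq(0, 4 * pi, length.out = 400)) + rnorm(400, sd = 2)

smooth <- lowess(series, f = 0.1, delta = 0)$y
d1 <- diff(smooth)                       # d1[i] = smooth[i+1] - smooth[i]

# slope goes from negative to positive => local minimum of the smoothed series
minima <- which(diff(sign(d1)) > 0) + 1

plot(series, type = "l")
points(minima, smooth[minima], col = "red")
```

Each minimum can then be filtered on depth or width, for instance by comparing it with the neighbouring local maxima found the same way with `< 0`.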