我的数据按照V6中的ID进行分组,并按位置(V1:V3)排序:
我想做的是在每个V6组中,按位置添加一个额外的带索引的列。也就是说,每个V6组的第一个元素为“1”,第二个元素为“2”,以此类推。 我可以使用ddply和自定义函数实现这一点:
dt
V1 V2 V3 V4 V5 V6
1: chr1 3054233 3054733 . + ENSMUSG00000090025
2: chr1 3102016 3102125 . + ENSMUSG00000064842
3: chr1 3205901 3207317 . - ENSMUSG00000051951
4: chr1 3206523 3207317 . - ENSMUSG00000051951
5: chr1 3213439 3215632 . - ENSMUSG00000051951
6: chr1 3213609 3216344 . - ENSMUSG00000051951
7: chr1 3214482 3216968 . - ENSMUSG00000051951
8: chr1 3421702 3421901 . - ENSMUSG00000051951
9: chr1 3466587 3466687 . + ENSMUSG00000089699
10: chr1 3513405 3513553 . + ENSMUSG00000089699
我想做的是在每个V6组中,按位置添加一个额外的带索引的列。也就是说,每个V6组的第一个元素为“1”,第二个元素为“2”,以此类推。 我可以使用ddply和自定义函数实现这一点:
rankExons <- function(x){
if(unique(x$V5) == "+"){
x$index <- seq_len(nrow(x))}
else{
x$index <- rev(seq_len(nrow(x)))}
x
}
indexed <- ddply(dt, .(V6), rankExons)
indexed
V1 V2 V3 V4 V5 V6 index
1 chr1 3205901 3207317 . - ENSMUSG00000051951 6
2 chr1 3206523 3207317 . - ENSMUSG00000051951 5
3 chr1 3213439 3215632 . - ENSMUSG00000051951 4
4 chr1 3213609 3216344 . - ENSMUSG00000051951 3
5 chr1 3214482 3216968 . - ENSMUSG00000051951 2
6 chr1 3421702 3421901 . - ENSMUSG00000051951 1
7 chr1 3102016 3102125 . + ENSMUSG00000064842 1
8 chr1 3466587 3466687 . + ENSMUSG00000089699 1
9 chr1 3513405 3513553 . + ENSMUSG00000089699 2
10 chr1 3054233 3054733 . + ENSMUSG00000090025 1
不幸的是,在完整数据集(约620k行)上运行非常缓慢,而且使用并行时会崩溃。
library(doMC)
registerDoMC(cores=6)
indexed <- ddply(dt, .(V6), rankExons, .parallel=TRUE)
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Warning message:
In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, :
all scheduled cores encountered errors in user code
所以,我尝试使用data.table,但无法使其正常工作。这是我尝试过的:
setkey(dt, "V6")
dt[,index:=rankExons(dt), by=V6]
dt[,rankExons(.sd), by=V6, .SDcols=c("V5, V6")]
两者都失败了。我该如何使用data.table重新创建我的ddply?
dt <- data.table::data.table(
V1 = rep("chr1", 10L),
V2 = c(
3054233L, 3102016L, 3205901L, 3206523L, 3213439L, 3213609L,
3214482L, 3421702L, 3466587L, 3513405L
),
V3 = c(
3054733L, 3102125L, 3207317L, 3207317L, 3215632L, 3216344L,
3216968L, 3421901L, 3466687L, 3513553L
),
V4 = rep(".", 10L),
V5 = rep(c("+", "-", "+"), c(2L, 6L, 2L)),
V6 = rep(
c(
"ENSMUSG00000090025", "ENSMUSG00000064842", "ENSMUSG00000051951",
"ENSMUSG00000089699"
),
c(1L, 1L, 6L, 2L)
)
)