按多列分组，在一列中计算所有可能的值对数

Question

按多列分组，在一列中计算所有可能的值对数

5

我有一个数据框，看起来像这样（这只是一个子集，实际上数据集有2724098行）

> head(dat)

chr   start  end    enhancer motif 
chr10 238000 238600 9_EnhA1  GATA6 
chr10 238000 238600 9_EnhA1  GATA4 
chr10 238000 238600 9_EnhA1    SRF 
chr10 238000 238600 9_EnhA1  MEF2A 
chr10 375200 375400 9_EnhA1  GATA6 
chr10 375200 375400 9_EnhA1  GATA4 
chr10 440400 441000 9_EnhA1  GATA6 
chr10 440400 441000 9_EnhA1  GATA4 
chr10 440400 441000 9_EnhA1    SRF 
chr10 440400 441000 9_EnhA1  MEF2A 
chr10 441600 442000 9_EnhA1    SRF 
chr10 441600 442000 9_EnhA1  MEF2A

我能够将数据集转换为这种格式，其中每组chr、start、end和enhancer表示一个单独的ID：

> dat

 id motif 
 1  GATA6 
 1  GATA4 
 1    SRF 
 1  MEF2A 
 2  GATA6 
 2  GATA4
 3  GATA6 
 3  GATA4 
 3    SRF 
 3  MEF2A 
 4    SRF 
 4  MEF2A

我希望能够按id分组，找到每个可能的motifs对数并计数。因此，我需要一个输出表格，格式如下：

motif1 motif2 count
 GATA6  GATA4     3
 GATA6    SRF     2
 GATA6  MEF2A     2
 ... and so on for each pair of motif

在实际数据集中，有1716个独特的图案。有83509个独特的ID。

对于如何继续进行，是否有任何建议？

- Komal Rathi

你的数据中有多少个唯一的id？ - Gregor Thomas

@Gregor 我会在几分钟内编辑我的问题以便更加清晰明了。 - Komal Rathi

我不知道是否实用，但一个可能的方法是：(1)将所有内容转换为整数，并将您的分组变量压缩成一个单一的“id”，然后(2)创建一个1716x1716x83509的三维稀疏数组（使用“slam”包），其中条目是存在/不存在的二进制值，(3)你的结果是沿着“id”维度的“colsums”。 - Gregor Thomas

（2）显然是难点。或者，也许可以使用 SQL 来处理，这将是一个相当简单的自连接和聚合查询。 - Gregor Thomas

@Gregor虽然感谢您的建议，但我没有SQL方面的背景。 - Komal Rathi

显示剩余3条评论

7个回答

5

这里是一种稀疏矩阵技术，它毫不掩饰地借鉴自这个问题。

# Create an id
dat$id <- as.factor(paste(dat$chr, dat$start, dat$end, dat$enhancer))

# Create the sparse matrix.
library(Matrix)
s <- sparseMatrix(
      as.numeric(dat$id), 
      as.numeric(dat$motif),
      dimnames = list(levels(dat$id),levels(dat$motif)),
  x = TRUE)

co.oc <- t(s) %*% s # Find co-occurrences.
tab <- summary(co.oc) # Create triplet representation.
tab <- tab[tab$i < tab$j,] # Extract upper triangle of matrix

data.frame(motif1 = levels(dat$motif)[tab$i],
           motif2 = levels(dat$motif)[tab$j],
           number = tab$x)

#    motif1 motif2 number
# 1  GATA4  GATA6      3
# 2  GATA4  MEF2A      2
# 3  GATA6  MEF2A      2
# 4  GATA4    SRF      2
# 5  GATA6    SRF      2
# 6  MEF2A    SRF      3

- nograpes

1

@Arun 我怀疑可以通过仅计算矩阵的“上（或下）三角形”来使其更快。 - nograpes

3

我认为data.table包可能是最有效的选择。我们可以在每个组内计算成对数量，然后进行聚合。与先计算所有配对总数相比，这是一种更加高效的方法，特别适用于您这样规模的数据。

#Bring in data.table and convert data to data.table
require(data.table)
setDT(dat)

#Summarize by two-way pairs
summ <- dat[ , list(motifs=list(combn(unique(as.character(motif)),
   min(2,length(unique(as.character(motif)))), by=list(chr,start,end,enhancer)]

#Transpose and gather data into one table
motifs.table <- rbindlist(lapply(summ$motifs,function(x) data.table(t(x))))

#Summarize table with counts
motifs.table[ , .N, by=list(V1,V2)]

#       V1    V2 N
# 1: GATA6 GATA4 3
# 2: GATA6   SRF 2
# 3: GATA6 MEF2A 2
# 4: GATA4   SRF 2
# 5: GATA4 MEF2A 2
# 6:   SRF MEF2A 3

- Mike.Gahan

谢谢。当我尝试第二个命令summ <- ...时，我会收到以下错误：Error in combn(unique(as.character(motif)), 2) : n < m - Komal Rathi

一定会有一些只有一个模式的组。我猜这个答案没有完全考虑到这一点。我会进行修改，但是@Aruns的答案更好。 - Mike.Gahan

好的。感谢你的帮助。+1 鼓励你尝试帮忙。 - Komal Rathi

2

如果您可以将数据放入名为dat的SQL表中，此查询应该有效：

select d1.motif m1, d2.motif m2, count(*) count
from dat d1
join dat d2
on d1.chr = d2.chr
  and d1.start = d2.start
  and d1.end = d2.end
  and d1.enhancer = d2.enhancer
  and d1.motif <> d2.motif
group by d1.motif, d2.motif

考虑到你的数据规模，我怀疑 R 的 sqldf 包无法处理它，但是如果你安装了免费的 MySQL，你可以使用 RODBC 或 RJDBC 使 R 和 SQL 进行通信。

- Gregor Thomas

你几乎可以使用R函数来模拟这种行为。 - shadowtalker

可以的。只是我对 data.table 不够熟悉，不敢确定使用其他方法是否会占用更多内存。我认为这就是 @Arun 回答中实现的逻辑。 - Gregor Thomas

2

您可能会从正式建模数据的语义中受益。如果您在基因组上有范围，则可以使用Bioconductor的GenomicRanges软件包。

library(GenomicRanges)
gr <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE)

这是一个GRanges对象，其正式理解基因组位置的概念，因此这些操作都可以直接使用：

hits <- findMatches(gr, gr)
tab <- table(motif1=gr$motif[queryHits(hits)],
             motif2=gr$motif[subjectHits(hits)])
subset(as.data.frame(tab, responseName="count"), motif1 != motif2)

- Michael Lawrence

1

如果这不是你想要的，我就放弃了。显然它没有针对大数据集进行优化。这只是一个利用R的一般性算法。有几个改进的可能性，例如使用dplyr和data.table。后者将大大加快这里的[和%in%操作。

motif_pairs <- combn(unique(dat$motif), 2)
colnames(motif_pairs) <- apply(motif_pairs, 2, paste, collapse = " ")
motif_pair_counts <- apply(motif_pairs, 2, function(motif_pair) {
  sum(daply(dat[dat$motif %in% motif_pair, ], .(id), function(dat_subset){
    all(motif_pair %in% dat_subset$motif)
  }))
})
motif_pair_counts <- as.data.frame(unname(cbind(t(motif_pairs), motif_pair_counts)))
names(motif_pair_counts) <- c("motif1", "motif2", "count")
motif_pair_counts

#   motif1 motif2 count
# 1  GATA6  GATA4     3
# 2  GATA6    SRF     2
# 3  GATA6  MEF2A     2
# 4  GATA4    SRF     2
# 5  GATA4  MEF2A     2
# 6    SRF  MEF2A     3

另一个旧版本。请确保你的问题清晰明了！这正是plyr的设计目的。尝试使用dlply(dat，。（id），function（x）table（x $ motif））。但请不要仅仅尝试复制和粘贴此解决方案，至少要阅读文档。plyr是一个非常强大的包，理解它将非常有帮助。

回答错误问题的旧帖：

您是在寻找不相交还是重叠的一对？

这里提供一种使用包zoo中的函数rollapply的解决方案：

library(zoo)

motif_pairs <- rollapply(dat$motif, 2, c)              # get a matrix of pairs
motif_pairs <- apply(motif_pairs, 1, function(row) {   # for every row...
  paste0(sort(row), collapse = " ")                    #   sort the row, and concatenate it to a single string
                                                       #   (sorting ensures that pairs are not double-counted)
})
table(motif_pairs)                                     # since each pair is now represented by a unique string, just tabulate the string appearances

## if you want disjoint pairs, do `rollapply(dat$motif, 2, c, by = 2)` instead

如果这不是你需要的，请查看rollapply的文档。对于其他变量的分组，您可以将其与以下之一结合使用：

基本R函数aggregate或by（不建议使用），或者
plyr中的*ply函数（更好）

- shadowtalker

我已经将我的变量分组到一个单一的ID中。 - Komal Rathi

什么是不相交和重叠的配对？我只想计算每个组中一对出现的次数。如何将输出中的数字转换为图案名称？ - Komal Rathi

@KomalRathi 如果这不正确，你需要更清楚地说明你所说的“一对”是什么意思。 - shadowtalker

我并不是有意冒犯你，也没有任何意思暗示你的错误。我的意思是询问您什么是不相交和重叠对，然后我只是试图解释我需要什么。通过“对”，我指的是每个可能的基序对。我还提供了一个基于所展示数据的所需输出的示例。在所展示的数据中，GATA6和GATA4是两个基序。它们在3个不同的ID中共同出现。因此计数为3。同样，GATA6和SRF在2个不同的ID中共同出现，因此计数为2。 - Komal Rathi

不会有任何冒犯。我猜你的意思是“相邻”的一对，但从你的帖子中并不清楚。 - shadowtalker

显示剩余5条评论

1

这个怎么样？

res1<- split(dat$motif,dat$id)
res2<- lapply(res1,function(x) combn(x,2))
res3<- apply(do.call(cbind,res2),2,function(x) paste(x[1],x[2],sep="_"))

table(res3)

- DatamineR

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Arun · Accepted Answer

更新：以下是使用data.table实现的快速和内存高效版本：

步骤1：大致构造您需要的样本数据维度：

require(data.table) ## 1.9.4+
set.seed(1L)        ## For reproducibility
N = 2724098L
motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE)
id = sample(83509, N, TRUE)
DT = data.table(id, motif)

步骤2：预处理：

DT = unique(DT) ## IMPORTANT: not to have duplicate motifs within same id
setorder(DT)    ## IMPORTANT: motifs are ordered within id as well
setkey(DT, id)  ## reset key to 'id'. Motifs ordered within id from previous step
DT[, runlen := .I]

步骤3：解决方案：

ans = DT[DT, {
              tmp = runlen < i.runlen; 
              list(motif[tmp], i.motif[any(tmp)])
             }, 
      by=.EACHI][, .N, by="V1,V2"]

在最后的第三步，这需要约27秒和1GB的内存。

思路是进行自我连接，但利用data.table的by=.EACHI功能，它为每个i计算j-expression，因此效率更高。而j-expression确保我们仅获得条目“motif_a，motif_b”，而不是冗余的“motif_b，motif_a”。这既节省了计算时间又节省了内存。即使存在87K+的ID，二进制搜索也非常快。最后，我们按图案组合聚合以获取每个图案中的行数-这是您需要的内容。

希望对您有所帮助

附注：请参见旧版（+较慢）的版本。