聚类中的大距离矩阵

Question

聚类中的大距离矩阵

4

我在一台16 GB RAM的机器上运行R 3.2.3。我有一个大矩阵，它有3,00,000行x 12列。我想在R中使用层次聚类算法，因此在这之前，我正在尝试创建一个距离矩阵。由于数据是混合类型的，我使用不同的矩阵来处理不同类型的数据。但是，我遇到了关于内存分配的错误：

df <- as.data.frame(matrix(rnorm(36*10^5), nrow = 3*10^5))
d1=as.dist(distm(df[,c(1:2)])/10^5)
d2=dist(df[,c(3:8)], method = "euclidean") 
d3= hamming.distance(df[,c(9:12)]%>%as.matrix(.))%>%as.dist(.)

我遇到了以下错误。

> d1=as.dist(distm(df1[,c(1:2)])/10^5)
Error: cannot allocate vector of size 670.6 Gb
In addition: Warning messages:
1: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
2: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
3: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
4: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
> d2=dist(df1[,c(3:8)], method = "euclidean") 
Error: cannot allocate vector of size 335.3 Gb
In addition: Warning messages:
1: In dist(df1[, c(3:8)], method = "euclidean") :
 Reached total allocation of 16070Mb: see help(memory.size)
2: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
3: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
4: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
> d3= hamming.distance(df1[,c(9:12)]%>%as.matrix(.))%>%as.dist(.)
Error: cannot allocate vector of size 670.6 Gb
In addition: Warning messages:
1: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
2: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
3: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
4: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)

- Kanika Singhal

你不需要一次性处理所有数据，这样会消耗所有内存并出现错误。考虑分批处理，例如每次处理10000个向量。 - Patric

但是在聚类中，我们需要计算一行到所有其他行的距离。那么分批次计算会对这里有什么帮助吗？ - Kanika Singhal

是的，但你可以进行最终缩减以选择最小/最大值。这有意义吗？为了高效计算距离，您可以参考这里。 - Patric

通过选择最小/最大值来减少？抱歉，我不理解。更详细的了解可能会有所帮助。 - Kanika Singhal

是的，那不是内置函数，但你仍然可以使用 dist。或者，也许一些内存交换库可以帮助你在 RAM 和磁盘之间交换数据，但性能会很慢。对于大型数据集，大多数时候我们不得不像这样做一些技巧，即使它有点丑陋。 - Patric

显示剩余3条评论

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Patric · Accepted Answer

假设您有一个行（A）和一个3^8矩阵（B），需要通过最小距离进行聚类。为了简单起见，我们假设只有一行需要聚类。

原始方法如下：

1. load A and B
2. distance compute A with each row of B
3. select smallest one from results (reduction)

由于B非常大，您无法将其加载到内存中或在执行过程中出现错误。

批量处理的方法如下：

1. load A (suppose it is small)
2. load B.partial with 1 to 1^5 rows of B
3. compute distance of A with each row of B.partial
4. select min one in partial results and save it as res[i]
5. go back 2.) load next 1^5 rows of B 
6. final your got a 3000 partial results and saved in res[1:3000]
7. reduction : select min one from res[1:3000]
   note: if you need all distances as `dist` function, you don't need reduction and just keep this array.

代码会比原来的复杂一些，但这在处理大数据问题时是非常常见的技巧。对于计算部分，您可以参考我之前在这里的一个答案。

如果您能够在此处以批处理模式粘贴最终代码，那将非常合适，这样其他人也可以学习。

关于dist的另一个有趣的事情是它是R包中支持openMP的少数之一。请参见这里的源代码以及如何在这里使用openMP进行编译。

因此，如果您可以尝试根据您的机器设置OMP_NUM_THREADS为4或8，然后再次运行，您可以看到性能的大幅提升！

 void R_distance(double *x, int *nr, int *nc, double *d, int *diag,
    int *method, double *p)
{
     int dc, i, j;
     size_t  ij;  /* can exceed 2^31 - 1 */
     double (*distfun)(double*, int, int, int, int) = NULL;
     #ifdef _OPENMP
        int nthreads;
     #endif
     .....
 }

此外，如果您想通过GPU加速dist，您可以参考ParallelR中的talk部分。