Making UMAP faster in Python

Question

4

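I am using UMAP (https://umap-learn.readthedocs.io/en/latest/#) to reduce the dimensionality of my data. My dataset contains 4,700 samples, each with 1.2 million features (which I want to reduce). However, even with 32 CPUs and 120 GB of RAM, this takes quite a long time. The embedding construction in particular is slow, and the verbose output has not changed in the last 3.5 hours: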

UMAP(dens_frac=0.0, dens_lambda=0.0, low_memory=False, n_neighbors=10,
     verbose=True)
Construct fuzzy simplicial set
Mon Jul  5 09:43:28 2021 Finding Nearest Neighbors
Mon Jul  5 09:43:28 2021 Building RP forest with 59 trees
Mon Jul  5 10:06:10 2021 metric NN descent for 20 iterations
     1  /  20
     2  /  20
     3  /  20
     4  /  20
     5  /  20
    Stopping threshold met -- exiting after 5 iterations
Mon Jul  5 10:12:14 2021 Finished Nearest Neighbor Search
Mon Jul  5 10:12:25 2021 Construct embedding

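Is there any way to make this faster? I am already using a sparse matrix (scipy.sparse.lil_matrix) as described here: https://umap-learn.readthedocs.io/en/latest/sparse.html. I have also installed pynndescent (as described here: https://github.com/lmcinnes/umap/issues/416). My code is as follows: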

from scipy.sparse import lil_matrix
import numpy as np
import umap.umap_ as umap

term_dok_matrix = np.load('term_dok_matrix.npy')
term_dok_mat_lil = lil_matrix(term_dok_matrix, dtype=np.float32)

test = umap.UMAP(a=None, angular_rp_forest=False, b=None,
     force_approximation_algorithm=False, init='spectral', learning_rate=1.0,
     local_connectivity=1.0, low_memory=False, metric='euclidean',
     metric_kwds=None, n_neighbors=10, min_dist=0.1, n_components=2, n_epochs=None, 
     negative_sample_rate=5, output_metric='euclidean',
     output_metric_kwds=None, random_state=None, repulsion_strength=1.0,
     set_op_mix_ratio=1.0, spread=1.0, target_metric='categorical',
     target_metric_kwds=None, target_n_neighbors=-1, target_weight=0.5,
     transform_queue_size=4.0, unique=False, verbose=True).fit_transform(term_dok_mat_lil)

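Are there any tricks or ideas to speed up the computation? Can I change some parameters? My matrix consists only of 0s and 1s (that is, every non-zero entry is 1); does that help?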

- LaLeLo

2 Answers

1

You can run PCA on the dataset first. The maximum number of principal components is 4,700, which is far better than 1.2 million.
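As a rough sketch of that reduction step: the snippet below uses a truncated SVD from SciPy rather than a true PCA, because it works directly on a sparse matrix without centering it (centering would make the data dense). The random 0/1 matrix, its size, and the choice of k = 50 components are assumptions made purely for illustration; the name data_pca is chosen only to match the snippet that follows.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Hypothetical stand-in for the real data: a small sparse 0/1 matrix.
X_demo = sparse.random(200, 5000, density=0.01, format="csr", random_state=0)
X_demo.data[:] = 1.0

# Truncated SVD (not mean-centered, so only an approximation of PCA).
k = 50  # assumed number of components; must be smaller than min(X_demo.shape)
U, s, Vt = svds(X_demo, k=k)

# Sort components by decreasing singular value before using them.
order = np.argsort(s)[::-1]
data_pca = U[:, order] * s[order]  # reduced data, shape (n_samples, k)
```

The resulting dense array has one row per sample and k columns, so any later neighbor search runs on k dimensions instead of 1.2 million.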

After that, you can compute precomputed_knn as follows:

import umap
from umap.umap_ import nearest_neighbors

precomputed_knn = nearest_neighbors(
        data_pca, n_neighbors = 3000, metric="euclidean",
        metric_kwds=None, angular=False, random_state=1)

Then:

umap.UMAP(precomputed_knn=precomputed_knn)

- Le Quang Nam

- Leland McInnes · Accepted Answer

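If you have 1.2 million features and only 4,700 samples, you are better off precomputing the full distance matrix and passing it in with metric="precomputed". Right now it is spending a great deal of effort computing nearest neighbors of those 1.2-million-dimensional vectors. Plain brute force will do much better.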
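A minimal sketch of this idea, assuming the data really is a sparse 0/1 matrix as described in the question: for binary rows, the squared Euclidean distance is |x| + |y| - 2 x·y, so the whole distance matrix falls out of one sparse product X Xᵀ. The helper names binary_euclidean_distances and embed_precomputed, and the small random demo matrix, are hypothetical and only there for illustration.

```python
import numpy as np
from scipy import sparse

def binary_euclidean_distances(X):
    """Pairwise Euclidean distances between the rows of a sparse 0/1 matrix."""
    X = sparse.csr_matrix(X, dtype=np.float32)
    gram = (X @ X.T).toarray()   # gram[i, j] = number of ones shared by rows i and j
    row_sums = gram.diagonal()   # for 0/1 data, ||x||^2 is the number of ones in the row
    sq = row_sums[:, None] + row_sums[None, :] - 2.0 * gram
    np.maximum(sq, 0.0, out=sq)  # guard against small negative values before sqrt
    return np.sqrt(sq)

def embed_precomputed(dist, n_neighbors=10):
    """Run UMAP on a precomputed distance matrix (needs umap-learn installed)."""
    import umap  # imported lazily so the distance code above runs without it
    return umap.UMAP(metric="precomputed", n_neighbors=n_neighbors).fit_transform(dist)

# Hypothetical small demo; the real matrix would be 4,700 x 1.2 million.
X_demo = sparse.random(50, 2000, density=0.01, format="csr", random_state=0)
X_demo.data[:] = 1.0
dist_demo = binary_euclidean_distances(X_demo)  # dense array, shape (50, 50)
```

With 4,700 samples the result is a 4,700 x 4,700 dense array, which is small compared with the available RAM; calling embed_precomputed(dist) would then hand it to UMAP.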