scikit-learn中的PCA投影和重构

Question

scikit-learn中的PCA投影和重构

31

我可以通过以下代码在scikit中进行PCA： X_train有279180行和104列。

from sklearn.decomposition import PCA
pca = PCA(n_components=30)
X_train_pca = pca.fit_transform(X_train)

现在，当我想要将特征向量投影到特征空间时，我必须执行以下操作：

""" Projection """
comp = pca.components_ #30x104
com_tr = np.transpose(pca.components_) #104x30
proj = np.dot(X_train,com_tr) #279180x104 * 104x30 = 297180x30

但我在犹豫这一步，因为Scikit的文档说：

components_: array, [n_components, n_features]

在特征空间中的主要轴，代表数据中方差最大的方向。

对我来说，它似乎已经被投影了，但是当我检查源代码时，它只返回特征向量。

那么如何正确地进行投影呢？

我的最终目标是计算重建的均方误差。

""" Reconstruct """
recon = np.dot(proj,comp) #297180x30 * 30x104 = 279180x104

"""  MSE Error """
print "MSE = %.6G" %(np.mean((X_train - recon)**2))

- HonzaB

2个回答

6

在 @eickenberg 的帖子上补充一下，以下是如何对数字图像进行主成分分析重建：

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn import decomposition

n_components = 10
image_shape = (8, 8)

digits = load_digits()
digits = digits.data

n_samples, n_features = digits.shape
estimator = decomposition.PCA(n_components=n_components, svd_solver='randomized', whiten=True)
digits_recons = estimator.inverse_transform(estimator.fit_transform(digits))

# show 5 randomly chosen digits and their PCA reconstructions with 10 dominant eigenvectors
indices = np.random.choice(n_samples, 5, replace=False)
plt.figure(figsize=(5,2))
for i in range(len(indices)):
    plt.subplot(1,5,i+1), plt.imshow(np.reshape(digits[indices[i],:], image_shape)), plt.axis('off')
plt.suptitle('Original', size=25)
plt.show()
plt.figure(figsize=(5,2))
for i in range(len(indices)):
    plt.subplot(1,5,i+1), plt.imshow(np.reshape(digits_recons[indices[i],:], image_shape)), plt.axis('off')
plt.suptitle('PCA reconstructed'.format(n_components), size=25)
plt.show()

- Sandipan Dey

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- eickenberg · Accepted Answer

你可以做

proj = pca.inverse_transform(X_train_pca)

这样，您就不必担心如何进行乘法运算。

使用 pca.fit_transform 或 pca.transform 后得到的是通常被称为每个样本的“载荷”，也就是使用components_（特征空间中的主轴）的线性组合最佳描述该样本所需的各个成分的量。

您所追求的投影在原始信号空间中。这意味着您需要使用成分和载荷返回信号空间。

因此，这里需要澄清三个步骤。以下是使用 PCA 对象可以执行的操作以及实际计算方式：

pca.fit 估计组件（使用中心化的 Xtrain 上的奇异值分解）：

 from sklearn.decomposition import PCA
 import numpy as np
 from numpy.testing import assert_array_almost_equal

 #Should this variable be X_train instead of Xtrain?
 X_train = np.random.randn(100, 50)

 pca = PCA(n_components=30)
 pca.fit(X_train)

 U, S, VT = np.linalg.svd(X_train - X_train.mean(0))

 assert_array_almost_equal(VT[:30], pca.components_)

pca.transform会按照您所描述的方式计算载荷

 X_train_pca = pca.transform(X_train)

 X_train_pca2 = (X_train - pca.mean_).dot(pca.components_.T)

 assert_array_almost_equal(X_train_pca, X_train_pca2)

pca.inverse_transform 可以获得在您感兴趣的信号空间中的成分投影。

 X_projected = pca.inverse_transform(X_train_pca)
 X_projected2 = X_train_pca.dot(pca.components_) + pca.mean_

 assert_array_almost_equal(X_projected, X_projected2)

您现在可以评估投影损失。

loss = np.sum((X_train - X_projected) ** 2, axis=1).mean()