I just tried to use IncrementalPCA from sklearn.decomposition, but it threw a MemoryError just like PCA and RandomizedPCA did. My problem is that the matrix I am trying to load is too big to fit into RAM. It is currently stored as a dataset of shape ~(1000000, 1000) in an hdf5 database, so I have 1,000,000,000 float32 values. I assumed IncrementalPCA would load the data in batches, but apparently it tries to load the entire dataset, which does not help. How is this library meant to be used? Is the hdf5 format the problem?
from sklearn.decomposition import IncrementalPCA
import h5py
db = h5py.File("db.h5","r")
data = db["data"]
IncrementalPCA(n_components=10, batch_size=1).fit(data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/decomposition/incremental_pca.py", line 165, in fit
X = check_array(X, dtype=np.float)
File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 337, in check_array
array = np.atleast_2d(array)
File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/shape_base.py", line 99, in atleast_2d
ary = asanyarray(ary)
File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/numeric.py", line 514, in asanyarray
return array(a, dtype, copy=False, order=order, subok=True)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
File "/software/anaconda/2.3.0/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 640, in __array__
arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
MemoryError
Thanks for your help.
h5['data'] seems to behave like a regular numpy array, but it is not one. IncrementalPCA does not know that it is an on-disk data structure, and at some point it reads all of its rows into memory (MemoryError!). The computation itself is still performed in batch_size-sized batches. - sastanin
In fit(), it calls check_array(), which is supposed to convert the data into a regular numpy array (https://github.com/scikit-learn/scikit-learn/blob/0.16.1/sklearn/utils/validation.py#L268). Calling partial_fit() instead bypasses this conversion. - sastanin