使用sklearn的split方法减少内存使用的另一种方法是生成X的索引向量并在此向量上进行分割。之后,您可以选择条目并将训练和测试拆分写入磁盘。
import h5py
import numpy as np
from sklearn.cross_validation import train_test_split
X = np.random.random((10000,70000))
Y = np.random.random((10000,))
x_ids = list(range(len(X)))
x_train_ids, x_test_ids, Y_train, Y_test = train_test_split(x_ids, Y, test_size = 0.33, random_state=42)
f = h5py.File('dataset/train.h5py', 'w')
f.create_dataset(f"inputs", data=X[x_train_ids], dtype=np.int)
f.create_dataset(f"labels", data=Y_train, dtype=np.int)
f.close()
f = h5py.File('dataset/test.h5py', 'w')
f.create_dataset(f"inputs", data=X[x_test_ids], dtype=np.int)
f.create_dataset(f"labels", data=Y_test, dtype=np.int)
f.close()
f = h5py.File('dataset/train.h5py', 'r')
X_train = np.array(f.get('inputs'), dtype=np.int)
Y_train = np.array(f.get('labels'), dtype=np.int)
f.close()
f = h5py.File('dataset/test.h5py', 'r')
X_test = np.array(f.get('inputs'), dtype=np.int)
Y_test = np.array(f.get('labels'), dtype=np.int)
f.close()