Scikit-learn train_test_split: can I ensure the same split on different datasets?

Question

scikit-learn

14

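I understand that train_test_split randomly splits a dataset into a training set and a test set, and that passing random_state=int makes the split of that dataset the same on every call.

My problem is slightly different.

I have two datasets, A and B. They contain the same set of examples, in the same order, but each dataset describes those examples with a different set of features.

I want to test whether the features used in A lead to better performance than the features used in B. For the comparison to be meaningful, calling train_test_split on A and on B must give the same split of examples in both datasets.

Is this possible? Do I simply need to make sure random_state is the same in both calls?

Thanks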

- Ziqi

1

One option is to save the indices produced by the split and then reuse them. - Vivek Kumar
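The index-based approach from the comment above could be sketched as follows. This is only a minimal illustration: the arrays A, B and y are made-up stand-ins for the two datasets and their labels, and splitting an explicit index array is one possible way to obtain reusable indices, not necessarily what the commenter had in mind.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: the same 200 examples described by different features.
rng = np.random.default_rng(0)
A = rng.random((200, 5))
B = rng.random((200, 8))
y = rng.integers(0, 2, size=200)

# Split the row indices once instead of splitting the data itself.
indices = np.arange(len(y))
train_idx, test_idx = train_test_split(indices, test_size=0.1, random_state=42)

# Reuse the same indices for every dataset and for the labels.
A_train, A_test = A[train_idx], A[test_idx]
B_train, B_test = B[train_idx], B[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

Because the indices are stored explicitly, they can also be written to disk and reused in a later session.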

4 Answers

8

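Looking at the code of train_test_split, when an integer random_state is passed the function seeds its random number generator on every call, so the result is the same each time. We can easily verify that this works.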

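import numpy as np
from sklearn import model_selection

X1 = np.random.random((200, 5))
X2 = np.random.random((200, 5))
y = np.arange(200)

# Split both feature matrices with the same random_state.
X1_train, X1_test, y1_train, y1_test = model_selection.train_test_split(X1, y,
                                                                        test_size=0.1,
                                                                        random_state=42)
X2_train, X2_test, y2_train, y2_test = model_selection.train_test_split(X2, y,
                                                                        test_size=0.1,
                                                                        random_state=42)

# y holds the original row numbers, so equal y means the same rows were selected.
print(np.all(y1_train == y2_train))
print(np.all(y1_test == y2_test))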

Output:

True
True

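Which is great! Another way to approach this is to create a single train/test split on all of your features and then separate the features before training. However, if you are in the odd situation where you need to do both at once (sometimes, with similarity matrices, you cannot separate them afterwards if you don't want test features in your training set), you can use StratifiedShuffleSplit to return the indices of the data that belong to each set. For example: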

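from sklearn import model_selection

# X is the feature matrix and y the class labels (each class needs at least two samples).
n_splits = 1
sss = model_selection.StratifiedShuffleSplit(n_splits=n_splits,
                                             test_size=0.1,
                                             random_state=42)
train_idx, test_idx = list(sss.split(X, y))[0]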
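The first alternative mentioned in this answer, splitting once on all features and separating the feature groups afterwards, might look like the sketch below. The arrays A, B and y are hypothetical placeholders, and stacking the columns with np.hstack is an assumption about how the two feature sets would be combined.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature sets for the same 200 examples.
rng = np.random.default_rng(0)
A = rng.random((200, 5))
B = rng.random((200, 3))
y = rng.integers(0, 2, size=200)

# Stack the two feature sets side by side and split only once.
X_all = np.hstack((A, B))
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y, test_size=0.1, random_state=42
)

# Separate the feature groups again by column position.
n_a = A.shape[1]
A_train, B_train = X_train[:, :n_a], X_train[:, n_a:]
A_test, B_test = X_test[:, :n_a], X_test[:, n_a:]
```

This only works when both feature sets can be held as dense columns of one array; otherwise reusing indices is the more general route.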

- piman314

2

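Because sklearn.model_selection.train_test_split(*arrays, **options) accepts a variable number of arrays, you can do this:

from sklearn.model_selection import train_test_split

# Each input array yields a (train, test) pair, in the order the arrays are passed.
A_train, A_test, B_train, B_test, y_train, y_test = train_test_split(A, B, y,
                                                                     test_size=0.33,
                                                                     random_state=42)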
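To check that a single call really keeps the rows aligned, here is a small sketch with made-up data. The first column of each hypothetical dataset stores the original row number, so the two splits can be compared directly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 12
row_id = np.arange(n)
# Hypothetical datasets whose first column records the original row number.
A = np.column_stack((row_id, row_id * 10))
B = np.column_stack((row_id, row_id * 100, row_id * 1000))
y = row_id % 2

A_train, A_test, B_train, B_test, y_train, y_test = train_test_split(
    A, B, y, test_size=0.33, random_state=42
)

# The same original rows land on the same side of the split in A and in B.
print(np.array_equal(A_train[:, 0], B_train[:, 0]))
print(np.array_equal(A_test[:, 0], B_test[:, 0]))
```

All arrays passed in one call must have the same number of rows, which is exactly the situation described in the question.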

- torayeff

0

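As mentioned above, you can use the random_state parameter. However, if you want the same results globally, i.e. to set the random state for all future calls, you can seed NumPy's global generator:

import numpy as np

np.random.seed(42)  # any fixed integer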
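One caveat is worth illustrating. As far as I understand scikit-learn's behaviour, the global seed only matters when random_state is left as None, and the generator advances with every call, so the seed has to be reset right before each split that should match. A minimal sketch with a made-up array X:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # hypothetical data

# With random_state left as None, the split draws from NumPy's global RNG,
# so the seed is reset immediately before each call.
np.random.seed(42)
first_train, first_test = train_test_split(X, test_size=0.25)

np.random.seed(42)
second_train, second_test = train_test_split(X, test_size=0.25)

# Compare the two test sets row by row.
print(np.array_equal(first_test, second_test))
```

Passing random_state explicitly is usually less fragile, because any other code that consumes random numbers between the two calls would change the result of the global approach.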

- Alex Ferguson

Content provided by Stack Overflow.

- eqzx · Accepted Answer

Yes, using the same random_state is enough.

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X2 = np.hstack((X, X))
>>> X_train, X_test, _, _ = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train2, X_test2, _, _ = train_test_split(X2, y, test_size=0.33, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> X_train2
array([[4, 5, 4, 5],
       [0, 1, 0, 1],
       [6, 7, 6, 7]])
>>> X_test
array([[2, 3],
       [8, 9]])
>>> X_test2
array([[2, 3, 2, 3],
       [8, 9, 8, 9]])