正如评论中所提到的,所选答案未保留数据的类分布。scikit-learn文档指出,如果需要,则应使用StratifiedShuffleSplit。这可以通过将目标数组传递给分层选项来使用train_test_split
方法实现。
>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> X, y = datasets.load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)
>>>
>>> print(np.unique(y, return_counts=True))
(array([0, 1, 2]), array([50, 50, 50], dtype=int64))
>>> print(np.unique(y_test, return_counts=True))
(array([0, 1, 2]), array([16, 17, 17], dtype=int64))
>>> print(np.unique(y_train, return_counts=True))
(array([0, 1, 2]), array([34, 33, 33], dtype=int64))
R
的行为? - WestCoastProjects