Stratified train/test split in scikit-learn

Question

130

I need to split my data into a training set (75%) and a test set (25%). I currently do that with the code below:

X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo)

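However, I'd like to stratify my training dataset. How do I do that? I've been looking into the StratifiedKFold method, but it doesn't let me specify the 75%/25% split and stratify only the training dataset.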

- pir

9 Answers

38

You can do this easily with the train_test_split() function available in scikit-learn:

from sklearn.model_selection import train_test_split 
train, test = train_test_split(X, test_size=0.25, stratify=X['YOUR_COLUMN_LABEL'])

I have also prepared a short GitHub Gist that shows how the stratify option works:

https://gist.github.com/SHi-ON/63839f3a3647051a180cb03af0f7d0d9
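
For readers who cannot open the Gist, here is a minimal sketch of the same idea. It is not taken from the Gist; the toy DataFrame and the column names "feature" and "label" are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 80 rows of class "a" and 20 rows of class "b".
df = pd.DataFrame({
    "feature": range(100),
    "label": ["a"] * 80 + ["b"] * 20,
})

train, test = train_test_split(
    df, test_size=0.25, stratify=df["label"], random_state=0
)

# Class shares in each part; both should match the 80% / 20% of the full frame.
train_share = train["label"].value_counts(normalize=True)
test_share = test["label"].value_counts(normalize=True)
print(train_share, test_share, sep="\n")
```

Passing the label column to stratify is what keeps the class shares equal across the two parts; without it the split is purely random.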

- Shayan Amani

33

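TL;DR: Use StratifiedShuffleSplit with test_size=0.25.

Scikit-learn provides two classes for stratified splitting:

StratifiedKFold: This class is useful as a direct k-fold cross-validation operator: it sets up n_folds training/testing sets such that the classes are equally balanced in both.

Here's some code (taken directly from the documentation above; it uses the sklearn.cross_validation API that predates version 0.18):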

>>> skf = cross_validation.StratifiedKFold(y, n_folds=2) #2-fold cross validation
>>> len(skf)
2
>>> for train_index, test_index in skf:
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
...    #fit and predict with X_train/test. Use accuracy metrics to check validation performance

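StratifiedShuffleSplit: This class creates a single training/testing set with equally balanced (stratified) classes. Essentially, this is what you want with n_iter=1. You can specify the test size here the same way as in train_test_split.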

>>> sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
>>> len(sss)
1
>>> for train_index, test_index in sss:
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
>>> # fit and predict with your classifier using the above X/y train/test

- tangy

6

Note that as of 0.18.x, n_iter should be n_splits for StratifiedShuffleSplit, and the API is slightly different as well: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html - lollercoaster

3

If y is a pandas Series, use y.iloc[train_index], y.iloc[test_index]. - Owlright

1

@Owlright I tried using a pandas DataFrame, and the indices that StratifiedShuffleSplit returns are not the indices in the DataFrame. DataFrame index: 2, 3, 5; the first split in sss: [(array([2, 1]), array([0]))] :( - Meghna Natraj
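
A small sketch of what the two comments above describe, using the current sklearn.model_selection API. The frame and its index labels are made up for illustration. The arrays produced by split() are positions, not index labels, which is why .iloc is the matching accessor.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical frame whose index labels (2, 3, 5, ...) are not 0..n-1.
df = pd.DataFrame(
    {"x": [10, 11, 12, 13, 14, 15, 16, 17], "y": [0, 0, 0, 0, 1, 1, 1, 1]},
    index=[2, 3, 5, 7, 11, 13, 17, 19],
)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)

for train_pos, test_pos in sss.split(df[["x"]], df["y"]):
    # The arrays hold positions in 0..len(df)-1, so select rows with .iloc.
    train_df = df.iloc[train_pos]
    test_df = df.iloc[test_pos]
    # If the original index labels are needed, map positions through df.index.
    test_labels = df.index[test_pos]
```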

2

@tangy Why the for loop? When the line X_train, X_test = X[train_index], X[test_index] is invoked, doesn't it overwrite X_train and X_test? Why not just a single next(sss)? - Bartek Wójcik

If you run into the error "TypeError: 'StratifiedShuffleSplit' object is not iterable", this post may help: https://dev59.com/eLDla4cB1Zd3GeqP_60k - DnVS
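
To illustrate the last two comments with the current sklearn.model_selection API (a sketch on made-up arrays, not the answer author's code): the splitter object itself is not iterable, but its split() method returns a generator, so next() is enough when only one split is wanted.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up data: 20 samples, two balanced classes.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)

# Iterating over the splitter directly raises TypeError in this API.
try:
    iter(sss)
    splitter_is_iterable = True
except TypeError:
    splitter_is_iterable = False

# split() returns a generator, so a single split needs no for loop.
train_index, test_index = next(sss.split(X, y))
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
```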

20

Here is an example for continuous/regression data (until this issue on GitHub is resolved).

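import numpy as np
from sklearn.model_selection import train_test_split

# Avoid shadowing the built-in min/max.
y_min = np.amin(y)
y_max = np.amax(y)

# 5 bins may be too few for larger datasets.
bins     = np.linspace(start=y_min, stop=y_max, num=5)
y_binned = np.digitize(y, bins, right=True)

# Stratify on the bin labels instead of the continuous target.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y_binned
)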

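Here start is the minimum and stop is the maximum of your continuous target.
If you don't set right=True, the maximum value will more or less end up in a bin of its own, and the split will always fail because that extra bin contains too few samples.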
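
It can help to count the samples per bin before stratifying. The sketch below uses a made-up target (the 100 values 0.0 to 99.0) and is not part of the original answer. It shows one caveat worth checking on your own data: with right=True, the minimum value is assigned to bin 0, so if the minimum occurs only once the same problem can reappear at the other end. Digitizing against the interior edges only is one possible way around it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical continuous target: the 100 values 0.0, 1.0, ..., 99.0.
y = np.arange(100, dtype=float)
X = y.reshape(-1, 1)

edges = np.linspace(y.min(), y.max(), num=5)

# Default (right=False): only the maximum lands in the last bin.
counts_default = np.bincount(np.digitize(y, edges))
# right=True: the maximum joins the top bin, but the minimum gets bin 0.
counts_right = np.bincount(np.digitize(y, edges, right=True))
# Interior edges only: both extremes fall inside the outer bins.
y_binned = np.digitize(y, edges[1:-1])
counts_interior = np.bincount(y_binned)

print(counts_default, counts_right, counts_interior)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y_binned, random_state=0
)
```

train_test_split refuses to stratify when any class has a single member, so a bin count of 1 in the printed arrays is the thing to look for.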

- Jordan

6

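In addition to the accepted answer by @Andreas Mueller, I just want to add, as @tangy mentioned above:

StratifiedShuffleSplit most closely resembles train_test_split(stratify=y), with these added features:

it stratifies by default
by specifying n_splits, it splits the data repeatedly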
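
A minimal sketch of the second point, on made-up data with an 80/20 class ratio: each of the n_splits iterations draws a new random 75/25 split, and every test set keeps the class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up data: 80 samples of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

test_class_counts = []
for train_index, test_index in sss.split(X, y):
    # Each iteration is a fresh random 75/25 split with the 80/20 ratio kept.
    test_class_counts.append(np.bincount(y[test_index]).tolist())

print(test_class_counts)
```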

- Max

2

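StratifiedShuffleSplit is applied after we choose the column that should be evenly represented in all the smaller datasets we are about to generate. "The folds are made by preserving the percentage of samples for each class."

Suppose we have a dataset 'data' with a column 'season', and we want an even representation of 'season'. It then looks like this: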

from sklearn.model_selection import StratifiedShuffleSplit
sss=StratifiedShuffleSplit(n_splits=1,test_size=0.25,random_state=0)

for train_index, test_index in sss.split(data, data["season"]):
    sss_train = data.iloc[train_index]
    sss_test = data.iloc[test_index]

- Itay Guy

1

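As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportion of examples in each class as in the original dataset. This is called a stratified train-test split.

We can achieve this by setting the "stratify" argument to the y component of the original dataset. The train_test_split() function will use it to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided "y" array.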
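
As a quick check of that claim, here is a sketch on made-up labels (60 samples of class 0, 40 of class 1). Since the labels are 0/1, the mean of each array is the share of class 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 samples of class 0 and 40 of class 1.
y = np.array([0] * 60 + [1] * 40)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Share of class 1 in the full data and in each part.
print(y.mean(), y_train.mean(), y_test.mean())
```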

- dev guy

0

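from sklearn.model_selection import train_test_split

# train_size is 1 - tst_size - vld_size (all fractions of the full dataset)
tst_size = 0.15
vld_size = 0.15

# df is assumed to be a DataFrame whose target column is named "y".
X_train_test, X_valid, y_train_test, y_valid = train_test_split(
    df.drop(columns="y"), df["y"],
    test_size=vld_size, stratify=df["y"], random_state=13903)

# Rescale so the test set is tst_size of the full dataset, not of the remainder.
X_train, X_test, y_train, y_test = train_test_split(
    X_train_test, y_train_test,
    test_size=tst_size / (1 - vld_size), stratify=y_train_test, random_state=13903)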

- José Carlos Castro

0

Updating @tangy's answer for the current version of scikit-learn, 0.23.2 (StratifiedShuffleSplit documentation).

from sklearn.model_selection import StratifiedShuffleSplit

n_splits = 1  # We only want a single split in this case
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.25, random_state=0)

for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

- Roei Bahumi

- Andreas Mueller · Accepted Answer

[update for 0.17]

See the docs of sklearn.model_selection.train_test_split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y, 
                                                    test_size=0.25)

[/update for 0.17]

There's a pull request here. But if you want, you can simply do train, test = next(iter(StratifiedKFold(...))) and use the train and test indices.
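
With the current sklearn.model_selection API, the same trick might look like the sketch below (made-up data; shuffle and random_state are illustrative choices). Using 4 folds makes the first test fold about 25% of the samples, which matches the 75%/25% split asked for.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up data: 80 samples of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# With 4 folds, each test fold holds about 25% of the samples.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
train_index, test_index = next(skf.split(X, y))

X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
```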