How to split data based on a column value in sklearn

Question


python machine-learning logistic-regression train-test-split smote


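I have a data file with the following columns:

'customer', 'calibrat' (calibration sample = 1; validation sample = 0), 'churn', 'churndep', 'revenue', 'mou',

The data file contains about 40,000 rows, of which 20,000 have calibrat equal to 1. I want to split this data into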

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X1 = data.loc[:, data.columns != 'churn']
y1 = data.loc[:, data.columns == 'churn']
os = SMOTE(random_state=0)
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.3, random_state=0)

What I want is for X1_train to contain the calibration data (calibrat = 1) and for X1_test to contain all of the validation data (calibrat = 0).

- Guest

Have you tried using

X1_train, X1_test, y1_train, y1_test = train_test_split(X1.loc[X1['calibrat']==1], y1.loc[X1['calibrat']!=1], test_size=0.3, random_state=0)

? - Exi

No, that doesn't work. - Guest

1 Answer


- yatu · Accepted Answer

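sklearn.model_selection offers several options besides train_test_split, and one of them addresses what you are asking for. In this case you can use GroupShuffleSplit, which, as described in the docs, yields randomized train/test indices that split the data according to a third-party provided group. This is useful when you are doing cross-validation and want to split into train and validation sets several times while ensuring the sets are separated by the group field. GroupKFold is also available for these cases.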
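As a rough illustration of the GroupKFold option mentioned above, here is a minimal sketch on made-up data. The `segment` column, the frame `df`, and the four-group layout are assumptions for the example only; they do not come from the question's data file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 12 rows spread evenly over 4 groups.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "revenue": rng.random(12),
    "mou": rng.random(12),
    "segment": np.repeat([0, 1, 2, 3], 3),  # hypothetical group column
})
y = pd.Series(rng.integers(0, 2, size=12), name="churn")

# n_splits must not exceed the number of distinct groups.
gkf = GroupKFold(n_splits=4)

fold_test_groups = []
for train_ix, test_ix in gkf.split(df, y, groups=df["segment"]):
    # split() yields positional indices, so use .iloc to select rows.
    train_groups = set(df["segment"].iloc[train_ix].tolist())
    test_groups = set(df["segment"].iloc[test_ix].tolist())
    # A group never appears on both sides of the same fold.
    assert train_groups.isdisjoint(test_groups)
    fold_test_groups.append(test_groups)
```

Each group is held out in exactly one fold, which is the property that makes GroupKFold suitable when rows from the same group must not leak between training and validation.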

So, following your example, here is what you could do.

Say you have:

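import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Random toy data using the column names from the question
cols = ['customer', 'calibrat', 'churn', 'churndep', 'revenue', 'mou']
X = pd.DataFrame(np.random.rand(10, 6), columns=cols)
X['calibrat'] = np.random.choice([0, 1], size=10)  # random 0/1 group label

print(X)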

   customer  calibrat     churn  churndep   revenue       mou
0  0.523571         1  0.394896  0.933637  0.232630  0.103486
1  0.456720         1  0.850961  0.183556  0.885724  0.993898
2  0.411568         1  0.003360  0.774391  0.822560  0.840763
3  0.148390         0  0.115748  0.089891  0.842580  0.565432
4  0.505548         0  0.370198  0.566005  0.498009  0.601986
5  0.527433         0  0.550194  0.991227  0.516154  0.283175
6  0.983699         0  0.514049  0.958328  0.005034  0.050860
7  0.923172         0  0.531747  0.026763  0.450077  0.961465
8  0.344771         1  0.332537  0.046829  0.047598  0.324098
9  0.195655         0  0.903370  0.399686  0.170009  0.578925

y = X.pop('churn')

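You can now instantiate GroupShuffleSplit and use it much as you would train_test_split. The only difference is that you also pass a groups column, whose values determine how X and y are split so that rows sharing a group value stay on the same side.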

gs = GroupShuffleSplit(n_splits=2, train_size=.7, random_state=42)
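To make the behaviour of a splitter configured this way more concrete, the sketch below runs it on a small made-up frame and inspects every split it produces. The names `demo` and `splitter` are hypothetical, and the data is invented for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy frame with a two-valued group column, as in the question.
demo = pd.DataFrame({
    "revenue": np.arange(10, dtype=float),
    "calibrat": [1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
})

splitter = GroupShuffleSplit(n_splits=2, train_size=0.7, random_state=42)

# y is optional for split(); only the groups decide the partition.
splits = list(splitter.split(demo, groups=demo["calibrat"]))

for train_ix, test_ix in splits:
    # The indices are positional, hence .iloc rather than .loc.
    train_vals = set(demo["calibrat"].iloc[train_ix].tolist())
    test_vals = set(demo["calibrat"].iloc[test_ix].tolist())
    print("train groups:", train_vals, "test groups:", test_vals)
```

With only two distinct group values, each split places one whole group in the training set and the other in the test set; train_size is applied to the number of groups, not the number of rows.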

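As mentioned, this is most convenient when you want several splits, typically for cross-validation. Here is an example of how to get the two sets (train and test) described in the question: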

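# Take the first of the n_splits train/test index pairs
train_ix, test_ix = next(gs.split(X, y, groups=X.calibrat))

# split() returns positional indices, so select with .iloc
X_train = X.iloc[train_ix]
y_train = y.iloc[train_ix]

X_test = X.iloc[test_ix]
y_test = y.iloc[test_ix]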

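This gives the output below. Which group ends up in the training set depends on the random shuffle; in this run the calibrat = 0 rows form X_train and the calibrat = 1 rows form X_test: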

print(X_train)

   customer  calibrat  churndep   revenue       mou
3  0.148390         0  0.089891  0.842580  0.565432
4  0.505548         0  0.566005  0.498009  0.601986
5  0.527433         0  0.991227  0.516154  0.283175
6  0.983699         0  0.958328  0.005034  0.050860
7  0.923172         0  0.026763  0.450077  0.961465
9  0.195655         0  0.399686  0.170009  0.578925

print(X_test)

   customer  calibrat  churndep   revenue       mou
0  0.523571         1  0.933637  0.232630  0.103486
1  0.456720         1  0.183556  0.885724  0.993898
2  0.411568         1  0.774391  0.822560  0.840763
8  0.344771         1  0.046829  0.047598  0.324098
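
If the calibrat = 1 rows must always be the training set and the calibrat = 0 rows the test set, as the question asks, a plain boolean mask is one possible alternative that does not rely on a shuffle. This is only a sketch: the frame `data` below is a made-up stand-in that borrows the question's column names, and `is_calibration` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data file.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "customer": np.arange(10),
    "calibrat": [1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
    "churn": rng.integers(0, 2, size=10),
    "revenue": rng.random(10),
    "mou": rng.random(10),
})

# True for calibration rows, False for validation rows.
is_calibration = data["calibrat"] == 1

X = data.drop(columns="churn")
y = data["churn"]

# Calibration rows become the training set, validation rows the test set.
X_train, y_train = X[is_calibration], y[is_calibration]
X_test, y_test = X[~is_calibration], y[~is_calibration]
```

Because the split is determined entirely by the column value, it is reproducible without a random_state, and any resampling step such as SMOTE can then be applied to X_train and y_train only.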