Stratified sampling in pandas based on multiple columns

Question

8

I have a pandas DataFrame that looks like this:

| Cliid | Segment | Insert |
|-------|---------|--------|
| 001   | A       | 0      |
| 002   | A       | 0      |
| 003   | C       | 0      |
| 004   | B       | 1      |
| 005   | A       | 0      |
| 006   | B       | 0      |

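I would like to split it into 2 groups so that the composition of every variable in [Segment, Insert] is the same in both groups. For example, each group would hold 1/2 of the observations belonging to Segment A, 1/6 of those with Insert = 1, and so on.

I have looked at this answer, but it only covers stratification on a single variable and does not handle several variables.

R has this function that does exactly this, but I cannot use R.

By the way, I am using Python 3.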

- arthur

1 Answer

- Jannik · Accepted Answer

11

You can use sklearn's train_test_split function with its stratify parameter, which specifies the columns to stratify on.

For example:

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.2, stratify=df[["Segment", "Insert"]])
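
To make the call above concrete for the "two equal groups" case in the question, here is a minimal sketch. The data is hypothetical (not the asker's): it is built so that every (Segment, Insert) combination occurs 4 times, and `test_size=0.5` and `random_state=0` are illustrative choices rather than part of the original answer.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: every (Segment, Insert) combination appears 4 times,
# so each combination can be divided evenly between the two halves.
df = pd.DataFrame({
    "Cliid": [f"{i:03d}" for i in range(1, 17)],
    "Segment": ["A"] * 8 + ["B"] * 8,
    "Insert": [0, 0, 0, 0, 1, 1, 1, 1] * 2,
})

# Passing a two-column DataFrame stratifies on the joint
# (Segment, Insert) combination, not on each column separately.
group_a, group_b = train_test_split(
    df,
    test_size=0.5,
    stratify=df[["Segment", "Insert"]],
    random_state=0,
)

# Count rows per (Segment, Insert) combination in each half.
counts_a = group_a.groupby(["Segment", "Insert"]).size()
counts_b = group_b.groupby(["Segment", "Insert"]).size()
print(counts_a)
print(counts_b)
```

Because the stratification is on the joint combination, matching combination counts also imply matching per-column proportions in the two halves.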

- Jannik

stratify on multiple columns does not seem to work when there is no target variable. When I run your code I get ValueError: Found input variables with inconsistent numbers of samples: [6, 1]. If I remove stratify, it works. - arthur

3

Sorry, my mistake. I changed stratify=[["Segment", "Insert"]] to stratify=df[["Segment", "Insert"]]. - Jannik

1

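Why do I get this ValueError: "The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2"? The data has a binary target. The error appears when I pass all the string columns to stratify, but passing only the target column works, and removing stratify also works. Should I therefore drop stratify, and would the split then still be stratified across all columns? - hanzgs

Is this approach any good? https://dev59.com/E1cO5IYBdhLWcg3wny6-#51525992 - hanzgs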
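
The "least populated class" error quoted in the comments can be diagnosed before splitting by counting how many rows fall into each combination of the stratification columns. The sketch below does this for the six sample rows from the question. The threshold of 2 comes from the error message itself; the variable names are illustrative.

```python
import pandas as pd

# The six sample rows from the question.
df = pd.DataFrame({
    "Cliid": ["001", "002", "003", "004", "005", "006"],
    "Segment": ["A", "A", "C", "B", "A", "B"],
    "Insert": [0, 0, 0, 1, 0, 0],
})

# Number of rows in each (Segment, Insert) combination.
combo_sizes = df.groupby(["Segment", "Insert"]).size()

# Combinations with fewer than 2 rows cannot be represented in both
# halves, which is the situation the error message describes.
too_small = combo_sizes[combo_sizes < 2]
print(too_small)
```

Here three of the four combinations contain a single row, so a stratified split on both columns is not possible for this tiny sample. With real data, possible options (assumptions, not something the thread confirms) are to merge rare levels before stratifying or to stratify on fewer columns.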