Group/Cluster K-Fold Cross-Validation with Sklearn

Question

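I need to run K-fold cross-validation on some models, but I need to make sure the validation (test) set is clustered together by group and by t years. GroupKFold comes close, but it still splits up the validation set (see the second fold).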

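For example, suppose I have a dataset covering 2000-2008 and I want to K-fold it into 3 groups. The appropriate sets would be: validation: 2000-2002, training: 2003-2008; V: 2003-2005, T: 2000-2002 and 2006-2008; and V: 2006-2008, T: 2000-2005.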

Is there a way to group and cluster the data with K-fold CV so that each validation set is clustered by t years?

from sklearn.model_selection import GroupKFold

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10, 0.1, 0.2, 2.2]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "a", "b", "b"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]

gkf = GroupKFold(n_splits=3)  # three folds, matching the output below
for train_index, test_index in gkf.split(X, y, groups=groups):
    print("Train:", train_index, "Validation:", test_index)

Output:

Train: [ 0  1  2  3  4  5 10 11 12] Validation: [6 7 8 9]
Train: [3 4 5 6 7 8 9] Validation: [ 0  1  2 10 11 12]
Train: [ 0  1  2  6  7  8  9 10 11 12] Validation: [3 4 5]

Desired output (assuming each group is 2 years):

Train: [ 7 8 9 10 11 12 ] Validation: [0 1 2 3 4 5 6]
Train: [0 1 2 10 11 12 ] Validation: [ 3 4 5 6 7 8 9 ]
Train: [ 0  1  2  3 4 5 ] Validation: [6 7 8 9 10 11 12]

That said, the test and training subsets do not have to be selected sequentially, and more years could be chosen per group.

- Vedda

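I don't see how your groups list relates to your desired output, or how the 9 years from 2000 to 2008 that you mentioned earlier connect to it. Maybe it's just me, but I don't really understand the relationship between input and output, or what your goal is. - Merlin1896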

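@Merlin1896 In the desired output I chose groups 1 and 2, 2 and 3, and 3 and 4 for validation. I then want to train on the rest, so groups 3 and 4, 1 and 4, and 1 and 2. Your answer picks only one group as the validation set, whereas I want two (or more in a larger dataset). Your idea is right; I just want to select clustered groups, for example two years. - Vedda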

But why does index 6 appear three times in the test sets and never in the training sets? I assume that's a typo? If so, please see my edited answer. - Merlin1896

1 Answer

- Merlin1896 · Accepted Answer

I hope I understood you correctly.

The LeaveOneGroupOut splitter from scikit-learn's model_selection module might help:

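Say you assign group label 0 to all data points from 2000-2002, label 1 to all data points between 2003 and 2005, and label 2 to the data from 2006-2008. You can then create train and test splits as follows, where each of the three test splits is built from one of the three groups: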

from sklearn.model_selection import LeaveOneGroupOut
import numpy as np
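# One label per sample; labels 1-3 are used here, but any three distinct labels behave the same.
groups = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3]
X = np.random.random(len(groups))
y = np.random.randint(0, 4, len(groups))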

logo = LeaveOneGroupOut()
print("n_splits=", logo.get_n_splits(X,y,groups))
for train_index, test_index in logo.split(X, y, groups):
    print("train_idx:", train_index, "test_idx:", test_index)

Output:

n_splits= 3
train_idx: [ 4  5  6  7  8  9 10 11 12 13 14 15 16 17] test_idx: [0 1 2 3]
train_idx: [ 0  1  2  3 10 11 12 13 14 15 16 17] test_idx: [4 5 6 7 8 9]
train_idx: [0 1 2 3 4 5 6 7 8 9] test_idx: [10 11 12 13 14 15 16 17]
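
The labeling step described above (one label per three-year block) can be sketched roughly as follows. This is only an illustration under assumptions: the `years` array, `block_size`, and the placeholder `X` and `y` are hypothetical and not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical data: one observation per year, 2000-2008.
years = np.arange(2000, 2009)
block_size = 3  # assumed number of consecutive years per validation block

# Integer-divide the offset from the first year to get block labels 0, 1, 2.
groups = (years - years.min()) // block_size

X = np.zeros((len(years), 1))  # placeholder features
y = np.zeros(len(years))       # placeholder targets

# Each split holds out exactly one block of consecutive years.
splits = list(LeaveOneGroupOut().split(X, y, groups))
for train_idx, test_idx in splits:
    print("train years:", years[train_idx], "validation years:", years[test_idx])
```

With real data the same integer division would presumably be applied to a year column, so every row in the same block shares a label.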

Edit

I think I finally understand what you want. Sorry it took me so long.

I don't think the split method you want is implemented in sklearn yet. But we can easily subclass BaseCrossValidator.

import numpy as np
from sklearn.model_selection import BaseCrossValidator
from sklearn.utils.validation import check_array

class GroupOfGroups(BaseCrossValidator):
    def __init__(self, group_of_groups):
        """
        :param group_of_groups: list of length n_splits. Each entry is a list of
            group ids taken from set(groups). In each of the n_splits splits, the
            groups in the current sublist are used for validation.
        """
        self.group_of_groups = group_of_groups

    def get_n_splits(self, X=None, y=None, groups=None):
        return len(self.group_of_groups)

    def _iter_test_masks(self, X=None, y=None, groups=None):
        if groups is None:
            raise ValueError("The 'groups' parameter should not be None.")
        groups = check_array(groups, copy=True, ensure_2d=False, dtype=None)
        for g in self.group_of_groups:
            # Use the builtin bool; the np.bool alias is unavailable in some NumPy releases.
            test_index = np.zeros(len(groups), dtype=bool)
            for g_id in g:
                test_index[groups == g_id] = True
            yield test_index

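Usage is straightforward. As before, we define X, y and groups. In addition, we define a list of lists (a group of groups) that specifies which groups should be used together in which test fold.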

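So g_of_g=[[1,2],[2,3],[3,4]] means that in the first CV split, groups 1 and 2 are used as the test set while the remaining groups 3 and 4 are used for training. In the second split, data from groups 2 and 3 is used as the test set, and so on.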
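
If the group ids are ordered (years, for instance), such a list of adjacent groups does not have to be typed by hand. Below is a minimal sketch; `sliding_group_windows` and its `window` parameter are hypothetical names introduced here for illustration and are not part of sklearn or of the class above.

```python
import numpy as np

def sliding_group_windows(groups, window):
    """Return lists of `window` adjacent group ids, sliding one id at a time."""
    ids = np.unique(groups).tolist()  # sorted unique group ids as Python scalars
    # Require window < number of groups so that something is left for training.
    if not 1 <= window < len(ids):
        raise ValueError("window must be at least 1 and smaller than the number of groups")
    return [ids[i:i + window] for i in range(len(ids) - window + 1)]

groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
g_of_g = sliding_group_windows(groups, window=2)
print(g_of_g)  # [[1, 2], [2, 3], [3, 4]]
```

Note that the windows overlap, so a group can appear in more than one test fold, just as in the hand-written g_of_g.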

I'm not too happy with the name "GroupOfGroups"; maybe you can come up with a better one.

Now we can test this cross-validator:

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10, 0.1, 0.2, 2.2]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "a", "b", "b"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
g_of_g = [[1,2],[2,3],[3,4]]
gg = GroupOfGroups(g_of_g)
print("n_splits=", gg.get_n_splits(X,y,groups))
for train_index, test_index in gg.split(X, y, groups):
    print("train_idx:", train_index, "test_idx:", test_index)

Output:

n_splits= 3
train_idx: [ 6  7  8  9 10 11 12] test_idx: [0 1 2 3 4 5]
train_idx: [ 0  1  2 10 11 12] test_idx: [3 4 5 6 7 8 9]
train_idx: [0 1 2 3 4 5] test_idx: [ 6  7  8  9 10 11 12]

Please keep in mind that I did not include many checks and did not test this thoroughly, so verify carefully that it works for you.
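
One way to do that verification, sketched here under assumptions, is to build the same splits by hand and check them before handing them to a model. The sketch relies on the fact that the cv argument of sklearn's cross_val_score also accepts an iterable of (train, test) index arrays. The DummyClassifier is only a stand-in estimator chosen for illustration, and X is reshaped to 2D because estimators expect a feature matrix.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

X = np.array([0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10, 0.1, 0.2, 2.2]).reshape(-1, 1)
y = np.array(["a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "a", "b", "b"])
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4])
g_of_g = [[1, 2], [2, 3], [3, 4]]

splits = []
for test_groups in g_of_g:
    test_mask = np.isin(groups, test_groups)  # True for samples in the held-out groups
    train_idx = np.flatnonzero(~test_mask)
    test_idx = np.flatnonzero(test_mask)
    # Sanity checks: both sides are non-empty and share no group.
    assert len(train_idx) > 0 and len(test_idx) > 0
    assert not set(groups[train_idx]) & set(groups[test_idx])
    splits.append((train_idx, test_idx))

# Stand-in estimator (an assumption); swap in the real model here.
scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=splits)
print(scores)
```

The index arrays produced this way should match the train_idx/test_idx lines printed above, which gives a quick cross-check of the custom splitter.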