XGBoost Python - classifier class weight option?

Question

scikit-learn xgboost

20

Is there a way to set different class weights for the xgboost classifier? For example, in sklearn's RandomForestClassifier this is done with the "class_weight" parameter.

- Fiction

1

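Note: since sample_weight is no longer supported, none of the solutions below work. - SriK

1

scale_pos_weight is the correct parameter. See my answer below. - SriK

@SriK Yes, but it only applies to binary classification problems. - onofricamila

@SriK I'm not a very senior machine learning practitioner, but from what I can see in the scikit-learn version of XGBoost, sample weights are indeed available, and they just performed very well in my research on rare diseases. https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn - Simon Provost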

7 Answers

12

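When using the sklearn wrapper, there is a weight parameter.

For example:

import xgboost as xgb
exgb_classifier = xgb.XGBClassifier()
exgb_classifier.fit(X, y, sample_weight=sample_weights_data)

The parameter should be array-like with length N, equal to the length of the target.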

- epattaro

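How do you use this inside a Pipeline? You can't call fit directly inside the pipeline. - Deshwal

@Deshwal, since that is a different kind of question, I don't want to go into an answer unrelated to the original question, but here is a good article that discusses it: https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-the-step-by-step-guide-with-python-code-4a7d9b068156 - Simon Provost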
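
On the Pipeline question above, one common route is scikit-learn's step-name prefix for fit parameters: keyword arguments of the form <step>__<param> passed to Pipeline.fit are forwarded to that step's fit. The sketch below is only illustrative. The step names "scale" and "clf" and the toy data are made up, LogisticRegression stands in for XGBClassifier so the snippet runs with scikit-learn alone, and it assumes metadata routing is left at its default (disabled).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy imbalanced data (made up for illustration): 8 negatives, 2 positives.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

# One weight per row: up-weight the rare positive class.
sample_weights = np.where(y == 1, 4.0, 1.0)

# LogisticRegression is a stand-in; any final estimator whose fit()
# accepts sample_weight could take its place.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# "clf__sample_weight" routes the array to fit() of the step named "clf".
pipe.fit(X, y, clf__sample_weight=sample_weights)
predictions = pipe.predict(X)
```

The prefix must match the name given to the classifier step in the Pipeline, so a step called "model" would need model__sample_weight instead.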

7

I ran into this problem recently, so I want to leave the solution I tried here.

import numpy as np
from xgboost import XGBClassifier

# Manually handling imbalance. Below is the same as computing float(18501)/392318
# on the training dataset.
# We are going to inversely assign the weights
weight_ratio = float(len(y_train[y_train == 0])) / float(len(y_train[y_train == 1]))
# NOTE: this creates an integer array, so the fractional weights assigned
# below are truncated (see the comment under this answer).
w_array = np.array([1]*y_train.shape[0])
w_array[y_train==1] = weight_ratio
w_array[y_train==0] = 1 - weight_ratio

xgc = XGBClassifier()
xgc.fit(x_df_i_p_filtered, y_train, sample_weight=w_array)

I'm not sure why, but the results were rather disappointing. I hope this helps someone.

[Reference] https://www.programcreek.com/python/example/99824/xgboost.XGBClassifier

- Pramit

2

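It should be w1 = np.array([1.0] * y_train.shape[0]), so that the numpy array is initialized with a float dtype. Otherwise, the statements that follow leave the numpy array containing all zeros. - Diego Amicabile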
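
To make the dtype point concrete, here is a small sketch with made-up labels and a made-up weight_ratio of 0.25. It only demonstrates NumPy's behaviour when fractional values are assigned into an integer array versus a float array.

```python
import numpy as np

y_train = np.array([0, 1, 1, 1])   # toy labels, made up for illustration
weight_ratio = 0.25                # any fractional weight shows the effect

# Integer array: assigning fractional values silently truncates them.
w_int = np.array([1] * y_train.shape[0])
w_int[y_train == 1] = weight_ratio
w_int[y_train == 0] = 1 - weight_ratio

# Float array: the fractional weights are kept.
w_float = np.ones(y_train.shape[0], dtype=float)
w_float[y_train == 1] = weight_ratio
w_float[y_train == 0] = 1 - weight_ratio

print(w_int)    # every entry is 0
print(w_float)  # 0.75 for class 0, 0.25 for class 1
```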

3

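The answers here are outdated. The sample_weight parameter is no longer supported and has been replaced by scale_pos_weight. Simply use scale_pos_weight = sum of negative instances / sum of positive instances.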

- SriK

2

Yes, but this only applies to binary classification problems. - onofricamila
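
As a minimal sketch of the ratio described in this answer, the snippet below counts negatives and positives in a made-up binary label array. The variable names are illustrative, and the XGBClassifier call is left as a comment so the snippet runs with NumPy alone.

```python
import numpy as np

y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # toy binary labels, made up

n_negative = int(np.sum(y_train == 0))
n_positive = int(np.sum(y_train == 1))

# Ratio of negative to positive instances: 6 / 2
scale_pos_weight = n_negative / n_positive
print(scale_pos_weight)  # 3.0

# The value would then be passed to the constructor rather than to fit(), e.g.
# XGBClassifier(scale_pos_weight=scale_pos_weight)
```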

3

from sklearn.utils.class_weight import compute_sample_weight
xgb_classifier.fit(X, y, sample_weight=compute_sample_weight("balanced", y))

- Tianhuang Su

2

Please add some explanation to your answer before jumping into the code. - Ofek Hod
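
For readers wondering what the one-liner above computes: the scikit-learn documentation describes the "balanced" mode as n_samples / (n_classes * np.bincount(y)). The sketch below reproduces that formula by hand with NumPy on made-up multi-class labels. It assumes integer labels numbered from 0 and is an illustration of the formula, not scikit-learn's actual implementation.

```python
import numpy as np

y = np.array([0, 0, 0, 0, 1, 1, 2])  # toy multi-class labels, made up

n_samples = len(y)
counts = np.bincount(y)              # samples per class: [4, 2, 1]
n_classes = len(counts)

# One weight per class, inversely proportional to class frequency.
class_weights = n_samples / (n_classes * counts)

# One weight per sample: look up each row's class weight.
sample_weight = class_weights[y]

# Each class ends up with the same total weight, n_samples / n_classes.
totals = np.bincount(y, weights=sample_weight)
```

Because the result is one weight per row, the same approach works for more than two classes, unlike scale_pos_weight.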

0

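Similar to the answers by @Firas Omrane and @Pramit, but I think it is slightly more Pythonic


    import numpy as np
    from sklearn.utils import class_weight

    # Map each class label to its 'balanced' class weight
    class_weights = dict(
        zip(
            [0, 1],
            class_weight.compute_class_weight(
                'balanced', classes=np.unique(train['class']), y=train['class']
            ),
        )
    )

    # NOTE: sample_weight expects one weight per row, so passing this
    # per-class dict does not work as written (see the comments below).
    xgb_classifier.fit(X, train['class'], sample_weight=class_weights)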

- skibee

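This class_weights format is not what xgb expects. Is additional work needed to make it work? Thanks. - juanbretti

1

@juanbretti Skibee's answer won't work with XGBoost's scikit-learn implementation, because it requires a list the same size as your target column that holds the weight for the i-th sample, rather than 1, 0, or the unique values in the column. This answer is therefore well suited to recording the class weights that should apply to your unique values, for example. However, I recommend using class_weight.compute_sample_weight when working with the XGBoost scikit-learn implementation. Does that make sense, or do you still have questions? - Simon Provost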
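
One possible way to bridge the gap discussed in these comments is to expand a per-class dict into one weight per row before calling fit. This is a sketch with made-up labels and a hand-written dict holding the balanced weights for those labels. The names class_weights and sample_weight are illustrative.

```python
import numpy as np

y = np.array([0, 0, 0, 1])            # toy labels, made up for illustration
class_weights = {0: 2 / 3, 1: 2.0}    # per-class dict, like the one built above

# Expand to one weight per sample, in the same order as y.
sample_weight = np.array([class_weights[label] for label in y.tolist()])

print(sample_weight.shape)  # (4,), the same length as y
```

An array built this way has the shape that fit(..., sample_weight=...) expects, whereas the dict itself does not.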

0

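You can also use the scale_pos_weight hyperparameter, as discussed in the XGBoost documentation. The advantage of this approach is that you don't need to build a sample weight vector or pass one at fit time.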

- skeller88

Interesting. I tried it on my problem; my question is how this approach differs from sample_weight in the fit method. Any insight you have on this would be great. - Simon Provost
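
One arithmetic observation that may help with this question: for binary labels, the ratio between the "balanced" positive and negative class weights equals the negative/positive count ratio used for scale_pos_weight. The sketch below checks this on made-up labels. It is a statement about the relative weights only, not a claim that the two settings produce identical models in XGBoost.

```python
import numpy as np

y = np.array([0] * 8 + [1] * 2)   # toy labels: 8 negatives, 2 positives

n_neg, n_pos = np.bincount(y)
scale_pos_weight = n_neg / n_pos  # negatives / positives

# 'balanced' class weights: n_samples / (n_classes * count)
w_neg = len(y) / (2 * n_neg)
w_pos = len(y) / (2 * n_pos)

# Relative up-weighting of positives under 'balanced' sample weights.
balanced_ratio = w_pos / w_neg

print(scale_pos_weight, balanced_ratio)
```

In other words, both approaches up-weight the positive class by the same relative factor. The per-sample vector additionally rescales all weights by a common constant and extends to more than two classes.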

- Firas Omrane · Accepted Answer

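For sklearn version < 0.19

Simply assign each entry of your training data its class weight. First get the class weights with class_weight.compute_class_weight from sklearn, then assign each row of the training data its appropriate weight.

I assume here that the training data has a column class containing the class number. I also assume there are nb_classes classes, numbered from 1 to nb_classes.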

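import numpy as np
from sklearn.utils import class_weight

# One weight per class, ordered like np.unique(train_df['class'])
classes_weights = list(class_weight.compute_class_weight('balanced',
                                             classes=np.unique(train_df['class']),
                                             y=train_df['class']))

# Classes are numbered 1..nb_classes, so class val maps to index val-1
weights = np.ones(y_train.shape[0], dtype = 'float')
for i, val in enumerate(y_train):
    weights[i] = classes_weights[val-1]

xgb_classifier.fit(X, y, sample_weight=weights)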
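
The loop above depends on the classes being numbered from 1 to nb_classes. As a hedged alternative sketch, np.unique with return_inverse gives each row's position in the sorted class list, so the lookup no longer depends on how the labels are numbered. The labels below are made up, and the weights are computed by hand with the "balanced" formula so the snippet runs with NumPy alone.

```python
import numpy as np

y_train = np.array([1, 1, 1, 2, 3, 3])  # toy labels numbered from 1, made up

# classes: sorted unique labels; inverse: index of each row's class in classes
classes, inverse, counts = np.unique(
    y_train, return_inverse=True, return_counts=True
)

# 'balanced' formula: n_samples / (n_classes * count), one weight per class
classes_weights = len(y_train) / (len(classes) * counts)

# Vectorized replacement for the loop: one weight per row
weights = classes_weights[inverse]
```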

For sklearn version >= 0.19

There is a simpler solution

from sklearn.utils import class_weight
classes_weights = class_weight.compute_sample_weight(
    class_weight='balanced',
    y=train_df['class']
)

xgb_classifier.fit(X, y, sample_weight=classes_weights)