XGBoost and sparse matrices

Question

7

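I am trying to use xgboost in Python to solve a classification problem. My data is stored in a NumPy matrix X (rows = observations, columns = features) and the labels in a NumPy array y. Because my data is sparse, I would like to run it on a sparse version of X, but I seem to be doing something wrong, because an error occurs. Here is what I do: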

# Library import

import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from scipy.sparse import csr_matrix

# Converting to sparse data and running xgboost

X_csr = csr_matrix(X)
xgb1 = XGBClassifier()
xgtrain = xgb.DMatrix(X_csr, label = y )      #to work with the xgb format
xgtest = xgb.DMatrix(Xtest_csr)
xgb1.fit(xgtrain, y, eval_metric='auc')
dtrain_predictions = xgb1.predict(xgtest)

When I try to fit the classifier, I get this error:

File ".../xgboost/python-package/xgboost/sklearn.py", line 432, in fit
self._features_count = X.shape[1]

AttributeError: 'DMatrix' object has no attribute 'shape'

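I have spent some time trying to work out where this comes from, and I believe it has to do with the sparse format I want to use. But what exactly the cause is, and how to fix it, I have no clue.

Any help or comments are very welcome! Thank you very much.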

- PLV

Does this work with X? What does xgb say about using sparse matrices? They are usually not drop-in replacements. - hpaulj

4 Answers

0

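The problem arises because DMatrix.num_col() only returns the number of non-zero columns in the sparse matrix.
Use scipy.sparse.coo_matrix.tocsc to convert the matrix to compressed sparse column (CSC) format.
See http://github.com/dmlc/xgboost/issues/1238#issuecomment-243872543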
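The conversion mentioned above can be sketched as follows, using only NumPy and SciPy. The coordinates and values are made up for illustration, and the last column is deliberately left empty to show that the declared shape survives the conversion.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Made-up coordinates and values; column 2 has no stored entries
rows = np.array([0, 1, 3])
cols = np.array([1, 0, 1])
vals = np.array([1.0, 2.0, 3.0])

# Declare the shape explicitly so the empty trailing column is kept
X_coo = coo_matrix((vals, (rows, cols)), shape=(4, 3))

# Convert to compressed sparse column format
X_csc = X_coo.tocsc()

print(X_csc.format)            # storage format of the converted matrix
print(X_csc.shape)             # declared shape, including the empty column
print(X_csc.getnnz(axis=0))    # stored entries per column
```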

- Amey Laddad

0

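X_csr = csr_matrix(X) has many of the same properties as X, including .shape. But it is not a subclass, and it is not a drop-in replacement. Code needs to be "sparse-aware". sklearn qualifies; in fact, it adds a number of its own fast sparse utility functions.

But I don't know how well xgb handles sparse matrices, nor how it plays with sklearn.

Assuming the problem is with xgtrain, you need to look at its type and attributes. How does it compare with the one made with xgb.DMatrix(X, label = y )?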
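As a minimal sketch of that kind of inspection, assuming only NumPy and SciPy and a small made-up array standing in for X, the dense and sparse objects can be compared like this:

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# Made-up, mostly-zero data standing in for X
X = np.array([[0.0, 1.0, 0.0],
              [2.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 3.0, 4.0]])
X_csr = csr_matrix(X)

# The sparse matrix exposes .shape just like the dense array
print(type(X_csr).__name__, X_csr.shape)

# It is not an ndarray subclass, so ndarray-only code may not accept it
print(isinstance(X_csr, np.ndarray))
print(issparse(X_csr))

# Number of explicitly stored entries
print(X_csr.nnz)

# Converting back to dense recovers the original values
print(np.array_equal(X_csr.toarray(), X))
```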

If you want help from someone who is not an xgboost user, you will have to give a lot more information about the objects in your code.

- hpaulj

0

I prefer the native XGBoost training API to the XGBoost sklearn wrapper. You can create the classifier like this:

params = {
    # I'm assuming you are doing binary classification
    'objective': 'binary:logistic',
    # xgb.train takes no `metrics` argument, so the metric is set in params
    'eval_metric': 'auc',
    # any other training params here
    # full parameter list here https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
}
booster = xgb.train(params, xgtrain)

This API also has built-in cross-validation, xgb.cv, which works much better with XGBoost.

https://xgboost.readthedocs.io/en/latest/get_started/index.html#python

More examples here: https://github.com/dmlc/xgboost/tree/master/demo/guide-python

Hope this helps.

- volker238

- A.A. · Accepted Answer

You are using the xgboost scikit-learn API (http://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn), so you do not need to convert your data to a DMatrix to fit XGBClassifier(). Just remove the line

xgtrain = xgb.DMatrix(X_csr, label = y )

The following should work:

type(X_csr) #scipy.sparse.csr.csr_matrix
type(y) #numpy.ndarray
xgb1 = xgb.XGBClassifier()
xgb1.fit(X_csr, y)

Output:

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
   gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
   min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
   objective='binary:logistic', reg_alpha=0, reg_lambda=1,
   scale_pos_weight=1, seed=0, silent=True, subsample=1)