How to use XGBoost in a PySpark pipeline

Question

12

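I want to update my pyspark code. In pyspark, the base model has to be placed in a pipeline, and the official pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model with the pipeline API. How can I use pyspark like this?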

from xgboost import XGBClassifier
...
model = XGBClassifier()
model.fit(X_train, y_train)
pipeline = Pipeline(stages=[..., model, ...])
...

The pipeline API is very convenient to use, so could anybody give some advice? Thanks.

- Daniel Du

3 Answers

5

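As mentioned above, there is a well-maintained distributed XGBoost library (https://github.com/dmlc/xgboost) that several companies use in production, but using it from PySpark is a bit tricky. Someone has made a working pyspark wrapper for version 0.72 of the library, and support for 0.8 is in progress.

Make sure the xgboost jars are on your pyspark jar path.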
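
One way to put the jars on the path, as a rough sketch: set PYSPARK_SUBMIT_ARGS before the first SparkSession or SparkContext is created. The jar file names and the /path/to locations below are assumptions for illustration, not something stated in this answer, so substitute the files you actually built or downloaded.

```python
import os

# Hypothetical jar locations (assumption); point these at your real files.
XGBOOST_JARS = [
    "/path/to/xgboost4j-0.72.jar",
    "/path/to/xgboost4j-spark-0.72.jar",
]


def set_pyspark_jars(jar_paths):
    """Tell PySpark which extra jars to ship when it launches the JVM."""
    # PYSPARK_SUBMIT_ARGS is read when the JVM gateway starts, so this must
    # run before any SparkSession or SparkContext is created.
    args = "--jars {} pyspark-shell".format(",".join(jar_paths))
    os.environ["PYSPARK_SUBMIT_ARGS"] = args
    return args


set_pyspark_jars(XGBOOST_JARS)
```

If the session is launched with spark-submit or the pyspark shell instead, passing the same comma-separated list to --jars on the command line should have the same effect.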

- Rafael

5

Here is an XGBoost implementation that works with Spark 2.4 and above:

Note that this is an external library, but it should be easy to use with Spark.

- Rafael Larios

- Pierre Gourseaud · Accepted Answer

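As of version 2.3, there is no XGBoost classifier in Apache Spark ML. The available models are listed here: https://spark.apache.org/docs/2.3.0/ml-classification-regression.html If you want to use XGBoost, you should either do it without pyspark (convert your Spark DataFrame to a Pandas DataFrame with .toPandas()) or use another algorithm (https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#module-pyspark.ml.classification).

However, if you really want to use XGBoost with pyspark, you will have to dig into pyspark and implement distributed XGBoost yourself. Here is an article explaining how to do it: http://dmlc.ml/2016/10/26/a-full-integration-of-xgboost-and-spark.html.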
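
The .toPandas() route could look roughly like the sketch below. The helper name fit_on_driver, the column names, and the tiny example DataFrame are all made up for illustration. Training happens on the driver only, outside the Spark pipeline API, so this suits data that fits in memory on one machine.

```python
def fit_on_driver(spark_df, feature_cols, label_col, model):
    """Collect a Spark DataFrame to the driver and fit a scikit-learn style model.

    fit_on_driver is a hypothetical helper, not part of pyspark or xgboost.
    """
    # toPandas() pulls every selected row into driver memory.
    pdf = spark_df.select(*feature_cols, label_col).toPandas()
    model.fit(pdf[feature_cols], pdf[label_col])
    return model


def example():
    # Illustrative usage; needs pyspark, pandas and xgboost installed.
    from pyspark.sql import SparkSession
    from xgboost import XGBClassifier

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    # Made-up toy data with two features and a binary label.
    df = spark.createDataFrame(
        [(0.0, 1.0, 0), (1.0, 0.0, 1), (0.5, 0.5, 0), (0.9, 0.1, 1)],
        ["f1", "f2", "label"],
    )
    return fit_on_driver(df, ["f1", "f2"], "label", XGBClassifier(n_estimators=5))
```

The fitted model is a plain XGBClassifier, so it cannot be passed as a stage to pyspark.ml.Pipeline. Predictions also have to be made on pandas data.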