After generating the SHAP values, I get an error when I use shap.plots.waterfall.

For the code given below, if I just run the command shap.plots.waterfall(shap_values[6]), I get the error:

'numpy.ndarray' object has no attribute 'base_values'

I have to first run these two commands:

explainer2 = shap.Explainer(clf.best_estimator_.predict, X_train)
shap_values = explainer2(X_train)

and then run the waterfall command to get the correct plot. Below is an example where the error occurs:

from sklearn.datasets import make_classification
import seaborn as sns
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import pickle
import joblib
import warnings
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

f, (ax1,ax2) = plt.subplots(nrows=1, ncols=2,figsize=(20,8))
# Generate noisy Data
X_train,y_train = make_classification(n_samples=1000, 
                          n_features=50, 
                          n_informative=9, 
                          n_redundant=0, 
                          n_repeated=0, 
                          n_classes=10, 
                          n_clusters_per_class=1,
                          class_sep=9,
                          flip_y=0.2,
                          #weights=[0.5,0.5], 
                          random_state=17)

X_test,y_test = make_classification(n_samples=500, 
                          n_features=50, 
                          n_informative=9, 
                          n_redundant=0, 
                          n_repeated=0, 
                          n_classes=10, 
                          n_clusters_per_class=1,
                          class_sep=9,
                          flip_y=0.2,
                          #weights=[0.5,0.5], 
                          random_state=17)

model = RandomForestClassifier()

parameter_space = {
    'n_estimators': [10,50,100],
    'criterion': ['gini', 'entropy'],
    'max_depth': np.linspace(10,50,11).astype(int),  # max_depth must be an integer
}

clf = GridSearchCV(model, parameter_space, cv = 5, scoring = "accuracy", verbose = True) # model
my_model = clf.fit(X_train,y_train)
print(f'Best Parameters: {clf.best_params_}')

# save the model to disk
filename = f'Testt-RF.sav'
pickle.dump(clf, open(filename, 'wb'))

explainer = shap.Explainer(clf.best_estimator_)
shap_values = explainer.shap_values(X_train)  # plain numpy arrays, not an Explanation object

shap.plots.waterfall(shap_values[6])  # raises: 'numpy.ndarray' object has no attribute 'base_values'
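
For reference, the workaround sequence described at the top (building the explainer from the predict callable and calling it, so the result is an Explanation object with base_values) does give me a plot:

explainer2 = shap.Explainer(clf.best_estimator_.predict, X_train)  # model-agnostic explainer on the predict function
shap_values = explainer2(X_train)     # returns an Explanation object (.values, .base_values, .data)
shap.plots.waterfall(shap_values[6])  # this one works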

Can you tell me the correct steps to produce shap.plots.waterfall for the training data?

Thanks!


You can't train a model on one dataset and then predict on a separately generated one. The whole point of machine learning is that both should come from a similar (or the same) distribution. - Sergey Bushmanov
2 Answers

Here is what I have used:
from sklearn.datasets import make_classification
from shap import Explainer, Explanation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from shap import waterfall_plot

X, y = make_classification(1000, 50, n_informative=9, n_classes=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.75, random_state=42)
model = RandomForestClassifier()
model.fit(X_train, y_train)

explainer = Explainer(model)
sv = explainer(X_train)  # Explanation; for this multiclass tree model .values has shape (n_samples, n_features, n_classes)

# slice out class 6 to build a single-output Explanation
exp = Explanation(sv[:,:,6], sv.base_values[:,6], X_train, feature_names=None)
idx = 7 # datapoint to explain
waterfall_plot(exp[idx])

[waterfall plot for class 6, datapoint idx = 7]


Thank you very much for the solution. Just one clarification: in your example X_train has shape (750 x 50), i.e. 750 samples and 50 features. For the line exp = Explanation(sv[:,:,6], sv.base_values[:,6], X_train, feature_names=None), does that mean we are looking at the 7th class? And is idx the 8th sample row? - Joe
Yes to both, but you can check visually that they agree by comparing the SHAP prediction and the model prediction for the same data. - Sergey Bushmanov
Thanks again! I can easily check the model's prediction for idx 7 with predictions = model.predict(X_train); y_pred = predictions; y_pred[7]. How can I check the SHAP prediction, to see whether I get the same answer? - Joe
The 0.67 you see in the top right corner of the plot is the probability of class 1. - Sergey Bushmanov
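A minimal sketch of that check, assuming the model, sv/exp, and idx from the answer above, and assuming the tree explainer's output here corresponds to predict_proba: the base value plus the sum of a sample's SHAP values should reproduce the model's probability for the sliced class.

shap_pred = exp[idx].base_values + exp[idx].values.sum()  # SHAP additivity: base value + contributions
model_pred = model.predict_proba(X_train)[idx, 6]         # model's probability of class 6 for sample idx
print(shap_pred, model_pred)                              # should be (approximately) equal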
I get an error on sv: too many indices for array. I can't index it like sv[:,:,6]; my sv only contains .values, .base_values and .data. Any hints?

This worked for me:
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0].values, df.values[0], feature, max_display=20)
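
For context, the positional arguments to waterfall_legacy are the expected value, the SHAP values for one sample, that sample's feature values, and the feature names; df and feature appear to be the poster's own DataFrame and feature-name list, which are not defined in this snippet.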

[legacy waterfall plot]

