How to get the average score of K-Fold cross-validation using sklearn

I'm applying a decision tree with the K-fold method using sklearn, and I'd like someone to help me show its average score. Here is my code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix,classification_report

dta=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data")

X=dta.drop("whether he/she donated blood in March 2007",axis=1)

X=X.values # convert dataframe to numpy array

y=dta["whether he/she donated blood in March 2007"]

y=y.values # convert dataframe to numpy array

kf = KFold(n_splits=10)

clf_tree=DecisionTreeClassifier()

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf=clf_tree.fit(X_train,y_train)
    print("classification_report_tree", 
           classification_report(y_test,clf_tree.predict(X_test)))

What do you mean by "average score"? Do you need only accuracy? Or also recall, precision and f1 (since you are printing the classification report)? - Vivek Kumar
I want to run a decision tree with K-fold cross-validation and display the overall accuracy. With 10 folds, each run gives one accuracy; how can I display the overall accuracy of the training? - Ngọc Vũ Đình
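(As a minimal sketch of what this comment asks for: record the accuracy of each fold inside the loop above and average afterwards. accuracy_score comes from sklearn.metrics; fold_accuracies is a name introduced here for illustration.)

from sklearn.metrics import accuracy_score

fold_accuracies = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf_tree.fit(X_train, y_train)
    # accuracy on this fold's held-out split
    fold_accuracies.append(accuracy_score(y_test, clf_tree.predict(X_test)))

print("mean accuracy over 10 folds:", np.mean(fold_accuracies))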
2 Answers


If you only need the accuracy, you can simply use cross_val_score():

from sklearn.model_selection import cross_val_score

kf = KFold(n_splits=10)
clf_tree = DecisionTreeClassifier()
scores = cross_val_score(clf_tree, X, y, cv=kf)

avg_score = np.mean(scores)
print(avg_score)

Here, cross_val_score takes your original X and y as input (not split into train and test sets). cross_val_score splits them into train and test sets by itself, fits the model on the training data, scores it on the test data, and returns those scores in the scores variable.
So when you have 10 folds, 10 scores are returned in the scores variable. You then simply take their mean.
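As a usage note, cross_val_score also accepts a scoring parameter if you later want a metric other than the classifier's default. A minimal sketch, with "accuracy" spelled out only for clarity (it is already the default for classifiers):

scores = cross_val_score(clf_tree, X, y, cv=kf, scoring="accuracy")
print("accuracy per fold:", scores)
print("mean: %.3f, std: %.3f" % (scores.mean(), scores.std()))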

You can try the precision_recall_fscore_support metric from sklearn and then average the per-fold results to get the average score for each class. I assume here that you want the per-class average scores.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV,cross_val_score

dta=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data")

X=dta.drop("whether he/she donated blood in March 2007",axis=1)

X=X.values # convert dataframe to numpy array

y=dta["whether he/she donated blood in March 2007"]

y=y.values # convert dataframe to numpy array

kf = KFold(n_splits=10)

clf_tree=DecisionTreeClassifier()

score_array =[]
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf=clf_tree.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    score_array.append(precision_recall_fscore_support(y_test, y_pred, average=None))

avg_score = np.mean(score_array,axis=0)
print(avg_score)

#Output:
#[[  0.77302466   0.30042282]
# [  0.81755068   0.22192344]
# [  0.79063779   0.24414489]
# [ 57.          17.8       ]]

Now, to get the precision of class 0 you can use avg_score[0][0]. The recall can be accessed via the second row (i.e. for class 0 it is avg_score[1][0]), and the fscore and support can be accessed from the 3rd and 4th rows respectively.
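A small illustration of that indexing (unpacking avg_score row by row; the variable names here are introduced only for clarity):

# rows of avg_score are precision, recall, fscore, support (in that order);
# columns are classes 0 and 1
precision, recall, fscore, support = avg_score
print("class 0 precision:", precision[0])   # same as avg_score[0][0]
print("class 0 recall:", recall[0])         # same as avg_score[1][0]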

While the other answer is technically correct, this one also shows how to actually train the model! :) - jcr
