How does scikit-learn compute f1_macro for multiclass classification?

Question

5

I assumed that scikit-learn computes f1_macro for multiclass classification with this formula:

2 * Macro_precision * Macro_recall / (Macro_precision + Macro_recall)

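But a manual check gives a different result, slightly higher than the value scikit-learn computes. I went through the documentation and could not find the formula.

For example, the iris dataset yields the following: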

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
data=pd.DataFrame({
    'sepal length':iris.data[:,0],
    'sepal width':iris.data[:,1],
    'petal length':iris.data[:,2],
    'petal width':iris.data[:,3],
    'species':iris.target
})

X=data[['sepal length', 'sepal width', 'petal length', 'petal width']]
y=data['species']  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf=RandomForestClassifier(n_estimators=100)

clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

#Compute metrics using scikit
from sklearn import metrics
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))
pre_macro = metrics.precision_score(y_test, y_pred, average="macro")
recall_macro = metrics.recall_score(y_test, y_pred, average="macro")
f1_macro_scikit = metrics.f1_score(y_test, y_pred, average="macro")
print ("Prec_macro_scikit:", pre_macro)
print ("Rec_macro_scikit:", recall_macro)
print ("f1_macro_scikit:", f1_macro_scikit)

Output:

Prec_macro_scikit: 0.9555555555555556
Rec_macro_scikit: 0.9666666666666667
f1_macro_scikit: 0.9586466165413534

However, the manual calculation:

f1_macro_manual = 2 * pre_macro * recall_macro / (pre_macro + recall_macro )

yields:

f1_macro_manual: 0.9610789980732178

I am trying to figure out where this difference comes from.

- Anderlecht

1

Show code that reproduces the problem. Are you sure you actually turned on macro averaging? - user2357112

@user2357112 The code has been updated. - Anderlecht

2 Answers

2

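Final update:

After user2357112's very valuable comment (see his/her answer below), and after reading several misconceptions and pieces of misinformation on the web, I finally investigated the macro-averaged F1 score formula. As user2357112 also pointed out (first, in fact), the algorithm behind f1_macro differs slightly from the one you used in your manual calculation. In the end I found a reliable source.

Proof that sklearn averages the per-class F1 scores:

A snippet from the precision_recall_fscore_support() function in sklearn's classification.py module: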

precision = _prf_divide(tp_sum, pred_sum,
                        'precision', 'predicted', average, warn_for)
recall = _prf_divide(tp_sum, true_sum,
                     'recall', 'true', average, warn_for)
# Don't need to warn for F: either P or R warned, or tp == 0 where pos
# and true are nonzero, in which case, F is well-defined and zero

f_score = ((1 + beta2) * precision * recall /
           (beta2 * precision + recall))

f_score[tp_sum == 0] = 0.0

# Average the results

if average == 'weighted':
    weights = true_sum
    if weights.sum() == 0:
        return 0, 0, 0, None
elif average == 'samples':
    weights = sample_weight
else:
    weights = None

if average is not None:
    assert average != 'binary' or len(precision) == 1

    precision = np.average(precision, weights=weights)
    recall = np.average(recall, weights=weights)
    f_score = np.average(f_score, weights=weights)

    true_sum = None  # return no support

return precision, recall, f_score, true_sum

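As we can see, sklearn computes the F-score for each class first and only then averages. Precision, recall, and F-score are each averaged independently, so the averaged F-score is not derived from the averaged precision and recall: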

precision = np.average(precision, weights=weights)
recall = np.average(recall, weights=weights)
f_score = np.average(f_score, weights=weights)
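Written out as formulas, the two quantities being confused look like this. This is only a sketch in notation introduced here for illustration: K is the number of classes, P_k and R_k are the per-class precision and recall, and H is the value computed manually in the question. It assumes beta = 1, and per the snippet above sklearn sets the per-class score to 0 when a class has no true positives.

```latex
F_{1,k} = \frac{2\,P_k R_k}{P_k + R_k},
\qquad
F_1^{\text{macro}} = \frac{1}{K}\sum_{k=1}^{K} F_{1,k},
\qquad
H = \frac{2\,\bar{P}\,\bar{R}}{\bar{P} + \bar{R}}
\quad\text{with}\quad
\bar{P} = \frac{1}{K}\sum_{k=1}^{K} P_k,\;
\bar{R} = \frac{1}{K}\sum_{k=1}^{K} R_k .
```

sklearn returns the middle expression for average='macro'. In general it differs from H, because the harmonic mean is nonlinear and therefore does not commute with averaging over classes.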

Finally, here is your code, slightly modified:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.model_selection import train_test_split


iris = datasets.load_iris()
data=pd.DataFrame({
    'sepal length':iris.data[:,0],
    'sepal width':iris.data[:,1],
    'petal length':iris.data[:,2],
    'petal width':iris.data[:,3],
    'species':iris.target
})

X=data[['sepal length', 'sepal width', 'petal length', 'petal width']]
y=data['species']  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf=RandomForestClassifier(n_estimators=100)

clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

#Compute metrics using scikit
from sklearn import metrics
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))
pre_macro = metrics.precision_score(y_test, y_pred, average="macro")
recall_macro = metrics.recall_score(y_test, y_pred, average="macro")
f1_macro_scikit = metrics.f1_score(y_test, y_pred, average="macro")

f1_score_raw = metrics.f1_score(y_test, y_pred, average=None)

f1_macro_manual = f1_score_raw.mean()

print ("Prec_macro_scikit:", pre_macro)
print ("Rec_macro_scikit:", recall_macro)
print ("f1_macro_scikit:", f1_macro_scikit)

print("f1_score_raw:", f1_score_raw)
print("f1_macro_manual:", f1_macro_manual)

Output:

[[16  0  0]
 [ 0 15  0]
 [ 0  6  8]]
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        16
          1       0.71      1.00      0.83        15
          2       1.00      0.57      0.73        14

avg / total       0.90      0.87      0.86        45

Prec_macro_scikit: 0.9047619047619048
Rec_macro_scikit: 0.8571428571428571
f1_macro_scikit: 0.8535353535353535
f1_score_raw: [1.         0.83333333 0.72727273]
f1_macro_manual: 0.8535353535353535
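As a side note, the 0.86 in the "avg / total" row of the report is not the macro average either. Assuming that row is the support-weighted average (the `average == 'weighted'` branch in the snippet, where `weights = true_sum`), it can be reproduced from the per-class values. The sketch below hard-codes the per-class F1 scores and supports from this particular run; the variable names are illustrative only.

```python
import numpy as np

# Per-class F1 scores for this run, written as exact fractions
# (1.0, 0.8333..., 0.7272...), and the per-class supports from the report.
f1_per_class = np.array([1.0, 15 / 18, 8 / 11])
support = np.array([16, 15, 14])

# Macro average: every class counts equally.
f1_macro = np.average(f1_per_class)

# Weighted average: every class counts in proportion to its support.
f1_weighted = np.average(f1_per_class, weights=support)

print("f1_macro:", f1_macro)
print("f1_weighted:", f1_weighted)
```

The weighted value rounds to 0.86, which matches the report row, while the macro value stays at 0.8535.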

You can also compute it in a way that stays close to your original "manual calculation":

import numpy as np

# Per-class precision and recall (one value per class)
pre = metrics.precision_score(y_test, y_pred, average=None)
recall = metrics.recall_score(y_test, y_pred, average=None)

# Per-class F1 first, then the unweighted mean over the classes
f1_per_class = 2 * pre * recall / (pre + recall)
f1_macro_manual = np.average(f1_per_class)

print("f1_macro_manual_2:", f1_macro_manual)

Output:

f1_macro_manual_2: 0.8535353535353535
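To see both numbers side by side without retraining anything, here is a minimal NumPy-only sketch that starts from the confusion matrix printed above. It assumes rows are true classes and columns are predicted classes (the sklearn convention), and that every class has at least one true and one predicted sample, so no division by zero occurs.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([[16, 0, 0],
               [0, 15, 0],
               [0, 6, 8]])

tp = np.diag(cm).astype(float)
precision = tp / cm.sum(axis=0)  # per-class precision: TP / predicted count
recall = tp / cm.sum(axis=1)     # per-class recall: TP / true count

# Macro F1 as sklearn reports it: per-class F1 first, then the plain mean.
f1_per_class = 2 * precision * recall / (precision + recall)
f1_macro = f1_per_class.mean()

# The quantity from the question: harmonic mean of macro precision and recall.
macro_p, macro_r = precision.mean(), recall.mean()
f1_of_macros = 2 * macro_p * macro_r / (macro_p + macro_r)

print("f1_macro:", f1_macro)
print("harmonic mean of macro P and R:", f1_of_macros)
```

For this matrix the first value matches f1_macro_scikit, and the second one is higher, which is the same pattern observed in the question.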

- Geeocode

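The sklearn average is set to "macro". I am not using binary. - Anderlecht

I think that blog makes the same mistake as Andy G: the macro-averaged F1 score is not the harmonic mean of macro-averaged precision and recall. That harmonic mean is the 2 * Macro_precision * Macro_recall / (Macro_precision + Macro_recall) that Andy G mentions in the question. The macro-averaged F1 score has to be computed by macro-averaging the F1 scores, not from macro-averaged precision and recall. - user2357112

@user2357112 You are absolutely right. I have made a final update after investigating further. Thanks. Please see my update. - Geeocode

The new downvote isn't mine; I think the voter may have been confused by my now-outdated comment, which currently has 1 upvote. - user2357112

@user2357112 Thank you very much. I am a little disappointed, because I really did put together a thorough explanation and research on this topic. But so be it. In that case it would be wise to delete your related comment, and to tell me if you think I should give your comment more weight in my main answer, just to make the situation clear for future readers. - Geeocode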

- user2357112 · Accepted Answer

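Macro-averaging isn't supposed to work that way. A macro-averaged F1 score is not computed from macro-averaged precision and recall values.

Macro-averaging computes the value of a metric for each class and returns the unweighted average of the individual values. Thus, computing f1_score with average='macro' computes the F1 score for each class and returns the average of those scores.

If you want to compute the macro-average yourself, specify average=None to get an array of per-class binary F1 scores, then take the mean() of that array: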

binary_scores = metrics.f1_score(y_test, y_pred, average=None)
manual_f1_macro = binary_scores.mean()

Runnable demo here.
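Because the iris split in the question is random, the exact numbers change from run to run. As a deterministic check of the point above, here is a small sketch on a hand-made label set; the labels are invented purely for illustration and are small enough to verify by hand.

```python
import numpy as np
from sklearn import metrics

# Toy labels (hypothetical data): one class-2 sample is mispredicted as class 1.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2]

# Per-class F1 scores and sklearn's macro average of them.
per_class_f1 = metrics.f1_score(y_true, y_pred, average=None)
macro_f1 = metrics.f1_score(y_true, y_pred, average="macro")

# The formula from the question: harmonic mean of macro precision and recall.
macro_p = metrics.precision_score(y_true, y_pred, average="macro")
macro_r = metrics.recall_score(y_true, y_pred, average="macro")
harmonic_of_macros = 2 * macro_p * macro_r / (macro_p + macro_r)

print("per-class F1:", per_class_f1)
print("mean of per-class F1:", per_class_f1.mean())
print("f1_score(average='macro'):", macro_f1)
print("harmonic mean of macro P and R:", harmonic_of_macros)
```

The mean of the per-class scores agrees with average='macro', while the harmonic mean of the macro precision and recall comes out higher for these labels.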