Getting the maximum accuracy for a binary probabilistic classifier in scikit-learn

7

Is there any built-in function in scikit-learn to get the maximum accuracy for a binary probabilistic classifier?

For example, to get the maximum F1 score I do the following:

import sklearn.metrics

# Area under the precision-recall curve (AUPRC)
precision, recall, thresholds = sklearn.metrics.precision_recall_curve(y_true, y_score)
auprc = sklearn.metrics.auc(recall, precision)
max_f1 = 0
for r, p, t in zip(recall, precision, thresholds):
    if p + r == 0: continue
    if (2*p*r)/(p + r) > max_f1:
        max_f1 = (2*p*r)/(p + r)
        max_f1_threshold = t

I can compute the maximum accuracy in a similar way:

import numpy as np
import sklearn.metrics

accuracies = []
thresholds = np.arange(0, 1, 0.1)
for threshold in thresholds:
    # Predict the positive class when the score exceeds the threshold
    y_pred = np.greater(y_score, threshold).astype(int)
    accuracy = sklearn.metrics.accuracy_score(y_true, y_pred)
    accuracies.append(accuracy)

accuracies = np.array(accuracies)
max_accuracy = accuracies.max()
max_accuracy_threshold = thresholds[accuracies.argmax()]

But I wonder whether there is any built-in function.


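Hi Franck, did you ever find a built-in function? I am looking for the same thing right now. - Geeocode
1
@GeorgeSolymosi I did not find a built-in function. - Franck Dernoncourt
1
Thanks for the reminder. Note that the line accuracy = np.array(accuracy) should be accuracy = np.array(accuracies) or something similar :) - Geeocode
@GeorgeSolymosi Thanks, good catch! - Franck Dernoncourt
By the way, Franck's code is nice, clear and transparent! - Geeocode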
2 Answers

6
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, probs)
probs = np.asarray(probs)
accuracy_scores = []
for thresh in thresholds:
    # roc_curve treats a sample as positive when score >= threshold
    accuracy_scores.append(accuracy_score(y_true, probs >= thresh))

accuracies = np.array(accuracy_scores)
max_accuracy = accuracies.max()
max_accuracy_threshold = thresholds[accuracies.argmax()]
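
The loop over `roc_curve` thresholds can also be written without calling `accuracy_score` repeatedly, since accuracy follows directly from `fpr` and `tpr` once the class counts are known. The following is a minimal sketch, not a built-in: the helper name `max_accuracy_from_roc` is hypothetical, and it assumes labels encoded as 0/1 with both classes present.

```python
import numpy as np
from sklearn.metrics import roc_curve


def max_accuracy_from_roc(y_true, y_score):
    """Return (best accuracy, threshold) over the thresholds of the ROC curve.

    Hypothetical helper; assumes 0/1 labels and at least one sample per class.
    """
    y_true = np.asarray(y_true).astype(int)
    n_pos = y_true.sum()
    n_neg = y_true.size - n_pos

    fpr, tpr, thresholds = roc_curve(y_true, y_score)

    # TP = tpr * n_pos and TN = (1 - fpr) * n_neg, so
    # accuracy = (TP + TN) / n for every threshold at once.
    accuracies = (tpr * n_pos + (1.0 - fpr) * n_neg) / y_true.size

    best = int(np.argmax(accuracies))
    return accuracies[best], thresholds[best]
```

The returned threshold follows the `roc_curve` convention, so a sample is predicted positive when `y_score >= threshold`.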


2
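I started improving the solution by turning thresholds = np.arange(0,1,0.1) into a smarter, bisection-style search.
Then, after two hours of work, I realized that computing all the accuracies is much cheaper than searching for the maximum alone! (Yes, it is completely counter-intuitive.)
I wrote a lot of comments below to explain my code. Feel free to delete them all to make the code more readable.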
import numpy as np

# Convention: we predict True if y_score > threshold
def ROC_curve_data(y_true, y_score):
    y_true  = np.asarray(y_true,  dtype=np.bool_)
    y_score = np.asarray(y_score, dtype=np.float64)  # np.float_ was removed in NumPy 2.0
    assert y_score.size == y_true.size

    order = np.argsort(y_score)  # sort the samples by increasing score
    y_true = y_true[order]
    # The thresholds to consider are the score values themselves, plus 0
    # (accept everything, assuming all scores are strictly positive)
    thresholds = np.insert(y_score[order], 0, 0)
    TP = [sum(y_true)]   # True positives: at threshold 0 every positive is accepted
    FP = [sum(~y_true)]  # False positives: at threshold 0 every negative is accepted too
    TN = [0]             # True negatives: nothing is rejected at threshold 0
    FN = [0]             # False negatives: nothing is rejected at threshold 0

    for i in range(1, thresholds.size):
        # Raising the threshold to thresholds[i] flips sample i-1 (in sorted order)
        # from predicted True to predicted False. What does y_true say about that?
        # If y_true was True, the flip is a mistake: one TP becomes an FN.
        TP.append(TP[-1] - int(y_true[i-1]))
        FN.append(FN[-1] + int(y_true[i-1]))
        # If y_true was False, the flip is correct: one FP becomes a TN.
        FP.append(FP[-1] - int(~y_true[i-1]))
        TN.append(TN[-1] + int(~y_true[i-1]))

    TP = np.asarray(TP, dtype=np.int_)
    FP = np.asarray(FP, dtype=np.int_)
    TN = np.asarray(TN, dtype=np.int_)
    FN = np.asarray(FN, dtype=np.int_)

    # Accuracy, sensitivity and specificity are derived from these counts by the caller
    return thresholds, TP, FP, TN, FN

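The whole thing is just a single loop, and the algorithm is trivial. In fact, this stupidly simple function is 10 times faster than the solution I proposed earlier (computing the accuracies for thresholds = np.arange(0,1,0.1)) and 30 times faster than my previous clever bisection algorithm... You can then easily compute any KPI you want, for example: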
def max_accuracy(thresholds, TP, FP, TN, FN) :
    accuracy    = (TP + TN) / (TP + FP + TN + FN)
    return(max(accuracy))

def max_min_sensitivity_specificity(thresholds, TP, FP, TN, FN) :
    sensitivity = TP / (TP + FN)
    specificity = TN / (FP + TN)
    return(max(np.minimum(sensitivity, specificity)))

If you want to test it:

y_score = np.random.uniform(size = 100)
y_true = [np.random.binomial(1, p) for p in y_score]
data = ROC_curve_data(y_true, y_score)

# I personally use Jupyter; remove the next line otherwise
%matplotlib inline
import matplotlib.pyplot as plt
plt.step(data[0], data[1])
plt.step(data[0], data[2])
plt.step(data[0], data[3])
plt.step(data[0], data[4])
plt.show()

print("Max accuracy is", max_accuracy(*data))
print("Max of Min(Sensitivity, Specificity) is", max_min_sensitivity_specificity(*data))

Enjoy ;)


2
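The drawback of this is that, especially for imbalanced datasets, most of the variation in the scores may fall within the first or last interval. A better approach is to compute the threshold, tp, fp, fn and tn for each unique (tp, fp, fn, tn). This can be done efficiently in a single scan (scikit does this internally when computing AUCROC). - user48956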
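
The single-scan idea from this comment can be sketched with a sort and cumulative sums. This is an illustrative NumPy-only version, not scikit-learn's internal code: the function name `confusion_counts_per_threshold` is hypothetical, a sample is predicted positive when `y_score >= threshold`, tied scores are grouped into one threshold, and the input is assumed to be non-empty.

```python
import numpy as np


def confusion_counts_per_threshold(y_true, y_score):
    """Confusion counts for the rule `y_score >= threshold`, one entry per
    distinct score, plus an initial `inf` threshold (nothing predicted positive).

    Hypothetical helper; returns (thresholds, tp, fp, tn, fn).
    """
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)

    # Sort by decreasing score: lowering the threshold only adds positives.
    order = np.argsort(-y_score, kind="stable")
    y_true, y_score = y_true[order], y_score[order]

    # Index of the last sample in each run of tied scores.
    last_of_run = np.r_[np.flatnonzero(np.diff(y_score)), y_score.size - 1]

    # Cumulative counts of accepted positives and negatives at each threshold.
    tp = np.r_[0, np.cumsum(y_true)[last_of_run]]
    fp = np.r_[0, np.cumsum(~y_true)[last_of_run]]
    fn = tp[-1] - tp  # positives not yet accepted
    tn = fp[-1] - fp  # negatives not yet accepted
    thresholds = np.r_[np.inf, y_score[last_of_run]]
    return thresholds, tp, fp, tn, fn


# Usage: accuracy for every distinct threshold, then pick the best one.
thresholds, tp, fp, tn, fn = confusion_counts_per_threshold(
    [0, 0, 1, 1, 1], [0.1, 0.4, 0.4, 0.8, 0.8]
)
accuracy = (tp + tn) / (tp + fp + tn + fn)
best_threshold = thresholds[accuracy.argmax()]
```

Because the cost is one sort plus a few vectorized passes, every candidate threshold is evaluated exactly, with no fixed grid that could miss the optimum on imbalanced data.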
