How does caret train determine the probability threshold to maximise Specificity?

Question

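I am using caret's twoClassSummary function to find the optimal model hyperparameters for maximising Specificity. However, how does the function determine the probability threshold that maximises Specificity? For each model hyperparameter/fold, does caret evaluate every threshold between 0 and 1 and return the maximum Specificity? In the example below you can see that the model has landed on cp = 0.01492537.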

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# prepare resampling method
control <- trainControl(method="cv", 
                        number=5, 
                        classProbs=TRUE,
                        summaryFunction=twoClassSummary)

set.seed(7)
fit <- train(diabetes~., 
             data=PimaIndiansDiabetes, 
             method="rpart", 
             tuneLength= 5,
             metric="Spec", 
             trControl=control)

print(fit)


CART 

768 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 614, 614, 615, 615, 614 
Resampling results across tuning parameters:

  cp          ROC        Sens   Spec     
  0.01305970  0.7615943  0.824  0.5937806
  0.01492537  0.7712055  0.824  0.6016073
  0.01741294  0.7544469  0.830  0.5976939
  0.10447761  0.6915783  0.866  0.5035639
  0.24253731  0.6437820  0.884  0.4035639

Spec was used to select the optimal model using  the largest value.
The final value used for the model was cp = 0.01492537.

- pmanDS

You can view performance metrics at different thresholds at the following link: https://rpubs.com/phamdinhkhanh/390642 - PleaseHelp

1 Answer

- carpa_jo · Accepted Answer

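No, twoClassSummary does not evaluate every threshold between 0 and 1. It only returns the values for the standard threshold of 0.5. twoClassSummary is defined as follows: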

 function (data, lev = NULL, model = NULL) 
{
    lvls <- levels(data$obs)
    if (length(lvls) > 2) 
        stop(paste("Your outcome has", length(lvls), "levels. The twoClassSummary() function isn't appropriate."))
    requireNamespaceQuietStop("ModelMetrics")
    if (!all(levels(data[, "pred"]) == lvls)) 
        stop("levels of observed and predicted data do not match")
    rocAUC <- ModelMetrics::auc(ifelse(data$obs == lev[2], 0, 
        1), data[, lvls[1]])
    out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"], 
        lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
    names(out) <- c("ROC", "Sens", "Spec")
    out
}
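To make the 0.5 cutoff concrete, here is a minimal base-R sketch that does not use caret. The `toy` data frame and its probabilities are made-up values for illustration only. Its columns mirror what a summaryFunction receives (`obs`, `pred`, and one probability column per class). The sketch assumes the hard prediction is simply the more probable of the two classes, which for two classes amounts to a 0.5 cutoff.

```r
# Made-up held-out predictions, for illustration only.
lev <- c("neg", "pos")
toy <- data.frame(
  obs = factor(c("neg", "neg", "neg", "pos", "pos", "pos", "pos", "neg"),
               levels = lev),
  neg = c(0.91, 0.72, 0.41, 0.31, 0.62, 0.22, 0.46, 0.56)
)
toy$pos <- 1 - toy$neg

# Assumption: the hard class is the more probable class, i.e. a 0.5 cutoff.
toy$pred <- factor(ifelse(toy$neg > 0.5, "neg", "pos"), levels = lev)

# Specificity with lev[2] ("pos") as the negative class, as in the
# twoClassSummary call above: correctly predicted "pos" / observed "pos".
spec_at_half <- sum(toy$pred == "pos" & toy$obs == "pos") / sum(toy$obs == "pos")
spec_at_half  # 0.75
```

No other cutoff is tried at any point, so the reported Spec is the specificity of these hard predictions and nothing more.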

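To verify my claim, try the following example with a custom summaryFunction in which I explicitly set the threshold to 0.5. You will see that the two values Spec (the original specificity as reported by twoClassSummary) and Spec2 (specificity with the threshold manually set to 0.5) are exactly the same.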

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)

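# define custom summaryFunction
customSummary <- function (data, lev = NULL, model = NULL){
  # specificity from caret's own hard predictions, as in twoClassSummary
  spec <- specificity(data[, "pred"], data[, "obs"], lev[2])
  # specificity after re-deriving the class from P(neg) with a 0.5 cutoff;
  # fixing the levels keeps both classes even if only one is predicted
  pred <- factor(ifelse(data[, "neg"] > 0.5, "neg", "pos"), levels = lev)
  spec2 <- specificity(pred, data[, "obs"], "pos")
  out <- c(spec, spec2)

  names(out) <- c("Spec", "Spec2")
  out
}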

# prepare resampling method
control <- trainControl(method="cv", 
                        number=5, 
                        classProbs=TRUE,
                        summaryFunction=customSummary)

set.seed(7)
fit <- train(diabetes~., 
             data=PimaIndiansDiabetes, 
             method="rpart", 
             tuneLength= 5,
             metric="Spec", 
             trControl=control)

print(fit)

CART 

768 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 615, 615, 614, 614, 614 
Resampling results across tuning parameters:

  cp          Spec       Spec2    
  0.01305970  0.5749825  0.5749825
  0.01492537  0.5411600  0.5411600
  0.01741294  0.5596785  0.5596785
  0.10447761  0.4932215  0.4932215
  0.24253731  0.2837177  0.2837177

Spec was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.0130597.

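Alternatively, if you want caret to compute, for each hyperparameter setting, the maximum specificity over a range of thresholds and report that value, you can define a custom summaryFunction like the one below, which tries all thresholds from 0.1 to 0.95 in steps of 0.05: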

# define custom summaryFunction
customSummary <- function (data, lev = NULL, model = NULL){
  # specificity from caret's own hard predictions
  spec <- specificity(data[, "pred"], data[, "obs"], lev[2])
  # specificity with the cutoff on P(neg) set to 0.5 by hand
  pred <- factor(ifelse(data[, "neg"] > 0.5, "neg", "pos"), levels = lev)
  spec2 <- specificity(pred, data[, "obs"], "pos")
  # specificity at each cutoff from 0.1 to 0.95
  speclist <- numeric(0)
  for(i in seq(0.1, 0.95, 0.05)){
    predi <- factor(ifelse(data[, "neg"] > i, "neg", "pos"), levels = lev)
    singlespec <- specificity(predi, data[, "obs"], "pos")
    speclist <- c(speclist, singlespec)
  }
  specmax <- max(speclist)

  out <- c(spec, spec2, specmax)

  names(out) <- c("Spec", "Spec2", "Specmax")
  out
}
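A maximum taken over a grid does not say which cutoff produced it. With the rule `P(neg) > cutoff`, specificity for "pos" can only stay the same or rise as the cutoff rises, so it helps to look at sensitivity as well. Below is a minimal base-R sketch of such a sweep. The helper `sweep_thresholds` is hypothetical (it is not part of caret), and the `toy` values are made up for illustration.

```r
# Hypothetical helper (not part of caret): sensitivity and specificity at
# each cutoff applied to the probability of the first level.
sweep_thresholds <- function(data, lev = c("neg", "pos"),
                             thresholds = seq(0.1, 0.95, 0.05)) {
  rows <- lapply(thresholds, function(t) {
    pred <- ifelse(data[, lev[1]] > t, lev[1], lev[2])
    data.frame(
      threshold = t,
      # lev[1] is the positive class, lev[2] the negative class
      Sens = sum(pred == lev[1] & data$obs == lev[1]) / sum(data$obs == lev[1]),
      Spec = sum(pred == lev[2] & data$obs == lev[2]) / sum(data$obs == lev[2])
    )
  })
  do.call(rbind, rows)
}

# Made-up held-out predictions, for illustration only.
toy <- data.frame(
  obs = factor(c("neg", "neg", "neg", "pos", "pos", "pos", "pos", "neg"),
               levels = c("neg", "pos")),
  neg = c(0.91, 0.72, 0.41, 0.31, 0.62, 0.22, 0.46, 0.56)
)

res  <- sweep_thresholds(toy)
best <- res[which.max(res$Spec), ]  # first cutoff reaching the maximum Spec
best
```

On this toy data the first cutoff with the highest specificity is 0.65, where Spec is 1 but Sens has dropped to 0.5. To use something similar inside train(), the function would have to return a named numeric vector, as customSummary does.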