How can I use a custom metric such as AUPRC as the performance measure when fitting a GBM with the caret package?

Question

3

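I am trying to use AUPRC as a custom metric for fitting my gbm model because my classes are imbalanced. However, when I try to plug in the custom metric, I get the error shown after the code below. I'm not sure what I am doing wrong.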

Also, auprcSummary() works fine when I run it on its own, but it raises an error when I pass it to train().

    library(dplyr) # for data manipulation
    library(caret) # for model-building
    library(pROC) # for AUC calculations
    library(PRROC) # for Precision-Recall curve calculations

    auprcSummary <- function(data, lev = NULL, model = NULL){
      index_class2 <- data$Class == "Class2"
      index_class1 <- data$Class == "Class1"
      the_curve <- pr.curve(data$Class[index_class2],
                    data$Class[index_class1],
                    curve = FALSE)
      out <- the_curve$auc.integral
      names(out) <- "AUPRC"
      out
    }

    ctrl <- trainControl(method = "repeatedcv",
                 number = 10,
                 repeats = 5,
                 summaryFunction = auprcSummary,
                 classProbs = TRUE)

    set.seed(5627)
    orig_fit <- train(Class ~ .,
              data = toanalyze.train,
              method = "gbm",
              verbose = FALSE,
              metric = "AUPRC",
              trControl = ctrl)

This is the error message I get:

     Error in order(scores.class0) : argument 1 is not a vector

Is it because pr.curve() only accepts numeric vectors (scores/probabilities) as input?

- P Barman

2 Answers

1

I think this approach gives a proper custom summary function:

library(caret) 
library(pROC) 
library(PRROC)
library(mlbench) #for the data set

data(Ionosphere)

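In pr.curve, the classification scores can be supplied in one of two ways. Either separately for the data points of each class, i.e. as scores.class0 for the points from the positive/foreground class and as scores.class1 for the points from the negative/background class; or for all data points at once as scores.class0, with the labels supplied as numeric values (1 for the positive class, 0 for the negative class) in weights.class0. See the function's help page if this is unclear.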

I chose the latter approach: all probabilities go into scores.class0, and the class assignments go into weights.class0.
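To make the two calling conventions concrete, here is a minimal sketch on made-up scores and labels (the vectors `scores` and `labels` are illustrative, not from the question's data). With hard 0/1 labels and no tied scores, both forms should report the same area, as far as I understand the help page.

```r
library(PRROC)

# Toy scores and 0/1 labels, invented purely for illustration.
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c(1, 1, 1, 0, 0, 0)

# Form 1: one score vector per class.
pr_split <- pr.curve(scores.class0 = scores[labels == 1],
                     scores.class1 = scores[labels == 0],
                     curve = FALSE)

# Form 2: all scores in scores.class0, labels passed as weights.class0.
pr_all <- pr.curve(scores.class0 = scores,
                   weights.class0 = labels,
                   curve = FALSE)

# Compare the two areas side by side.
c(split = pr_split$auc.integral, all = pr_all$auc.integral)
```

The second form is the convenient one inside a caret summary function, because caret hands over all held-out rows in a single data frame.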

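According to the caret documentation, if the classProbs argument of trainControl is set to TRUE, the data passed to the summary function contains additional columns with the class probabilities. For the Ionosphere data there should therefore be two such columns, good and bad.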

levels(Ionosphere$Class)
#output
[1] "bad"  "good"

To convert this to 0/1 labels, simply do:

as.numeric(Ionosphere$Class) - 1

good becomes 1
bad becomes 0
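The conversion relies on the internal integer codes of the factor, which follow the order of its levels. A small base-R sketch on a made-up factor (`cls` is illustrative) shows this, together with a variant that compares against the level name and so does not depend on the level order:

```r
# A toy factor with the same level order as Ionosphere$Class.
cls <- factor(c("good", "bad", "bad", "good"), levels = c("bad", "good"))

# Integer codes are 1 for "bad" and 2 for "good", so subtracting 1 gives 0/1.
labels_by_code <- as.numeric(cls) - 1

# Alternative: compare against the level name directly.
labels_by_name <- as.numeric(cls == "good")

labels_by_code  # 1 0 0 1
labels_by_name  # 1 0 0 1
```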

Now we have everything the custom function needs:

auprcSummary <- function(data, lev = NULL, model = NULL){
  prob_good <- data$good #take the probability of good class
  the_curve <- pr.curve(scores.class0 = prob_good,
                        weights.class0 = as.numeric(data$obs)-1, #provide the class labels as 0/1
                        curve = FALSE)
  out <- the_curve$auc.integral
  names(out) <- "AUPRC"
  out
}

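Rather than hard-coding data$good, which only works for this data set, you can extract the class names and use them to select the required column: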

  lvls <- levels(data$obs)
  prob_good <- data[,lvls[2]]

Important: every time you change the summaryFunction, you need to re-create the trainControl object.

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 5,
                     summaryFunction = auprcSummary,
                     classProbs = TRUE)
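The reason the control object has to be rebuilt is that it holds the function as it was when the object was created. The base-R sketch below uses a plain list as a stand-in for the trainControl object (an assumption for illustration; `metric_fn` and `ctrl_like` are made-up names) to show that redefining the function afterwards does not change the stored copy:

```r
# First version of a summary-like function.
metric_fn <- function() "version 1"

# Stand-in for trainControl(): a list that stores the function object.
ctrl_like <- list(summaryFunction = metric_fn)

# Redefining the name later does not touch the copy inside the list.
metric_fn <- function() "version 2"

stored <- ctrl_like$summaryFunction()
stored  # "version 1"
```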

orig_fit <- train(y = Ionosphere$Class, x = Ionosphere[,c(1,3:34)], #omit column 2 to avoid a bunch of warnings related to the data set
                  method = "gbm",
                  verbose = FALSE,
                  metric = "AUPRC",
                  trControl = ctrl)

orig_fit$results
#output
  shrinkage interaction.depth n.minobsinnode n.trees     AUPRC    AUPRCSD
1       0.1                 1             10      50 0.9722775 0.03524882
4       0.1                 2             10      50 0.9758017 0.03143379
7       0.1                 3             10      50 0.9739880 0.03316923
2       0.1                 1             10     100 0.9786706 0.02502183
5       0.1                 2             10     100 0.9817447 0.02276883
8       0.1                 3             10     100 0.9772322 0.03301064
3       0.1                 1             10     150 0.9809693 0.02078601
6       0.1                 2             10     150 0.9824430 0.02284361
9       0.1                 3             10     150 0.9818318 0.02287886

This seems reasonable.

- missuse

- topepo · Accepted Answer

caret has a built-in function called prSummary that computes this for you. You don't need to write your own.
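A minimal sketch of how prSummary could be plugged in, reusing the Ionosphere setup from the other answer. A few things here are my understanding rather than something stated in this thread: prSummary reports its area under the precision-recall curve in a column named AUC (hence metric = "AUC"), it needs the MLmetrics package installed, and it follows caret's convention of treating the first factor level ("bad" here) as the event of interest, so the numbers are not directly comparable to a function that uses the second level.

```r
library(caret)
library(mlbench)  # for the Ionosphere data set

data(Ionosphere)

# classProbs = TRUE is required so that class probabilities are available.
pr_ctrl <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 5,
                        summaryFunction = prSummary,
                        classProbs = TRUE)

set.seed(5627)
pr_fit <- train(x = Ionosphere[, c(1, 3:34)],  # column 2 omitted, as above
                y = Ionosphere$Class,
                method = "gbm",
                verbose = FALSE,
                metric = "AUC",                # the PR AUC column from prSummary
                trControl = pr_ctrl)

pr_fit$results
```

If the positive class should be "good" instead, one option is to relevel the outcome factor so that "good" comes first before calling train().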