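I am using the caret package in R to train a radial-basis SVM for classification; in addition, a linear SVM is used for variable selection. With metric = "Accuracy" this works fine, but ultimately I care more about optimizing metric = "ROC". Although the ROC is computed for all fitted models, there seems to be a problem with aggregating the ROC values.
Here is some example code: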
library(caret)
library(kernlab)  # provides ksvm() and vanilladot, used in rankSVM() below
library(mlbench)
set.seed(0)
data(Sonar)
x <- scale(Sonar[, 1:60])
y <- as.factor(Sonar[, 61])
# Custom summary function that combines
# defaultSummary() and twoClassSummary().
# The input and output of the summary function are also printed.
svm.summary <- function(data, lev = NULL, model = NULL) {
  print(head(data, n = 3))
  a <- defaultSummary(data, lev, model)
  b <- twoClassSummary(data, lev, model)
  out <- c(a, b)
  print(out)
  out
}
fitControl <- trainControl(
  method = "cv",
  number = 2,
  classProbs = TRUE,
  summaryFunction = svm.summary,
  verboseIter = TRUE,
  allowParallel = FALSE)
# Ranking function: rank variables using a linear SVM
rankSVM <- function(object, x, y) {
  print("ranking")
  obj <- ksvm(x = as.matrix(x), y = y,
              kernel = vanilladot,
              kpar = list(), C = 10,
              scaled = FALSE)
  # Absolute weights of the linear SVM, divided by the norm of the weight vector
  w <- t(obj@coef[[1]] %*% obj@xmatrix[[1]])
  z <- abs(w) / sqrt(sum(w^2))
  ord <- order(z, decreasing = TRUE)
  data.frame(var = dimnames(z)[[1]][ord], Overall = z[ord])
}
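svmFuncs <- getModelInfo("svmRadial", regex = FALSE)
# Fit function for rfe(): tune a radial SVM with train() and return the final model
svmFit <- function(x, y, first, last, ...) {
  out <- train(x = x, y = as.factor(y),
               method = "svmRadial",
               trControl = fitControl,
               scaled = FALSE,
               metric = "Accuracy",
               maximize = TRUE,
               returnData = TRUE)
  out$finalModel
}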
selectionFunctions <- list(summary = svm.summary,
                           fit = svmFit,
                           pred = svmFuncs$svmRadial$predict,
                           prob = svmFuncs$svmRadial$prob,
                           rank = rankSVM,
                           selectSize = pickSizeBest,
                           selectVar = pickVars)
selectionControl <- rfeControl(functions = selectionFunctions,
                               rerank = FALSE,
                               verbose = TRUE,
                               method = "cv",
                               number = 2)
subsets <- c(1, 30, 60)
svmProfile <- rfe(x = x, y = y,
                  sizes = subsets,
                  metric = "Accuracy",
                  maximize = TRUE,
                  rfeControl = selectionControl)
svmProfile
The final output looks like this:
> svmProfile
Recursive feature selection
Outer resampling method: Cross-Validated (2 fold)
Resampling performance over subset size:
 Variables Accuracy  Kappa ROC   Sens   Spec AccuracySD KappaSD ROCSD  SensSD SpecSD Selected
         1   0.8075 0.6122 NaN 0.8292 0.7825    0.02981 0.06505    NA 0.06153 0.1344        *
        30   0.8028 0.6033 NaN 0.8205 0.7825    0.00948 0.02533    NA 0.09964 0.1344
        60   0.8028 0.6032 NaN 0.8206 0.7823    0.00948 0.02679    NA 0.12512 0.1635
The top 1 variables (out of 1):
   V49
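The ROC is NaN. Inspecting the output (verbose=T, with the summary function patched to print its output and part of its input), the ROC appears to be computed correctly in the inner loop, while the SVM is being tuned: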
+ Fold1: sigma=0.01172, C=0.25
  pred obs         M         R
1    M   R 0.6658878 0.3341122
2    M   R 0.5679477 0.4320523
3    R   R 0.2263576 0.7736424
 Accuracy     Kappa       ROC      Sens      Spec
0.6730769 0.3480826 0.7961310 0.6428571 0.7083333
- Fold1: sigma=0.01172, C=0.25
+ Fold1: sigma=0.01172, C=0.50
  pred obs         M         R
1    M   R 0.7841249 0.2158751
2    M   R 0.7231365 0.2768635
3    R   R 0.3033492 0.6966508
 Accuracy     Kappa       ROC      Sens      Spec
0.7692308 0.5214724 0.8407738 0.9642857 0.5416667
- Fold1: sigma=0.01172, C=0.50
[...]
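To make the role of the probability columns concrete, here is a minimal base-R sketch of a rank-based (Mann-Whitney) AUC. It is not caret's implementation; the function name aucFromProbs and the toy data are made up for illustration. It only shows that an AUC can be computed when the summary input has a column named after the first class level, and that nothing can be ranked when that column is missing.

```r
# Rank-based (Mann-Whitney) AUC from a class-probability column.
# `data` mimics the summary-function input: `obs` plus one probability
# column per class level; `lev` holds the class levels.
# Hypothetical helper, not part of caret.
aucFromProbs <- function(data, lev) {
  # Without a probability column for the first level there is nothing to rank
  if (!lev[1] %in% names(data)) return(NA_real_)
  isPos <- data$obs == lev[1]
  nPos <- sum(isPos)
  nNeg <- sum(!isPos)
  # AUC is undefined unless both classes are present
  if (nPos == 0 || nNeg == 0) return(NA_real_)
  r <- rank(data[[lev[1]]])  # average ranks handle ties
  (sum(r[isPos]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# Made-up toy input shaped like the inner-loop data (values are illustrative)
innerData <- data.frame(
  pred = factor(c("M", "M", "R", "R"), levels = c("M", "R")),
  obs  = factor(c("M", "R", "M", "R"), levels = c("M", "R")),
  M    = c(0.9, 0.6, 0.4, 0.2),
  R    = c(0.1, 0.4, 0.6, 0.8)
)
# Same rows, but with the probability columns replaced by `Variables`
outerData <- data.frame(pred = innerData$pred, obs = innerData$obs, Variables = 1)

aucFromProbs(innerData, c("M", "R"))  # 3 of 4 (M, R) pairs ranked correctly: 0.75
aucFromProbs(outerData, c("M", "R"))  # NA: no probability column available
```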
Something seems to go wrong in the outer iterations. Between the two folds we get the following:
-(rfe) fit Fold1 size: 1
  pred obs Variables
1    M   R         1
2    M   R         1
3    M   R         1
 Accuracy     Kappa       ROC      Sens      Spec
0.7864078 0.5662328        NA 0.8727273 0.6875000
  pred obs Variables
1    R   R        30
2    M   R        30
3    M   R        30
 Accuracy     Kappa       ROC      Sens      Spec
0.7961165 0.5853939        NA 0.8909091 0.6875000
  pred obs Variables
1    R   R        60
2    M   R        60
3    M   R        60
 Accuracy     Kappa       ROC      Sens      Spec
0.7961165 0.5842783        NA 0.9090909 0.6666667
+(rfe) fit Fold2 size: 60
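It looks as if the input to the summary function here contains the number of variables rather than the class probabilities, so the ROC cannot be computed or aggregated correctly. Does anyone know how to fix this? Did I forget to tell caret somewhere to output class probabilities?
Any help would be greatly appreciated: caret is a really cool package and, once this runs correctly, it will save me a lot of work.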
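One direction that might be worth trying, offered as an untested sketch: have the pred function return a data frame holding the predicted class plus one probability column per class, on the assumption that rfe() passes those extra columns on to the summary function (caret's built-in caretFuncs$pred appears to return such a data frame). The helper makeProbPred and the stub functions below are hypothetical and only illustrate the shape of the return value.

```r
# Hypothetical helper: build a `pred` function that returns the predicted
# class together with one probability column per class.
# `predictFun` and `probFun` are any functions of the form f(object, x).
makeProbPred <- function(predictFun, probFun) {
  function(object, x) {
    cbind(
      data.frame(pred = predictFun(object, x)),
      as.data.frame(probFun(object, x))
    )
  }
}

# Stand-in functions so the sketch runs without a fitted model
stubPredict <- function(object, x) factor(c("M", "R"), levels = c("M", "R"))
stubProb <- function(object, x) cbind(M = c(0.7, 0.2), R = c(0.3, 0.8))

predWithProbs <- makeProbPred(stubPredict, stubProb)
res <- predWithProbs(NULL, NULL)
names(res)  # "pred" "M" "R"

# Assumed (untested) use with the objects defined in the question:
# selectionFunctions$pred <- makeProbPred(svmFuncs$svmRadial$predict,
#                                         svmFuncs$svmRadial$prob)
```

Whether this resolves the NaN depends on how the installed caret version assembles the summary input inside rfe(), so the printed head(data) in svm.summary is the thing to check after the change.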
Thoralf