How to calculate the sensitivity and specificity of an rpart tree

library(rpart)
library(rpart.plot)  # provides prp()
train <- data.frame(ClaimID = c(1,2,3,4,5,6,7,8,9,10),
                    RearEnd = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE),
                    Whiplash = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
                    Activity = factor(c("active", "very active", "very active", "inactive", "very inactive", "inactive", "very inactive", "active", "active", "very active"),
                                      levels=c("very inactive", "inactive", "active", "very active"),
                                      ordered=TRUE),
                    Fraud = c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE))
mytree <- rpart(Fraud ~ RearEnd + Whiplash + Activity, data = train, method = "class", minsplit = 2, minbucket = 1, cp=-1)
prp(mytree, type = 4, extra = 101, leaf.round = 0, fallen.leaves = TRUE, 
    varlen = 0, tweak = 1.2)

[plot of the fitted tree produced by prp()]

Then, using printcp, I can look at the cross-validation results.

> printcp(mytree)

Classification tree:
rpart(formula = Fraud ~ RearEnd + Whiplash + Activity, data = train, 
    method = "class", minsplit = 2, minbucket = 1, cp = -1)

Variables actually used in tree construction:
[1] Activity RearEnd  Whiplash

Root node error: 5/10 = 0.5

n= 10 

    CP nsplit rel error xerror xstd
1  0.6      0       1.0    2.0  0.0
2  0.2      1       0.4    0.4  0.3
3 -1.0      3       0.0    0.4  0.3

So the root node error is 0.5, which, as I understand it, is the misclassification rate. But I'm having trouble calculating the sensitivity (the proportion of true positives) and the specificity (the proportion of true negatives). How can I calculate these values from the rpart output?
(The example above comes from http://gormanalysis.com/decision-trees-in-r-using-rpart/)
2 Answers

You can use the caret package for this:
Data:
library(rpart)
train <- data.frame(ClaimID = c(1,2,3,4,5,6,7,8,9,10),
                    RearEnd = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE),
                    Whiplash = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
                    Activity = factor(c("active", "very active", "very active", "inactive", "very inactive", "inactive", "very inactive", "active", "active", "very active"),
                                      levels=c("very inactive", "inactive", "active", "very active"),
                                      ordered=TRUE),
                    Fraud = c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE))
mytree <- rpart(Fraud ~ RearEnd + Whiplash + Activity, data = train, method = "class", minsplit = 2, minbucket = 1, cp=-1)

Solution:

library(caret)

# predicted class probabilities; column 2 is P(Fraud = TRUE)
preds <- predict(mytree, train)

# sensitivity: predictions first, observed values second
sensitivity(factor(preds[,2]), factor(as.numeric(train$Fraud)))
# [1] 1

# specificity (note: caret treats the first factor level as the
# "positive" class by default)
specificity(factor(preds[,2]), factor(as.numeric(train$Fraud)))
# [1] 1

Both sensitivity() and specificity() take the predictions as the first argument and the observed values (the response variable, i.e. train$Fraud) as the second argument.

According to the documentation, both the predictions and the observed values need to be passed to the functions as factors with the same levels.

In this case, since the predictions are 100% accurate, both the sensitivity and the specificity are 1.
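If you prefer a single call that reports both metrics at once, caret's confusionMatrix() accepts hard class labels. A minimal sketch, assuming the mytree and train objects from above (positive = "TRUE" just names which level counts as the positive class):

library(caret)

# predict class labels directly instead of probabilities
class_preds <- predict(mytree, train, type = "class")

# prints a confusion matrix plus sensitivity, specificity, accuracy, etc.
confusionMatrix(class_preds, factor(train$Fraud), positive = "TRUE")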


Thanks. But if the predictions are 100% accurate, why is the root node error 0.5? - Adrian
You're welcome. I don't know exactly how the root node error is calculated, but I believe it also involves cross-validation. The root node error is not an accuracy metric. This (simple) set has only 10 observations, and the accuracy is 100%. - LyzandeR


The root node error is the misclassification rate at the root of the tree. In other words, it is the misclassification rate before any splits have been added, not the misclassification rate of the final tree.
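You can verify this from the data: before any splits, the tree predicts the majority class for every observation, so the root node error is simply the share of the minority class. A quick check using the train data frame from the question:

table(train$Fraud)
# FALSE  TRUE
#     5     5

# misclassification rate of a majority-class prediction at the root
1 - max(table(train$Fraud)) / nrow(train)
# [1] 0.5

This matches the "Root node error: 5/10 = 0.5" line reported by printcp().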

