Error in confusionMatrix: the data and reference factors must have the same number of levels

Question

r machine-learning artificial-intelligence classification linear-regression

44

I trained a linear regression model with R caret. Now I am trying to generate a confusion matrix, but I keep getting the following error:

Error in confusionMatrix.default(pred, testing$Final) : the data and reference factors must have the same number of levels

EnglishMarks <- read.csv("E:/Subject Wise Data/EnglishMarks.csv", 
header=TRUE)
inTrain<-createDataPartition(y=EnglishMarks$Final,p=0.7,list=FALSE)
training<-EnglishMarks[inTrain,]
testing<-EnglishMarks[-inTrain,]
predictionsTree <- predict(treeFit, testdata)
confusionMatrix(predictionsTree, testdata$catgeory)
modFit<-train(Final~UT1+UT2+HalfYearly+UT3+UT4,method="lm",data=training)
pred<-format(round(predict(modFit,testing)))              
confusionMatrix(pred,testing$Final)

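The error occurs when generating the confusion matrix. The levels are the same on both objects, and I can't figure out what the problem is. Their structure and levels are shown below. They should be the same. Any help would be greatly appreciated, as this is driving me crazy!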

> str(pred)
chr [1:148] "85" "84" "87" "65" "88" "84" "82" "84" "65" "78" "78" "88" "85"  
"86" "77" ...
> str(testing$Final)
int [1:148] 88 85 86 70 85 85 79 85 62 77 ...

> levels(pred)
NULL
> levels(testing$Final)
NULL

- abcd

1

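The clue is in your str output. See how they differ? pred is of class character and testing$Final is of class integer. When you call format in pred<-format(round(predict(modFit,testing))), the result becomes character, because format returns character strings. Why are you formatting at all? You should calculate the RMSE or MAE of your model instead; see https://heuristically.wordpress.com/2013/07/12/calculate-rmse-and-mae-in-r-and-sas/ - infominer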
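The RMSE and MAE mentioned in this comment can be computed in base R without any extra package. The following is a minimal sketch; the two vectors are made-up stand-ins for testing$Final and predict(modFit, testing), not data from the question.

```r
# Hypothetical values standing in for testing$Final and predict(modFit, testing)
actual    <- c(88, 85, 86, 70, 85)
predicted <- c(85.2, 84.1, 87.4, 65.3, 88.0)

errors <- predicted - actual

rmse <- sqrt(mean(errors^2))  # root mean squared error
mae  <- mean(abs(errors))     # mean absolute error

print(c(RMSE = rmse, MAE = mae))
```

Both metrics are in the same units as the marks, so they stay interpretable without rounding the predictions.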

2

@infominer I now use pred <- as.integer(format(round(predict(modFit,testing)))) to convert the char result to int, but I still get the same error as before. I don't know what is going wrong. - abcd

8 Answers

15

confusionMatrix(pred,testing$Final)

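Whenever you build a confusion matrix, make sure that both the true values and the predicted values are of the factor data type. Here, both pred and testing$Final must be factors. Instead of checking the levels, check the type of both variables and convert them to factor if they are not. Here, testing$Final is of type int. Convert it to factor first, then build the confusion matrix.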

- sandeep patil
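A quick way to check the types described above is class() or is.factor(). The sketch below uses small made-up vectors in place of pred and testing$Final; it also shows that converting each vector on its own can still leave the two factors with different level sets, which may need to be aligned separately.

```r
# Hypothetical stand-ins for pred and testing$Final from the question
pred  <- c("85", "84", "87", "65")  # character, as produced by format()
final <- c(88L, 85L, 86L, 70L)      # integer

print(class(pred))       # type of the predictions
print(is.factor(final))  # is the reference already a factor?

# Convert both to factors
pred_f  <- as.factor(pred)
final_f <- as.factor(final)

# Both are factors now, but each one only knows the values it has seen
print(levels(pred_f))
print(levels(final_f))
print(identical(levels(pred_f), levels(final_f)))
```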

13

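Run table(pred) and table(testing$Final). You will see that at least one number in the testing set is never predicted (i.e. it never appears in pred). This is what is meant by a "different number of levels". There is an example of a custom function that works around this problem here (link).

However, I found that this trick works just as well: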

table(factor(pred, levels=min(test):max(test)), 
      factor(test, levels=min(test):max(test)))

It should give you the same confusion matrix as that function.

- nayriz
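To see which values cause the mismatch described in this answer, setdiff() can list the values that occur on one side only. This is a small sketch with made-up marks, not the data from the question.

```r
# Hypothetical rounded predictions and true marks
pred  <- c(85, 84, 87, 65, 88)
final <- c(88, 85, 86, 70, 85)

# Marks present in the reference that the model never predicted
never_predicted <- setdiff(final, pred)

# Marks the model predicted that never occur in the reference
never_observed <- setdiff(pred, final)

print(never_predicted)
print(never_observed)
```

If either result is non-empty, factors built separately from the two vectors will have different level sets.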

6

Something like the following seems to work for me. The idea is similar to @nayriz's:

confusionMatrix(
  factor(pred, levels = 1:148),
  factor(testing$Final, levels = 1:148)
)

The key is to make sure the factor levels match.

- David C.
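Note that 148 in the answer above is the number of test rows shown in the question's str() output, so 1:148 only works as a level set because every mark happens to fall inside that range. A possibly less arbitrary variant, sketched here with made-up data, derives the shared levels from the values that actually occur in either vector.

```r
# Hypothetical rounded predictions and true marks
pred  <- c(85, 85, 87, 65, 88)
final <- c(88, 85, 86, 70, 88)

# One level set covering every value seen in either vector
shared_levels <- sort(union(pred, final))

pred_f  <- factor(pred,  levels = shared_levels)
final_f <- factor(final, levels = shared_levels)

# Base R cross-tabulation; square because both factors share their levels
cm <- table(Prediction = pred_f, Reference = final_f)
print(cm)

# Assumes caret is installed; predictions go first, reference second:
# caret::confusionMatrix(pred_f, final_f)
```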

4

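When I ran into a similar error, I forced the GLM predictions to have the same class as the target variable.

For example, the GLM will predict a "numeric" class. But if the target variable is of the "factor" class, the error occurs.

Erroneous code: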

#Predicting using logistic model
glm.probs = predict(model_glm, newdata = test, type = "response")
test$pred_glm = ifelse(glm.probs > 0.5, "1", "0")


#Checking the accuracy of the logistic model
confusionMatrix(test$default,test$pred_glm)

Result:

Error: `data` and `reference` should be factors with the same levels.

Corrected code:

#Predicting using logistic model
glm.probs = predict(model_glm, newdata = test, type = "response")
test$pred_glm = ifelse(glm.probs > 0.5, "1", "0")
test$pred_glm = as.factor(test$pred_glm)

#Checking the accuracy of the logistic model
confusionMatrix(test$default,test$pred_glm)

Result:

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0   182  1317
         1   122 22335
                                          
               Accuracy : 0.9399          
                 95% CI : (0.9368, 0.9429)
    No Information Rate : 0.9873          
    P-Value [Acc > NIR] : 1

- Jeremiah Osibe
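The fix in this answer can be made a little more robust by passing the target's levels explicitly, so the predicted factor keeps both classes even if the model happens to predict only one of them. The sketch below is self-contained and uses simulated data; the variable names follow the answer, but the data frame and the model are assumptions made for illustration.

```r
set.seed(1)

# Simulated data (hypothetical): one numeric predictor, binary factor target
n <- 200
x <- rnorm(n)
test <- data.frame(
  x = x,
  default = factor(rbinom(n, 1, plogis(x)), levels = c("0", "1"))
)

model_glm <- glm(default ~ x, data = test, family = binomial)

# Predicted probabilities, thresholded into class labels
glm.probs <- predict(model_glm, newdata = test, type = "response")

# Reuse the target's levels so both factors always agree
test$pred_glm <- factor(ifelse(glm.probs > 0.5, "1", "0"),
                        levels = levels(test$default))

cm <- table(Prediction = test$pred_glm, Reference = test$default)
print(cm)

# Assumes caret is installed; its signature is confusionMatrix(data, reference),
# i.e. predictions first:
# caret::confusionMatrix(test$pred_glm, test$default)
```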

0

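We run into this error when creating a confusion matrix. Both the predicted values and the actual values need to be of the "factor" data type. If they are of any other data type, they must be converted to "factor" before the confusion matrix is generated. After this conversion, build the confusion matrix.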

pridicted <- factor(predict(treeFit, testdata))
real <- factor(testdata$catgeory)
my_data1 <- data.frame(data = pridicted, type = "prediction")
my_data2 <- data.frame(data = real, type = "real")
# rbind() on data frames merges the factor levels of both columns
my_data3 <- rbind(my_data1, my_data2)
# Check if the levels are identical
identical(levels(my_data3[my_data3$type == "prediction", 1]),
          levels(my_data3[my_data3$type == "real", 1]))
confusionMatrix(my_data3[my_data3$type == "prediction", 1],
                my_data3[my_data3$type == "real", 1], dnn = c("Prediction", "Reference"))

- zaid

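When linking to your own site or content (or content that you are affiliated with), you must disclose your affiliation in the answer so that it is not considered spam. Having the same text in your username as in the URL, or mentioning it in your profile, is not considered sufficient disclosure under Stack Exchange policy. - cigien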

0

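I had this problem because the target variable in my dataset contained NAs. If you are using the tidyverse, you can use the drop_na function to remove the rows that contain NAs. Like this: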

iris %>% drop_na(Species) # Removes rows where Species column has NA
iris %>% drop_na() # Removes rows where any column has NA

In base R, it might look something like this:

iris[! is.na(iris$Species), ] # Removes rows where Species column has NA
na.omit(iris) # Removes rows where any column has NA

- titaniumtroop
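When the predictions and the reference are held in two separate vectors rather than in one data frame, the same idea can be applied by dropping every position where either side is NA, so the two vectors stay aligned. This is a base R sketch with made-up factors.

```r
# Hypothetical predictions and reference, each with a missing value
pred  <- factor(c("a", "b", NA,  "a", "b"))
final <- factor(c("a", NA,  "b", "a", "a"))

# Keep only positions where both values are present
keep <- !is.na(pred) & !is.na(final)

pred_clean  <- pred[keep]
final_clean <- final[keep]

print(table(Prediction = pred_clean, Reference = final_clean))
```

Filtering both vectors with the same logical index avoids shifting one relative to the other.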

-4

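You are using regression and trying to generate a confusion matrix. I believe a confusion matrix is meant for classification tasks. Generally people use the R^2 and RMSE metrics.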

- user4959

2

Regression can also be used for classification tasks. - Pedram

As long as it has two classes. - GaB
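If a confusion matrix is still wanted for a regression model, one option is to bin both the true and the predicted marks into a few classes and cross-tabulate those. The sketch below uses cut() from base R; the marks, the band boundaries, and the band labels are all assumptions chosen for illustration.

```r
# Hypothetical true marks and continuous predictions
actual    <- c(88, 85, 62, 70, 91)
predicted <- c(85.2, 84.1, 65.3, 58.9, 89.6)

# Hypothetical grade bands: [0, 60], (60, 80], (80, 100]
breaks <- c(0, 60, 80, 100)
bands  <- c("low", "mid", "high")

# The same breaks and labels on both sides give identical factor levels
actual_band <- cut(actual, breaks = breaks, labels = bands,
                   include.lowest = TRUE)
pred_band   <- cut(predicted, breaks = breaks, labels = bands,
                   include.lowest = TRUE)

cm <- table(Prediction = pred_band, Reference = actual_band)
print(cm)
```

With only a few bands, every class is likely to appear on both sides, which sidesteps the level mismatch that arises when each distinct mark is treated as its own class.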

- Cenk ŞİMŞEK · Accepted Answer

I had the same problem. I guess it happened because the data argument was not converted to a factor as I had expected. Try:

confusionMatrix(pred,as.factor(testing$Final))

Hope it helps.