In a real-world project I am trying to predict vehicle health from a large number of fault codes.
Some fault codes (levels of the factor) occur very frequently (1000+), while others show up only two or three times. It is often precisely these rarely occurring codes that are "perfect predictors" of the health state (always 0 or always 1). I am looking for a reliable statistical way to determine which factor levels are good (significant) predictors, so that rare-but-good predictor codes are not discarded merely because of their rarity.
Data creation
library(tidyverse)
n_small = 4
n_big = 100
set.seed(567)
df_big_1 <- data.frame(class = rep("A", n_big),
                       health = rbinom(n = n_big, size = 1, prob = .4))
df_small_1 <- data.frame(class = rep("B", n_small),
                         health = rbinom(n = n_small, size = 1, prob = 1))
df_small_2 <- data.frame(class = rep("C", n_small),
                         health = rbinom(n = n_small, size = 1, prob = 1))
df_big_2 <- data.frame(class = rep("D", n_big),
                       health = rbinom(n = n_big, size = 1, prob = .4))
df_big_3 <- data.frame(class = rep("E", n_big),
                       health = rbinom(n = n_big, size = 1, prob = .4))
df_data <- rbind(df_big_1 ,df_small_1, df_big_2, df_small_2, df_big_3)
df_data <- df_data %>% mutate(class = factor(class))
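As a quick sanity check on the constructed data, the group sizes can be tabulated; classes A, D and E should have 100 rows each, B and C only 4:

```r
table(df_data$class)
#>   A   B   C   D   E
#> 100   4   4 100 100
```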
Data inspection
df_data %>%
group_by(class) %>%
summarise(N_health = sum(health), Mean = mean(health))
# A tibble: 5 × 3
class N_health Mean
<fct> <int> <dbl>
1 A 36 0.36
2 B 4 1
3 C 4 1
4 D 40 0.4
5 E 40 0.4
(Binary) logistic regression
When I run a binary logistic regression on this simplified data set, I fail to recover the rare but "perfect" predictors:
regmod_01 <- glm(health ~ class, family = binomial, data = df_data)
summary(regmod_01)
Call:
glm(formula = health ~ class, family = binomial, data = df_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0108 -1.0108 -0.9448 1.3537 1.4294
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5754 0.2083 -2.762 0.00575 **
classB 17.1414 1199.7724 0.014 0.98860
classC 17.1414 1199.7724 0.014 0.98860
classD 0.1699 0.2917 0.583 0.56022
classE 0.1699 0.2917 0.583 0.56022
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 415.22 on 307 degrees of freedom
Residual deviance: 399.89 on 303 degrees of freedom
AIC: 409.89
Number of Fisher Scoring iterations: 15
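The inflated estimates (≈17) and enormous standard errors (≈1200) for classB and classC, together with the unusually high number of Fisher scoring iterations, are the classic signature of quasi-complete separation: health equals 1 for every observation in those classes, so the maximum-likelihood estimates diverge and the Wald p-values become meaningless. A minimal check for this, reusing the tidyverse already loaded above:

```r
# Flag levels where the outcome is constant (perfectly separating levels)
df_data %>%
  group_by(class) %>%
  summarise(n = n(), separated = all(health == 1) | all(health == 0))
```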
Are there other approaches I could try to distinguish good predictors from bad ones while still retaining the levels that occur only rarely?
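For reference, one commonly suggested remedy for separation is Firth's penalized-likelihood logistic regression, which keeps the coefficient estimates finite even for perfectly separating levels. A minimal sketch, assuming the logistf package is installed (this is one candidate, not a definitive answer):

```r
library(logistf)  # Firth's bias-reduced logistic regression

# Same model formula as the glm above, but with Firth's penalty,
# so classB and classC get finite estimates and usable p-values
regmod_firth <- logistf(health ~ class, data = df_data)
summary(regmod_firth)
```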