从回归估计中提取交互项

8
这是一个简单的问题,但我找不到任何清晰而令人信服的答案。如果我有一个带有一个或多个交互项的回归模型,例如:
mod1 <- lm(mpg ~ factor(cyl) * factor(am), data = mtcars)
coef(summary(mod1))
##                           Estimate Std. Error   t value              Pr(>|t|)
## (Intercept)              22.900000   1.750674 13.080673 0.0000000000006057324
## factor(cyl)6             -3.775000   2.315925 -1.630018 0.1151545663620229670
## factor(cyl)8             -7.850000   1.957314 -4.010599 0.0004547582690011110
## factor(am)1               5.175000   2.052848  2.520888 0.0181760532676256310
## factor(cyl)6:factor(am)1 -3.733333   3.094784 -1.206331 0.2385525615801434851
## factor(cyl)8:factor(am)1 -4.825000   3.094784 -1.559075 0.1310692573417492068

有没有一种确定交互项系数估计的绝对方法?显而易见的方法是在术语名称中使用grep()查找冒号符号。但让我们假设由于某种原因无法这样做:

mtcars$cyl2 <- factor(mtcars$cyl, levels = c(4,6,8), labels = paste("Cyl:", unique(mtcars$cyl)))
mod2 <- lm(mpg ~ cyl2 * factor(am), data = mtcars)
##                         Estimate Std. Error   t value              Pr(>|t|)
## (Intercept)            22.900000   1.750674 13.080673 0.0000000000006057324
## cyl2Cyl: 4             -3.775000   2.315925 -1.630018 0.1151545663620229670
## cyl2Cyl: 8             -7.850000   1.957314 -4.010599 0.0004547582690011110
## factor(am)1             5.175000   2.052848  2.520888 0.0181760532676256310
## cyl2Cyl: 4:factor(am)1 -3.733333   3.094784 -1.206331 0.2385525615801434851
## cyl2Cyl: 8:factor(am)1 -4.825000   3.094784 -1.559075 0.1310692573417492068

我认为terms()对象可能有用,但事实并非如此。我也可以根据术语的排序/编号做出一些假设,以获得期望的结果:

coef(summary(mod2))[5:6,]
##                         Estimate Std. Error   t value  Pr(>|t|)
## cyl2Cyl: 4:factor(am)1 -3.733333   3.094784 -1.206331 0.2385526
## cyl2Cyl: 8:factor(am)1 -4.825000   3.094784 -1.559075 0.1310693

但我不知道如何以一般的方式做到这一点。 有什么可以做的吗?
6个回答

4
这似乎有些复杂,但我们能不能列出所有主要影响,然后取集合差异呢?
mod2 <- lm(mpg ~ cyl2 * factor(am) + wt * disp, data = mtcars)
variables <- labels(mod2)[attr(terms(mod2), "order") == 1]
factors <- sapply(names(mod2$xlevels), function(x) paste0(x, mod2$xlevels[[x]])[-1])
setdiff(colnames(model.matrix(mod2)), c("(Intercept)", variables, unlist(factors)))
# [1] "cyl2Cyl: 4:factor(am)1" "cyl2Cyl: 8:factor(am)1" "wt:disp" 

3
一种可能的一行代码解决方案是:
coef(summary(mod2))[1L + which(colSums(attr(terms(mod2), "factors")[,attr(model.matrix(mod2), "assign")[-1L]]) == 2L),]
##                         Estimate Std. Error   t value  Pr(>|t|)
## cyl2Cyl: 4:factor(am)1 -3.733333   3.094784 -1.206331 0.2385526
## cyl2Cyl: 8:factor(am)1 -4.825000   3.094784 -1.559075 0.1310693

这个代码看起来有点凌乱,但基本上它是在使用模型矩阵的"assign"属性:

attr(model.matrix(mod2), "assign")
## [1] 0 1 1 2 3 3

获取交互项系数的名称索引(off-by-one):

which(colSums(attr(terms(mod2), "factors")[,attr(model.matrix(mod2), "assign")[-1L]]) == 2L)
## cyl2:factor(am) cyl2:factor(am) 
##               4               5

可以使用coef()从中提取相关行。这是因为系数的顺序由设计矩阵中的列顺序和"assign"属性所规定,该属性将这些列映射回原始公式中的项。

虽然不是非常干净,但似乎有效。


*PS. 在此示例中,仅提取二阶交互项。更改== 2L>= 2L或其他内容可允许获取更高阶交互作用。


这不支持多个交互作用,例如 mod2 <- lm(mpg ~ cyl2*factor(am) + factor(gear):factor(vs) + cut(wt, 2)*factor(carb), mtcars),尽管我还没有追踪原因。 - alistaire

3
我在coefplot中编写了一个可以实现这个功能的工具。

#' @title matchCoefs.default
#' @description Match coefficients to predictors
#' @details Matches coefficients to predictors using information from model matrices
#' @author Jared P. Lander
#' @aliases matchCoefs.default
#' @import reshape2
#' @param model Fitted model
#' @param \dots Further arguments
#' @return a data.frame matching predictors to coefficients
matchCoefs.default <- function(model, ...)
{
    # get the terms
    theTerms <- model$terms
    # get the assignment position
    #thePos <- model$assign
    thePos <- get.assign(model)
    # get intercept indicator
    inter <- attr(theTerms, "intercept")
    # get coef names
    coefNames <- names(stats::coef(model))
    # get pred names
    predNames <- attr(theTerms, "term.labels")
    # expand out pred names to match coefficient names
    predNames <- predNames[thePos]
    # if there's an intercept term add it to the pred names
    if(inter == 1)
    {
        predNames <- c("(Intercept)", predNames)
    }

    # build data.frame linking term to coefficient name
    matching <- data.frame(Term=predNames, Coefficient=coefNames, stringsAsFactors=FALSE)

    ## now match individual predictor to term
    # get matrix as data.frame
    factorMat <- as.data.frame(attr(theTerms, "factor"))
    # add column from rownames as identifier
    factorMat$.Pred <- rownames(factorMat)
    factorMat$.Type <- attr(theTerms, "dataClasses")

    # melt it down for comparison
    factorMelt <- melt(factorMat, id.vars=c(".Pred", ".Type"), variable.name="Term")
    factorMelt$Term <- as.character(factorMelt$Term)

    # only keep rows where there's a match
    factorMelt <- factorMelt[factorMelt$value == 1, ]

    # again, bring in coefficient if needed
    if(inter == 1)
    {
        factorMelt <- rbind(data.frame(.Pred="(Intercept)", .Type="(Intercept)", Term="(Intercept)", value=1, stringsAsFactors=FALSE), factorMelt)
    }

    # join into the matching data.frame
    matching <- join(matching, factorMelt, by="Term")

    return(matching)
}

然后您可以调用matchCoefs.default(mod1)


1
我会选择一些整洁的东西:
library(tidyverse)
library(broom)
mtcars$cyl2 <- factor(mtcars$cyl, levels = c(4,6,8), labels = paste("Cyl:", unique(mtcars$cyl)))
mod2 <- lm(mpg ~ cyl2 * factor(am), data = mtcars)
mod2 %>% tidy %>% as_tibble %>% filter(term %>% str_detect("\\w{1}:\\w{1}")) %>% pull(estimate)
[1] -3.733333 -4.825000

更新

如果在冒号后面没有空格,上述方法会失败,可以通过以下方式解决:

# Look for factor variables and replace ':' with '_'
clean_colons = function(x){
  x %>% as_tibble %>%
    mutate_if(is.factor, function(x){str_replace_all(x,":","_")}) %>% 
    return()
}

# Define the tricky cyl2 varible
mtcars$cyl2 = factor(mtcars$cyl, levels = c(4,6,8),
                     labels = paste0("Cyl:", unique(mtcars$cyl)))

# Create the model and output parameters
mtcars %>% clean_colons %>% lm(mpg ~ cyl2 * factor(am), data = .) %>% tidy %>%
  as_tibble
# A tibble: 6 x 5
  term                  estimate std.error statistic  p.value
  <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)              19.1       1.52    12.6   1.38e-12
2 cyl2Cyl_6                 3.77      2.32     1.63  1.15e- 1
3 cyl2Cyl_8                -4.08      1.75    -2.33  2.80e- 2
4 factor(am)1               1.44      2.32     0.623 5.39e- 1
5 cyl2Cyl_6:factor(am)1     3.73      3.09     1.21  2.39e- 1
6 cyl2Cyl_8:factor(am)1    -1.09      3.28    -0.333 7.42e- 1

# Now we no longer have the colon-in-variable-name problem, so we can do
mtcars %>% clean_colons %>% lm(mpg ~ cyl2 * factor(am), data = .) %>% tidy %>%
  as_tibble %>% filter(term %>% str_detect("\\w{1}:\\w{1}")) %>% pull(estimate)
[1]  3.733333 -1.091667

如果因子标签中没有空格(例如,“Cyl:4”),会怎样? - Thomas

1
这也很复杂,但至少你可以明确地追踪自己正在做什么:
# require package
require(Hmisc)

# load and process data
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)

# fit model 
m <- lm(mpg ~ hp*wt + disp + cyl*gear, data=mtcars)

# extract all variables used in right hand side of the model 
variables <- as.character(attr(terms(m), "variables"))[-1L] 

# extract model coefficients 
s <- summary(m)$coeff
s

# discard unwanted rows corresponding to continuous predictors   
s1 <- s[rownames(s) %nin% variables[-1], ]
s1

# discard unwanted rows corresponding to categorical predictor gear
s2 <- s1[rownames(s1) %nin% paste0("gear",levels(mtcars$gear)), ]
s2

输出将如下所示:

> s
            Estimate  Std. Error    t value     Pr(>|t|)
(Intercept) 43.80219399 5.977952455  7.3272905 4.404820e-07
hp          -0.11303405 0.040769042 -2.7725462 1.174817e-02
wt          -6.76725414 1.868428113 -3.6218970 1.699630e-03
disp        -0.00332244 0.014012039 -0.2371133 8.149808e-01
cyl6         2.87780626 3.180169670  0.9049222 3.762789e-01
cyl8         2.76416050 3.605957805  0.7665538 4.523007e-01
gear4        3.66877217 2.529082701  1.4506335 1.623857e-01
gear5        4.25400864 2.931068582  1.4513508 1.621881e-01
hp:wt        0.02401629 0.009953734  2.4127921 2.555079e-02
cyl6:gear4  -4.65994590 3.331261825 -1.3988531 1.771739e-01
cyl6:gear5  -3.86789967 4.400309909 -0.8790062 3.898369e-01
cyl8:gear5  -2.08842224 3.980538289 -0.5246582 6.055881e-01

> s1
           Estimate  Std. Error    t value     Pr(>|t|)
(Intercept) 43.80219399 5.977952455  7.3272905 4.404820e-07
cyl6         2.87780626 3.180169670  0.9049222 3.762789e-01
cyl8         2.76416050 3.605957805  0.7665538 4.523007e-01
gear4        3.66877217 2.529082701  1.4506335 1.623857e-01
gear5        4.25400864 2.931068582  1.4513508 1.621881e-01
hp:wt        0.02401629 0.009953734  2.4127921 2.555079e-02
cyl6:gear4  -4.65994590 3.331261825 -1.3988531 1.771739e-01
cyl6:gear5  -3.86789967 4.400309909 -0.8790062 3.898369e-01
cyl8:gear5  -2.08842224 3.980538289 -0.5246582 6.055881e-01

> s2
               Estimate  Std. Error    t value     Pr(>|t|)
(Intercept) 43.80219399 5.977952455  7.3272905 4.404820e-07
cyl6         2.87780626 3.180169670  0.9049222 3.762789e-01
cyl8         2.76416050 3.605957805  0.7665538 4.523007e-01
hp:wt        0.02401629 0.009953734  2.4127921 2.555079e-02
cyl6:gear4  -4.65994590 3.331261825 -1.3988531 1.771739e-01
cyl6:gear5  -3.86789967 4.400309909 -0.8790062 3.898369e-01
cyl8:gear5  -2.08842224 3.980538289 -0.5246582 6.055881e-01

可以循环遍历所有分类预测器以自动执行最后两个步骤。


1
另一种方法(也有点复杂)是通过协变量名称并查看其他协变量名称出现的次数来进行:
mtcars$cyl2 <- factor(mtcars$cyl, levels = c(4,6,8), labels = paste("Cyl:", unique(mtcars$cyl)))
mod2 <- lm(mpg ~ cyl2 * factor(am), data = mtcars)
coef(summary(mod2))
#>                         Estimate Std. Error   t value     Pr(>|t|)
#> (Intercept)            22.900000   1.750674 13.080673 6.057324e-13
#> cyl2Cyl: 4             -3.775000   2.315925 -1.630018 1.151546e-01
#> cyl2Cyl: 8             -7.850000   1.957314 -4.010599 4.547583e-04
#> factor(am)1             5.175000   2.052848  2.520888 1.817605e-02
#> cyl2Cyl: 4:factor(am)1 -3.733333   3.094784 -1.206331 2.385526e-01
#> cyl2Cyl: 8:factor(am)1 -4.825000   3.094784 -1.559075 1.310693e-01

# Essentially an ugly double-loop
check.inter <- function(coefs){
  which(lapply(coefs, function(co){
         sum(unlist(lapply(coefs, function(co2){
                    grepl(co2, co, fixed = T)})))})>1)}

check.inter(names(coefficients(mod2)))
#> [1] 5 6

coef(summary(mod2))[check.inter(names(coefficients(mod2))),]
#>                         Estimate Std. Error   t value  Pr(>|t|)
#> cyl2Cyl: 4:factor(am)1 -3.733333   3.094784 -1.206331 0.2385526
#> cyl2Cyl: 8:factor(am)1 -4.825000   3.094784 -1.559075 0.1310693

这段内容是由reprex package(v0.2.0)创建于2018年5月17日。


很不错!但是如果用户只指定交互项而没有主效应的情况呢,例如 mod2 <- lm(mpg ~ cyl2 : factor(am), data = mtcars) - Weihuang Wong
2
这是个好观点!我没有考虑到那些边缘情况。在我想出解决方案之前,让我们说这是一个只与健全的统计实践相兼容的答案 ;) - gfgm

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接