All levels of a factor in a model matrix in R

Question


76

I have a data.frame consisting of numeric and factor variables, as shown below.

testFrame <- data.frame(First=sample(1:10, 20, replace=TRUE),
           Second=sample(1:20, 20, replace=TRUE), Third=sample(1:10, 20, replace=TRUE),
           Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
           Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4),
           stringsAsFactors=TRUE)  # needed on R >= 4.0 so Fourth and Fifth are factors

I want to build a matrix that assigns dummy variables to the factors and leaves the numeric variables alone.

model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame)

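As expected when running lm, this leaves out one level of each factor as the reference level. However, I want to build a matrix with a dummy/indicator variable for every level of every factor. I am building this matrix for glmnet, so I am not worried about multicollinearity.

Is there a way to have model.matrix create a dummy variable for every level of each factor?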

- Jared

11 Answers

55

You need to reset the contrasts for the factor variables:

model.matrix(~ Fourth + Fifth, data=testFrame, 
        contrasts.arg=list(Fourth=contrasts(testFrame$Fourth, contrasts=FALSE), 
                Fifth=contrasts(testFrame$Fifth, contrasts=FALSE)))

Or, with a little less typing but without the proper column names:

model.matrix(~ Fourth + Fifth, data=testFrame, 
    contrasts.arg=list(Fourth=diag(nlevels(testFrame$Fourth)), 
            Fifth=diag(nlevels(testFrame$Fifth))))
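
As a quick check of the two variants above, here is a minimal sketch on a small made-up data frame (the name `df` and its values are illustrative assumptions, not part of the question). Both calls should give the same numbers, and only the column names should differ.

```r
# Illustrative data; `df` and its values are assumptions, not from the question.
df <- data.frame(
  Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)),
  Fifth  = factor(rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
)

# Variant 1: full (identity) contrasts that keep the level names.
m_named <- model.matrix(~ Fourth + Fifth, data = df,
  contrasts.arg = list(Fourth = contrasts(df$Fourth, contrasts = FALSE),
                       Fifth  = contrasts(df$Fifth,  contrasts = FALSE)))

# Variant 2: bare identity matrices; columns are numbered instead of named.
m_diag <- model.matrix(~ Fourth + Fifth, data = df,
  contrasts.arg = list(Fourth = diag(nlevels(df$Fourth)),
                       Fifth  = diag(nlevels(df$Fifth))))

dim(m_named)        # 20 rows; intercept + 4 + 5 = 10 columns
colnames(m_named)   # level names appear in the column names
colnames(m_diag)    # level names are replaced by indices
```

The first variant is usually easier to read back after fitting, because the coefficient names still carry the factor levels.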

- fabians

14

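Works perfectly, and I will accept this answer, but if I have 20 factors to enter, is there a general way to apply this to every variable in the frame, or am I doomed to a lot of typing? - Jared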

20

caret implements a nice function, dummyVars, that does this in two lines:

library(caret)
dmy <- dummyVars(" ~ .", data = testFrame)
testFrame2 <- data.frame(predict(dmy, newdata = testFrame))

Check the final columns:

colnames(testFrame2)

"First"  "Second"         "Third"          "Fourth.Alice"   "Fourth.Bob"     "Fourth.Charlie" "Fourth.David"   "Fifth.Edward"   "Fifth.Frank"   "Fifth.Georgia"  "Fifth.Hank"     "Fifth.Isaac"

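The nicest point here is that you get the original data frame plus the dummy variables, with the original variables that were transformed excluded.

More info: http://amunategui.github.io/dummyVar-Walkthrough/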

- Pablo Casas

12

caret's dummyVars can also be used. See: http://caret.r-forge.r-project.org/preprocess.html.

- Sagar Jauhari

It looks nice, but it does not include the intercept, and I can't seem to force it to. - Jared

2

@jared: This works for me. Example:

require(caret); (df <- data.frame(x1=c('a','b'), x2=1:2)); dummies <- dummyVars(x2~ ., data = df); predict(dummies, newdata = df)

- Andrew

1

@Jared, you don't need an intercept when you have a dummy variable for every level of a factor. - Will Townes
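
The point in this comment can be illustrated with a small sketch (the factor `f` and its values are made up for illustration). When every level has its own indicator column, the columns sum to a column of ones, so an explicit intercept adds no new information to the matrix.

```r
# Illustrative factor; the name `f` and its values are assumptions.
f <- factor(c("a", "b", "c", "a", "b", "c"))

# One indicator column per level, no intercept.
dummies <- model.matrix(~ f - 1)

# Each row contains exactly one 1, so the columns add up to the intercept column.
all(rowSums(dummies) == 1)

# Binding an explicit intercept column does not increase the rank.
with_intercept <- cbind(1, dummies)
qr(dummies)$rank
qr(with_intercept)$rank
```

For glmnet this redundancy is usually harmless: the penalty resolves it, and glmnet fits its own intercept by default through its `intercept` argument.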

1

@Jared: Here is code that adds the intercept column:

require(caret); (df <- data.frame(x1=c('a','b'), x2=1:2)); dummies <- dummyVars(x2~ ., data = df); predict(dummies, newdata = df); cbind(1, predict(dummies, newdata = df))

- MYaseen208

3

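OK. Just reading the above and putting it all together. Suppose you want the matrix, e.g. 'X.factors', that multiplies your coefficient vector to give the linear predictor. There are still a couple of extra steps: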

X.factors = 
  model.matrix( ~ ., data=X, contrasts.arg = 
    lapply(data.frame(X[,sapply(data.frame(X), is.factor)]),
                                             contrasts, contrasts = FALSE))

(Note that you need to convert X[*] back into a data frame in case you have only one factor column.)

Then suppose you get something like this:

attr(X.factors,"assign")
[1]  0  1  **2**  2  **3**  3  3  **4**  4  4  5  6  7  8  9 10 #emphasis added

We want to get rid of the **'d reference levels of each factor:

att = attr(X.factors,"assign")
factor.columns = unique(att[duplicated(att)])
unwanted.columns = match(factor.columns,att)
X.factors = X.factors[,-unwanted.columns]
X.factors = (data.matrix(X.factors))

- user36302

1

By the way, why isn't this built into base R? I need it every time I run a simulation. - user36302

3

A tidyverse answer:

library(dplyr)
library(tidyr)
result <- testFrame %>% 
    mutate(one = 1) %>% spread(Fourth, one, fill = 0, sep = "") %>% 
    mutate(one = 1) %>% spread(Fifth, one, fill = 0, sep = "")

This yields the desired result (the same as @Gavin Simpson's answer):

> head(result, 6)
  First Second Third FourthAlice FourthBob FourthCharlie FourthDavid FifthEdward FifthFrank FifthGeorgia FifthHank FifthIsaac
1     1      5     4           0         0             1           0           0          1            0         0          0
2     1     14    10           0         0             0           1           0          0            1         0          0
3     2      2     9           0         1             0           0           1          0            0         0          0
4     2      5     4           0         0             0           1           0          1            0         0          0
5     2     13     5           0         0             1           0           1          0            0         0          0
6     2     15     7           1         0             0           0           1          0            0         0          0

- shosaco

2

Using the R package 'CatEncoders':

library(CatEncoders)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
           Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
           Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
           Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4))

fit <- OneHotEncoder.fit(testFrame)

z <- transform(fit,testFrame,sparse=TRUE) # give the sparse output
z <- transform(fit,testFrame,sparse=FALSE) # give the dense output

- asdf123

2

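I wrote a package called ModelMatrixModel to improve on the functionality of model.matrix(). By default, the ModelMatrixModel() function in the package returns a class containing a sparse matrix with all levels of the dummy variables, which is suitable as input to cv.glmnet() in the glmnet package. Importantly, the returned class also stores the transformation parameters, such as the factor level information, which can then be applied to new data. The function can handle most items in an R formula, such as poly() and interaction. It also offers several other options, such as handling invalid factor levels and scaling the output.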

#devtools::install_github("xinyongtian/R_ModelMatrixModel")
library(ModelMatrixModel)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
                        Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
                        Fourth=rep(c("Alice","Bob","Charlie","David"), 5))
newdata=data.frame(First=sample(1:10, 2, replace=T),
                   Second=sample(1:20, 2, replace=T), Third=sample(1:10, 2, replace=T),
                   Fourth=c("Bob","Charlie"))
mm=ModelMatrixModel(~First+Second+Fourth, data = testFrame)
class(mm)
## [1] "ModelMatrixModel"
class(mm$x) #default output is sparse matrix
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
data.frame(as.matrix(head(mm$x,2)))
##   First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1     7     17           1         0             0           0
## 2     9      7           0         1             0           0

#apply the same transformation to new data, note the dummy variables for 'Fourth' includes the levels not appearing in new data     
mm_new=predict(mm,newdata)
data.frame(as.matrix(head(mm_new$x,2))) 
##   First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1     6      3           0         1             0           0
## 2     2     12           0         0             1           0

- Ben2018

2

model.matrix(~ First + Second + Third + Fourth + Fifth - 1, data=testFrame)

or

model.matrix(~ First + Second + Third + Fourth + Fifth + 0, data=testFrame)

should be the most straightforward.

- Federico Rotolo

This works nicely if there is only one factor, but with multiple factors the reference levels are still omitted. - Gregor Thomas
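
The behavior described in this comment can be seen in a short sketch (the data frame `df` below is an illustrative assumption). Dropping the intercept gives full indicator coding only to the first factor in the formula; later factors still lose their reference level.

```r
# Illustrative data; `df` and its values are assumptions, not from the question.
df <- data.frame(
  x      = 1:20,
  Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)),
  Fifth  = factor(rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
)

# No intercept: Fourth keeps all 4 levels, Fifth keeps only 4 of its 5.
m <- model.matrix(~ x + Fourth + Fifth - 1, data = df)

colnames(m)
"FourthAlice" %in% colnames(m)   # first level of the first factor is present
"FifthEdward" %in% colnames(m)   # first level of the second factor is dropped
```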

2

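I am currently learning about the lasso model and glmnet::cv.glmnet(), model.matrix(), and Matrix::sparse.model.matrix() (for high-dimensional matrices, model.matrix is time-consuming, as the authors of glmnet point out).

Just sharing that there is a tidy way of coding this that gives the same result as @fabians' and @Gavin's answers. Meanwhile, @asdf123 also introduced another package, library('CatEncoders').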

> require('useful')
> # always use all levels
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = FALSE)
> 
> # just use all levels for Fourth
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = c(Fourth = FALSE, Fifth = TRUE))

Source: R for Everyone: Advanced Analytics and Graphics (page 273)

- Rγσ ξηg Lιαη Ημ 雷欧

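Thank you for the answer. The funny thing is that I wrote the build.x function, and it was made possible by the answers from @fabiens and @gavin! And that's my book! So cool that this has come full circle. Thanks for reading! - Jared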


- Gavin Simpson · Accepted Answer

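(Trying to redeem myself...) In response to Jared's comment on @Fabians' answer about automating it: note that all you need to supply is a named list of contrast matrices. contrasts() takes a vector/factor and produces the contrasts matrix from it. So we can use lapply() to run contrasts() on each factor in the data set, e.g. for the testFrame example provided: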

> lapply(testFrame[,4:5], contrasts, contrasts = FALSE)
$Fourth
        Alice Bob Charlie David
Alice       1   0       0     0
Bob         0   1       0     0
Charlie     0   0       1     0
David       0   0       0     1

$Fifth
        Edward Frank Georgia Hank Isaac
Edward       1     0       0    0     0
Frank        0     1       0    0     0
Georgia      0     0       1    0     0
Hank         0     0       0    1     0
Isaac        0     0       0    0     1

which slots nicely into @fabians' answer:

model.matrix(~ ., data=testFrame, 
             contrasts.arg = lapply(testFrame[,4:5], contrasts, contrasts=FALSE))
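
The same idea can be wrapped so that the factor columns are detected rather than hard-coded as 4:5. This is only a sketch: the helper name `full_model_matrix` is hypothetical and not part of any package, and it assumes the categorical columns are already factors (character columns are not detected).

```r
# Hypothetical helper; `full_model_matrix` is an illustrative name, not a package function.
full_model_matrix <- function(data, formula = ~ .) {
  is_fac <- vapply(data, is.factor, logical(1))
  # With no factor columns there is nothing to recode.
  if (!any(is_fac)) return(model.matrix(formula, data = data))
  # drop = FALSE keeps a named data frame even when there is a single factor.
  full <- lapply(data[, is_fac, drop = FALSE], contrasts, contrasts = FALSE)
  model.matrix(formula, data = data, contrasts.arg = full)
}

# Illustrative data; values are assumptions.
set.seed(1)
df <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)),
  Fifth  = factor(rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
)

X <- full_model_matrix(df)
colnames(X)   # intercept, First, then one column per level of each factor
```

Using `drop = FALSE` matters in the single-factor case, because the list passed to contrasts.arg must stay named by the factor column.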