How do I group factor levels?

7

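I have a factor column containing football position abbreviations, with about 17 unique values and 220 observations. I want to end up with only three factor levels that together cover those 17 unique values.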

levels(nfldraft$Pos) <- list(Linemen = c("C","OG","OT","TE","DT","DE"),
                             Small_Backs =  c("CB","WR","FS"), 
                             Big_Backs = c("FB","ILB","OLB","P","QB",
                                           "RB","SS","WR"))

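I tried the code above. Printing nfldraft$Pos to the console shows 3 factor levels, but the only values present are "Linemen" and "Small_Backs"; everything else is NA. What am I doing wrong?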


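Please provide a reproducible example and the expected output. - akrun
3
Factor levels can only be a one-dimensional vector, not a list. - alistaire
WR is assigned to two categories. - IRTFM
1
Without a reproducible example it's hard to say what went wrong, but 42's guess is a good answer. @alistaire, you can reassign factor levels using a list. In fact, doing so can be quite efficient. See http://stackoverflow.com/documentation/r/1104/factors/6565/consolidating-factor-levels-with-a-list#t=201608132159126191797 - Benjamin
@alistaire See ?levels. levels can only be a vector, but levels<- can accept a list on the RHS. - MichaelChirico
3 Answers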

5

I prepared an example character vector containing all the abbreviations:

my_example <- c("C","OG","OT","TE","DT","DE","CB","WR","FS", 
                "FB","ILB","OLB","P","QB","RB","SS","WR")
class(my_example)

[1] "character"

Then I replaced the abbreviations with the full names of the desired levels (you could also use gsub or any of many other approaches here):

my_example[my_example %in% c("C","OG","OT","TE","DT","DE")] <- "Linemen"
my_example[my_example %in% c("CB","WR","FS")]               <- "Small Backs"
my_example[my_example %in% c("FB","ILB","OLB","P",
                             "QB","RB","SS","WR")]          <- "Big Backs"

Then I converted it to a factor:
my_example <- as.factor(my_example)
head(my_example)
[1] Linemen Linemen Linemen Linemen Linemen Linemen
Levels: Big Backs Linemen Small Backs
tail(my_example)
[1] Big Backs   Big Backs   Big Backs   Big Backs   Big Backs   Small Backs
Levels: Big Backs Linemen Small Backs
class(my_example)

[1] "factor"


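It would be better to assign the result to a different name, since the target is probably a factor in a data frame. - IRTFM
I did this: > nfldraft$Pos[nfldraft$Pos %in% c("C","OG","OT","TE","DT","DE")] <- "Linemen" Warning message: In `[<-.factor`(`*tmp*`, nfldraft$Pos %in% c("C", "OG", "OT", "TE", : invalid factor level, NA generated - Amin Sammara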
1
@AminSammara That's because you didn't start from a character vector. Run nfldraft$Pos <- as.character(nfldraft$Pos) first and then it will work. - Hack-R
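The fix suggested in the comment above can be sketched as follows. The data frame `draft` is a made-up stand-in for `nfldraft`, whose real contents are not shown in the question, and the level order passed to `factor()` is just one reasonable choice.

```r
# Hypothetical stand-in for nfldraft (the real data is not shown in the question)
draft <- data.frame(Pos = factor(c("C", "OG", "QB", "CB", "WR")))

# Work on a character copy so that new labels can be assigned freely;
# assigning a label that is not an existing level directly into a factor
# produces NA with an "invalid factor level" warning
pos <- as.character(draft$Pos)
pos[pos %in% c("C", "OG", "OT", "TE", "DT", "DE")] <- "Linemen"
pos[pos %in% c("CB", "WR", "FS")] <- "Small_Backs"
pos[pos %in% c("FB", "ILB", "OLB", "P", "QB", "RB", "SS")] <- "Big_Backs"

# Convert back to a factor with an explicit level order
draft$Pos <- factor(pos, levels = c("Linemen", "Small_Backs", "Big_Backs"))
print(table(draft$Pos))
```

Building the result in a separate vector (`pos`) and assigning it back only at the end also follows the earlier suggestion to avoid overwriting the original column halfway through.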

1

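This is a good example of why a fully reproducible example is needed. In fact, the OP's code looks like it should work. Taking the sample input from @Hack-R: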

my_example <- c("C","OG","OT","TE","DT","DE","CB","WR","FS", 
                "FB","ILB","OLB","P","QB","RB","SS","WR")

The OP's original code works as-is:

nfldraft = list(Pos = factor(my_example))
levels(nfldraft$Pos) <- list(
  Linemen = c("C","OG","OT","TE","DT","DE"), 
  Small_Backs =  c("CB","WR","FS"), 
  Big_Backs = c("FB","ILB","OLB","P","QB","RB","SS","WR")
)
table(nfldraft$Pos)
#     Linemen Small_Backs   Big_Backs 
#           6           2           9 

This is fully consistent with the documentation on how to use levels<-:

levels(x) <- value

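value: A valid value for levels(x)... For the factor method, a vector of character strings with length at least the number of levels of x, or a named list specifying how to rename the levels.

So it seems something else is wrong with the OP's input.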
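One way to look for such a problem, as a sketch only: with the list form of levels<-, an existing level that appears nowhere in the list has no new name and its observations become NA, so comparing the current levels against the listed abbreviations shows which values would be lost. The vector `pos` below, including the stray "qb " entry, is invented for illustration and is not the OP's data; the names `unmatched` and `dups` are likewise illustrative.

```r
# Invented sample; "qb " (wrong case, trailing space) stands in for messy input
pos <- factor(c("C", "OG", "CB", "QB", "qb "))

# Grouping list copied from the question
groups <- list(
  Linemen     = c("C", "OG", "OT", "TE", "DT", "DE"),
  Small_Backs = c("CB", "WR", "FS"),
  Big_Backs   = c("FB", "ILB", "OLB", "P", "QB", "RB", "SS", "WR")
)
listed <- unlist(groups, use.names = FALSE)

# Levels present in the data but missing from every group
unmatched <- setdiff(levels(pos), listed)
print(unmatched)

# Abbreviations that are listed in more than one group
dups <- unique(listed[duplicated(listed)])
print(dups)

# Applying the list leaves the unmatched level without a group, so it becomes NA
levels(pos) <- groups
print(table(pos, useNA = "ifany"))
```

If `unmatched` is non-empty for the real data, differences in case or whitespace in the abbreviations would be a plausible thing to check next.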


1
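The fact that this works now may be due to changes to the factor() function in R 3.5.0; see the "CHANGES IN R 3.5.0" section of the release notes. Still, the reproducible example you provided is exactly right! - jay.sf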

0

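You can also use the mapvalues() function, which comes from the plyr package (dplyr supplies mutate() and %>%, but not mapvalues()).

For your example, it would look like this: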

library(dplyr)  # mutate() and %>%; mapvalues() is called from plyr below

Linemen_levels = c("C","OG","OT","TE","DT","DE")
Small_Backs_levels = c("CB","WR","FS")
Big_Backs_levels = c("FB","ILB","OLB","P","QB","RB","SS","WR")

nfldraft <- nfldraft %>% mutate(Pos = plyr::mapvalues(Pos,
                 from = c(Linemen_levels, Small_Backs_levels, Big_Backs_levels),
                 to = c(rep('Linemen', length(Linemen_levels)),
                        rep('Small_Backs', length(Small_Backs_levels)),
                        rep('Big_Backs', length(Big_Backs_levels)))))
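The same from/to mapping can also be written in base R with a named lookup vector, which avoids the package dependency. This is only a sketch: `lookup`, `pos`, and `grouped` are illustrative names, the sample input is invented, and "WR" is deliberately listed in just one group here so that every abbreviation has a single target.

```r
# Grouping list; "WR" is kept only under Small_Backs (an assumption for this sketch)
groups <- list(
  Linemen     = c("C", "OG", "OT", "TE", "DT", "DE"),
  Small_Backs = c("CB", "WR", "FS"),
  Big_Backs   = c("FB", "ILB", "OLB", "P", "QB", "RB", "SS")
)

# Flatten the list into a lookup vector: names are abbreviations, values are groups
lookup <- rep(names(groups), lengths(groups))
names(lookup) <- unlist(groups, use.names = FALSE)

# Invented sample input
pos <- c("C", "WR", "QB", "FS")

# Index the lookup by abbreviation, then build the factor with a fixed level order
grouped <- factor(unname(lookup[pos]), levels = names(groups))
print(grouped)
```

For a factor column, `pos` would first need to be converted with as.character(), since indexing by a factor uses its integer codes rather than its labels.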
