Converting factors to numeric values in R

Question

r categorical-data

3

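I have some factors in R that are salary ranges of the form $100,001 - $150,000, over $150,000, $25,000, etc., and I want to convert them to numeric values (for example, convert the factor $100,001 - $150,000 to the integer 125000).

Similarly, I have education categories such as High School Diploma, Current Undergraduate, PhD, etc., that I want to assign numbers to (for example, giving PhD a higher value than High School Diploma).

Given a data frame containing these values, how can I do this?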

- raxacoricofallapatorius

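@Stat: It's not clear to me from that how to map each factor level to a number of my choosing. - orome

Hmm, I don't think that will help in this case; I'm putting together an answer quickly. - Mike Nute

3 Answers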

8

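You can use the recode function from the car package.

For example:

library(car)
# as.factor = FALSE returns a plain vector rather than a factor
# (argument name as in car 3.x; older versions call it as.factor.result)
df$salary <- recode(df$salary, 
    "'$100,001 - $150,000'=125000;'over $150,000'=150000;'$25,000'=25000",
    as.factor = FALSE)

See the help file (?recode) for details on how to use this function.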

- wmmurrah

0

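I would create a vector of values that lines up with the levels of your factor and do the mapping from there. The code below is not an elegant solution, because I couldn't figure out how to do the indexing with a vector, but it will get the job done if your data is not especially large. Suppose we want to map the factor elements of fact to the numbers in vals: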

fact <- as.factor(c("a", "b", "c"))
vals <- c(1, 2, 3)

# for example:
vals[levels(fact) == "b"]
# gives: [1] 2

# now make an example data frame:
data <- data.frame(myvar = fact[sample(1:3, 10, replace = TRUE)])

# our vlookup function:
vlookup <- function(fact, vals, x) {
    # the levels of fact and vals must line up one-to-one
    stopifnot(length(levels(fact)) == length(vals))

    out <- rep(vals[1], length(x))
    for (i in seq_along(x)) {
        out[i] <- vals[levels(fact) == x[i]]
    }
    return(out)
}

# test it:
data$myvarNumeric <- vlookup(fact, vals, data$myvar)

This should work for what you've described.
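The loop can also be replaced by a vectorized lookup. The following is only a sketch, not the answerer's code: it assumes the same `fact` and `vals` as above, and `vlookup_vec` is a hypothetical name introduced for illustration.

```r
# Same lookup table as in the answer above
fact <- as.factor(c("a", "b", "c"))
vals <- c(1, 2, 3)

# Hypothetical vectorized variant of vlookup(): match() returns, for each
# element of x, its position in levels(fact); that position then indexes vals.
vlookup_vec <- function(fact, vals, x) {
  stopifnot(length(levels(fact)) == length(vals))
  vals[match(as.character(x), levels(fact))]
}

x <- factor(c("b", "a", "c", "b"), levels = levels(fact))
result <- vlookup_vec(fact, vals, x)
print(result)
```

Values of `x` that match no level come back as `NA` instead of raising an error, which may or may not be what you want.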

- Mike Nute

Mike, I think indexing is a good approach; I think this would work: fact<-c(a=1,b=2,c=3); and then fact[data$myvar]. - user20650
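One caveat with the indexing approach in this comment: when the index is a factor, `[` uses the factor's integer codes rather than its labels, so the result is only right when the level order happens to match the order of the lookup vector. A small sketch with made-up data, converting to character first so the lookup goes by name:

```r
# Named lookup vector: names are the factor labels, values are the numbers
lookup <- c(a = 1, b = 2, c = 3)

# Made-up data whose level order deliberately differs from the lookup order
myvar <- factor(c("c", "a", "b", "c"), levels = c("c", "b", "a"))

by_code  <- unname(lookup[myvar])                # indexes by integer codes
by_label <- unname(lookup[as.character(myvar)])  # indexes by names

print(by_code)
print(by_label)
```

Only `by_label` follows the intended mapping here; `by_code` silently returns different numbers.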

- user20650 · Accepted Answer

For the salary conversion:

# data
df <- data.frame(sal = c("$100,001 - $150,000" , "over $150,000" , 
    "$25,000"), educ = c("High School Diploma", "Current Undergraduate",
   "PhD"),stringsAsFactors=FALSE)

# Remove commas and dollar signs
temp <- gsub("[,$]","", df$sal)

# remove text
temp <- gsub("[[:alpha:]]","", temp)

# get average over range
df$ave.sal <- sapply(strsplit(temp , "-") , function(i) mean(as.numeric(i)))
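The question mentions factors, while the data frame here is built with `stringsAsFactors=FALSE`. If the column really is a factor, calling `as.numeric()` on it directly returns the internal level codes, so it is safer to go through `as.character()` first. A sketch that wraps the steps above in a helper (`salary_midpoint` is a hypothetical name, not part of the original answer):

```r
# Hypothetical helper wrapping the gsub/strsplit steps shown above
salary_midpoint <- function(x) {
  x <- as.character(x)               # use the labels, not the integer codes
  x <- gsub("[,$]", "", x)           # drop commas and dollar signs
  x <- gsub("[[:alpha:]]", "", x)    # drop words such as "over"
  # split ranges on "-" and average the endpoints; single values pass through
  sapply(strsplit(x, "-"), function(i) mean(as.numeric(i)))
}

sal <- factor(c("$100,001 - $150,000", "over $150,000", "$25,000"))
midpoints <- salary_midpoint(sal)
print(midpoints)
```

Note that "over $150,000" is simply mapped to 150000 by this approach; whether that is a sensible value for an open-ended range is a modelling decision.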

If you want your education levels as numbers:

df$educ.f <- as.numeric(factor(df$educ , levels=c("High School Diploma" ,
          "Current Undergraduate", "PhD")))


df
#                  sal                  educ  ave.sal educ.f
# 1 $100,001 - $150,000   High School Diploma 125000.5      1
# 2       over $150,000 Current Undergraduate 150000.0      2
# 3             $25,000                   PhD  25000.0      3
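If only the ordering of the education levels matters, an ordered factor is an alternative worth considering: it keeps the labels, allows comparisons, and still gives integer ranks when needed. A minimal sketch using the same three levels:

```r
educ <- c("High School Diploma", "Current Undergraduate", "PhD")

# ordered = TRUE records that the levels run from lowest to highest
educ_ord <- factor(educ,
                   levels = c("High School Diploma", "Current Undergraduate", "PhD"),
                   ordered = TRUE)

above_hs <- educ_ord > "High School Diploma"  # comparison respects level order
ranks    <- as.integer(educ_ord)              # 1 = lowest level, 3 = highest

print(above_hs)
print(ranks)
```

The ranks are only ordinal: the gap between 1 and 2 is not assumed to equal the gap between 2 and 3.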

Edit

Missing/NA values should not cause any problems.

# Data that includes missing values

df <- data.frame(sal = c("$100,001 - $150,000" , "over $150,000" , 
                 "$25,000" , NA), educ = c(NA, "High School Diploma", 
"Current Undergraduate", "PhD"),stringsAsFactors=FALSE)

Re-running the commands above gives:

df
#                  sal                  educ  ave.sal educ.f
# 1 $100,001 - $150,000                  <NA> 125000.5     NA
# 2       over $150,000   High School Diploma 150000.0      1
# 3             $25,000 Current Undergraduate  25000.0      2
# 4                <NA>                   PhD       NA      3
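One case where an `NA` can appear unexpectedly: a label that is not listed in `levels` is silently converted to `NA` by `factor()`. A quick check, using a made-up misspelled value, can be sketched as:

```r
levs <- c("High School Diploma", "Current Undergraduate", "PhD")

# Made-up data: one genuine NA and one label ("Phd") that matches no level
educ <- c("High School Diploma", NA, "Phd", "PhD")

educ_num <- as.numeric(factor(educ, levels = levs))

# Labels present in the data but absent from the level list
unmatched <- setdiff(educ[!is.na(educ)], levs)

print(educ_num)
print(unmatched)
```

An empty `unmatched` means every non-missing label was covered by the level list.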