使用查找表替换字符串 dplyr

6

我正在尝试在R中创建查找表,以便使我的数据与所在公司的格式相同。

它涉及我想要使用dplyr合并的不同教育类别。

library(dplyr)

# Create data
education <- c("Mechanichal Engineering","Electric Engineering","Political Science","Economics")

    data <- data.frame(X1=replicate(1,sample(education,1000,rep=TRUE)))

    tbl_df(data)

    # Create lookup table
    lut <- c("Mechanichal Engineering" = "Engineering",
             "Electric Engineering" = "Engineering",
             "Political Science" = "Social Science",
             "Economics" = "Social Science")

    # Assign lookup table
    data$X1 <- lut[data$X1]

但是在我的输出中,我的旧值被替换成了错误的值,即不是我在查找表中创建的值。相反,似乎查找表是随机分配的。


1
请在您的问题中包含 dput(data) - nrussell
3
请确保您提供的示例代码可以被他人运行(将注释放在“#”之后,不要更改 data/Data 的大写)。这里有一个指南:https://dev59.com/eG025IYBdhLWcg3whGSx#28481250。 - Frank
1
Data$education <- lut[as.character(data$education)] 这段代码怎么样? - MrFlick
@Frank 谢谢。已修复。但同意应该在问题中提供可重现的示例。 - MrFlick
也许使用连接到您首选值的映射甚至更有效? - leerssej
4个回答

3

我刚刚试着自己解决了这个问题。我对大部分我找到的解决方案都不是很满意,所以这就是我最后采用的方法。我添加了一个“其他”类别来表明即使在查找表中未定义值也可以工作。

library(dplyr)

# Create data
education <- c("Mechanichal Engineering",
               "Electric Engineering",
               "Political Science",
               "Economics",
               "Other")

data <- data.frame(X1 = replicate(1, sample(education, 20, rep=TRUE)))

# Create lookup table
lut <- c("Mechanichal Engineering" = "Engineering",
         "Electric Engineering" = "Engineering",
         "Political Science" = "Social Science",
         "Economics" = "Social Science")

data %>%
    mutate(X2 = recode(X1, !!!lut))
#>                         X1             X2
#> 1     Electric Engineering    Engineering
#> 2                    Other          Other
#> 3                    Other          Other
#> 4                    Other          Other
#> 5                    Other          Other
#> 6        Political Science Social Science
#> 7                    Other          Other
#> 8                Economics Social Science
#> 9        Political Science Social Science
#> 10    Electric Engineering    Engineering
#> 11               Economics Social Science
#> 12               Economics Social Science
#> 13 Mechanichal Engineering    Engineering
#> 14               Economics Social Science
#> 15       Political Science Social Science
#> 16                   Other          Other
#> 17                   Other          Other
#> 18                   Other          Other
#> 19 Mechanichal Engineering    Engineering
#> 20       Political Science Social Science

2
education <- c("Mechanichal Engineering","Electric Engineering","Political Science","Economics")
lut <- list("Mechanichal Engineering" = "Engineering",
            "Electric Engineering" = "Engineering",
            "Political Science" = "Social Science",
            "Economics" = "Social Science")
lut2<-melt(lut)
data1 <- data.frame(X1=replicate(1,sample(education,1000,rep=TRUE)))
data1$new <- lut2[match(data1$X1,lut2$L1),'value']
head(data1)


=======================  ==============
X1                       new           
=======================  ==============
Political Science        Social Science
Political Science        Social Science
Mechanichal Engineering  Engineering   
Mechanichal Engineering  Engineering   
Political Science        Social Science
Political Science        Social Science
=======================  ==============

这种方法的问题在于,如果我只想重新编码一个字符串,那么我的新变量将包含许多NA。如果我有一个包含大量字符串但只想重新编码其中一些的数据集,那就是个问题。 - FilipW
可以再详细说明一下吗?你是说如果只想用“Engineering”替换“Mechanical Engineering”,而忽略其他内容,是这个意思吗? - Carl Boneri
没错。考虑一下如果我有1000个不同的因素,只想重命名其中几个,或者像这种情况一样,合并不同的因素。那么这个解决方案将会产生一个新变量,其中每个未被明确重命名的因素都将是N/A。 - FilipW

1

我曾经遇到过类似的问题,我的数据框有很多列,其中一列有大约95个不同的值。我想创建另一列,将这95个值分组(映射)成更易管理的内容。我创建了一个简单的数据框,将映射到列的映射表作为查找表。

在您的情况下,与其执行此步骤:

# Create lookup table
lut <- c("Mechanichal Engineering" = "Engineering",
         "Electric Engineering" = "Engineering",
         "Political Science" = "Social Science",
         "Economics" = "Social Science")

你可以创建一个简单的数据框:

subject = c("Mechanichal Engineering",
             "Electric Engineering",
             "Political Science"",
             "Economics")

category = c("Engineering",
             "Engineering",
             "Social Science",
             "Social Science")

lookup_table = data.frame(subject, category)

你需要两个库来实现哈希表和查找:
library(hash)
library(qdapTools)

使用两个列作为哈希表加载一个名为lookup_table的数据框:

在此数据框中,第一列将是键,第二列是查找值。

# make the hash table
h = hash::hash(keys = lookup_table$subject, values = lookup_table$category)

# create the categories for your education values
# find the match of education in your h table and return the value category from the h table

data$ed_category = hash_look(data$education, h, missing = data$education)

现在你的输出将会像下面这样;使用缺失的参数只是使用 Other,因为它不在哈希表中,而是在教育列中。
#>                  education    ed_category
#> 1     Electric Engineering    Engineering
#> 2                    Other          Other
#> 3                    Other          Other
#> 4                    Other          Other
#> 5                    Other          Other
#> 6        Political Science Social Science

0
我发现最好的方法是使用car包中的recode()函数。
# Observe that dplyr also has a recode function, so require car after dplyr
    require(dplyr)
    require(car)

这些数据是从四个教育类别中抽样得到的。

    education <- c("Mechanichal Engineering",
                   "Electric Engineering","Political Science","Economics")

data <- data.frame(ID = c(1:1000), X1 = replicate(1,sample(education,1000,rep=TRUE)))

使用recode()函数对数据进行重新编码,以重新编码类别。

lut <- data.frame(ID = c(1:1000), X2 = recode(data$X1, '"Economics" = "Social Science";
                         "Electric Engineering" = "Engineering";
                          "Political Science" = "Social Science";
                          "Mechanichal Engineering" = "Engineering"'))

为了确认是否正确执行,需要将原始数据和重新编码后的数据合并。
data <- full_join(data, lut, by = "ID")

head(data)

   ID                     X1             X2
1  1       Political Science Social Science
2  2               Economics Social Science
3  3    Electric Engineering    Engineering
4  4       Political Science Social Science
5  5               Economics Social Science
6  6 Mechanichal Engineering    Engineering

使用 recode,您无需在重新编码数据之前对其进行排序。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接