How to narrow down a data frame in R

Question


3

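Apologies that my title is less than perfect, but I am having a hard time framing this question.

Here is some manually created data. There are three fields: state, codetype, and code. The reason is that I am trying to join a wider version of this data frame to one with 1.6 million rows, and I am running out of memory. My thought process is that I would greatly reduce the number of rows in this table, industry.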

state <- c(32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32)
codetype <- c(10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10)
code <- c(522,523,524,532,533,534,544,545,546,551,552,552,561,562,563,571,572,573,574)



industry = data.frame(state,codetype,code)

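The desired result is a two-step operation. First, I shorten the six-digit code (three digits in the sample above) to two digits. This is done via: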

library(dplyr)
industry <- industry %>% mutate(twodigit = substr(code, 1, 2))

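This produces a fourth column, twodigit. It currently holds 19 values, but only six of them are unique: 52, 53, 54, 55, 56, 57. How do I tell it to drop every row whose twodigit value has already appeared?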

- Tim Wilcox

2

Do you need industry %>% distinct(twodigit, .keep_all = TRUE)? - akrun

1

@akrun, please write this up as an answer. Yes, it worked, thank you for your help. - Tim Wilcox

2 Answers

1

An approach using unique():

library(tidyverse)

state <- c(32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32)
codetype <- c(10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10)
code <- c(522,523,524,532,533,534,544,545,546,551,552,552,561,562,563,571,572,573,574)
industry = data.frame(state,codetype,code)
industry<-industry %>% mutate(twodigit = substr(code,1,2))


unique(industry$twodigit) %>%
    map_dfr(~filter(industry, twodigit == .x)[1, ])
#>   state codetype code twodigit
#> 1    32       10  522       52
#> 2    32       10  532       53
#> 3    32       10  544       54
#> 4    32       10  551       55
#> 5    32       10  561       56
#> 6    32       10  571       57

Created on 2021-06-10 by the reprex package (v2.0.0)

- jpdugo17


- akrun · Accepted Answer

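We can use distinct with .keep_all set to TRUE. This keeps the first row for each distinct twodigit value and retains all the other columns.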

library(dplyr)
industry %>%
   distinct(twodigit, .keep_all = TRUE)
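
To make the behaviour concrete, here is a minimal self-contained sketch that rebuilds the sample data from the question so it runs on its own. The variable name reduced is only illustrative and does not appear in the original post.

```r
library(dplyr)

# Sample data from the question
industry <- data.frame(
  state    = rep(32, 19),
  codetype = rep(10, 19),
  code     = c(522, 523, 524, 532, 533, 534, 544, 545, 546, 551,
               552, 552, 561, 562, 563, 571, 572, 573, 574)
)

# substr() coerces the numeric code to character before slicing
industry <- industry %>% mutate(twodigit = substr(code, 1, 2))

# Keep the first row for each distinct twodigit value, with all columns
reduced <- industry %>% distinct(twodigit, .keep_all = TRUE)

nrow(reduced)   # 6
reduced$code    # 522 532 544 551 561 571
```

Without .keep_all = TRUE, distinct(twodigit) would return only the twodigit column.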

Another option is to use duplicated inside filter.

industry %>%
    filter(!duplicated(twodigit))
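
For readers unfamiliar with duplicated, this small base R sketch shows what it returns. It uses a handful of the sample codes rather than the full data frame; the names twodigit and dup here are local to the example.

```r
# Two-digit prefixes for a few of the sample codes
twodigit <- substr(c(522, 523, 532, 533, 544), 1, 2)

# TRUE marks a value that has already appeared earlier in the vector
dup <- duplicated(twodigit)
dup              # FALSE  TRUE FALSE  TRUE FALSE

# Negating the flags keeps only the first occurrence of each value
twodigit[!dup]   # "52" "53" "54"
```

The same logical vector can index a data frame directly in base R, as in industry[!duplicated(industry$twodigit), ], which should give the same rows as the filter call.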

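For better efficiency, perhaps use the data.table approach. Note that this computes the two-digit prefix directly from code, so the twodigit column is not required.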

library(data.table)
setDT(industry)[!duplicated(substr(code, 1, 2))]
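
A self-contained sketch of the data.table variant follows, again rebuilding the sample data. Storing the prefix with := is an optional extra step, and reduced is an illustrative name. Whether this is actually faster on the 1.6 million row join is an assumption that would need benchmarking on the real data.

```r
library(data.table)

# Sample data from the question
industry <- data.frame(
  state    = rep(32, 19),
  codetype = rep(10, 19),
  code     = c(522, 523, 524, 532, 533, 534, 544, 545, 546, 551,
               552, 552, 561, 562, 563, 571, 572, 573, 574)
)

# setDT() converts the data frame to a data.table by reference (no copy)
setDT(industry)

# Compute the prefix on the fly and keep the first row per prefix
reduced <- industry[!duplicated(substr(code, 1, 2))]

# Optionally store the prefix as a column, also by reference
reduced[, twodigit := substr(code, 1, 2)]

nrow(reduced)   # 6
```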