I have an R data frame with hundreds of rows that looks like this:
word Freq
seed 4
seeds 3
contract 2
contracting 2
river 1
I would like to group the rows by pattern, for example seed + seeds, so that the result looks like this:
word Freq
seed 7
contract 4
river 1
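Here is another possible approach. The SnowballC package has a function, wordStem(), that reduces words to their stems. Using it lets you skip the string manipulation; once the words are stemmed, you only need to sum the frequencies for each stem.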
library(SnowballC)
library(dplyr)
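mydf <- read.table(text = "word Freq
seed 4
seeds 3
contract 2
contracting 2
river 1", header = TRUE, stringsAsFactors = FALSE)  # keep word as character
# Replace each word with its stem, then sum Freq per stem
mydf %>%
  mutate(word = wordStem(word)) %>%
  group_by(word) %>%
  summarise(total = sum(Freq))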
# word total
# (chr) (int)
#1 contract 4
#2 river 1
#3 seed 7
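One option is to create a grouping variable 'gr' by taking a substring of 'word' whose length is the minimum number of characters across all words. Within each 'gr' group, truncate 'word' the same way, this time to the length of the shortest word in that group, so that every word in a group collapses to the same substring. Then sum 'Freq' by 'word'. The per-group truncation is done in a separate mutate() step because expressions inside group_by() are evaluated on the ungrouped data in current dplyr.
library(dplyr)
# df1 is the data frame from the question; word must be character, not factor
df1 %>%
  group_by(gr = substr(word, 1, min(nchar(word)))) %>%
  mutate(word = substr(word, 1, min(nchar(word)))) %>%
  group_by(word) %>%
  summarise(Freq = sum(Freq))
#       word  Freq
#      (chr) (int)
#1 contract     4
#2    river     1
#3     seed     7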
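To see what the two truncation steps produce, the intermediate values can be computed in base R. This is only an illustrative sketch: `df1` is rebuilt here from the question's sample data, `ave()` stands in for the grouped dplyr step, and the names `n_all`, `n_grp`, and `stem` are made up for the example.

```r
# Sample data from the question (word stored as character, not factor)
df1 <- data.frame(
  word = c("seed", "seeds", "contract", "contracting", "river"),
  Freq = c(4, 3, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Step 1: coarse key = first n characters, n = length of the shortest word overall
n_all <- min(nchar(df1$word))
gr <- substr(df1$word, 1, n_all)

# Step 2: within each coarse group, truncate to the shortest word in that group
n_grp <- ave(nchar(df1$word), gr, FUN = min)
stem <- substr(df1$word, 1, n_grp)

# Step 3: sum Freq per truncated word
res <- aggregate(Freq ~ stem,
                 data = data.frame(stem = stem, Freq = df1$Freq,
                                   stringsAsFactors = FALSE),
                 FUN = sum)
res
```

Note that this is a heuristic: unrelated words that happen to share their first few characters land in the same coarse group and are truncated to the same length, so the resulting labels are not guaranteed to be real words.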
Try matching the terms with adist.
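# start with every row in its own group
dat$grp <- seq(nrow(dat))
# generate a matrix comparing the vector of words to themselves;
# with partial=TRUE, tmp[i, j] is 0 when word i occurs inside word j
tmp <- adist(dat$word, dat$word, partial=TRUE)
# ignore each word matching itself
diag(tmp) <- Inf
# put each longer word (column) into the group of the shorter word it contains (row)
dat$grp[col(tmp)[tmp==0]] <- row(tmp)[tmp==0]
# sum Freq per group and label each group with its first word
final <- aggregate(Freq ~ grp, data=dat, sum)
final$word <- dat$word[match(final$grp, dat$grp)]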
# grp Freq word
#1 1 7 seed
#2 3 4 contract
#3 5 1 river
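The matching step can be inspected on its own by listing which pairs of words have a partial distance of zero. The sketch below is illustrative only and uses the sample words from the question; `words` and `hits` are names introduced for the example.

```r
words <- c("seed", "seeds", "contract", "contracting", "river")

# tmp[i, j] is the edit distance needed to find words[i] somewhere inside words[j]
tmp <- adist(words, words, partial = TRUE)
diag(tmp) <- Inf   # ignore each word matching itself

# Off-diagonal zeros mark "row word is contained in column word"
hits <- which(tmp == 0, arr.ind = TRUE)
pairs <- data.frame(short = words[hits[, 1]],
                    long  = words[hits[, 2]],
                    stringsAsFactors = FALSE)
pairs
```

Because the match is a substring match rather than a prefix match, a short word that appears in the middle of a longer one would also be paired with it, which may or may not be what you want for your data.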
Data used:
dat <- data.frame(word=c("seed","seeds","contract","contracting","river"),Freq=c(4,3,2,2,1))
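library(dplyr)
library(stringi)
# df is the data frame from the question (columns word and Freq)
df %>%
  # no shared columns, so merge() pairs every word with every candidate short_word
  merge(df %>% select(short_word = word)) %>%
  # keep the pairs where short_word occurs inside word
  filter(stri_detect_regex(word, short_word)) %>%
  group_by(word) %>%
  # for each word, keep the shortest short_word it contains
  slice(which.min(stri_length(short_word))) %>%
  group_by(short_word) %>%
  summarise(Freq = sum(Freq))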
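The same "map each word to the shortest word it contains" idea can be sketched in base R without the cross join. This is a hedged variant rather than a drop-in equivalent: it assumes the sample data from the question, uses `startsWith()` so that only prefixes count (the regex version above matches anywhere in the word), and `shortest_prefix` is a hypothetical helper written for this example.

```r
df <- data.frame(
  word = c("seed", "seeds", "contract", "contracting", "river"),
  Freq = c(4, 3, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the shortest vocabulary entry that is a prefix of w
shortest_prefix <- function(w, vocab) {
  cands <- vocab[startsWith(w, vocab)]   # always includes w itself
  cands[which.min(nchar(cands))]
}

df$short_word <- vapply(df$word, shortest_prefix, character(1),
                        vocab = df$word, USE.NAMES = FALSE)

# Sum Freq per shortest prefix
res <- aggregate(Freq ~ short_word, data = df, FUN = sum)
res
```

Restricting to prefixes avoids grouping a word with an unrelated shorter word that merely appears inside it, at the cost of missing variants that differ at the start of the word.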