Select the most frequent element in a data frame while using table

3

I have a list of data frames and would like to use the table function on it. The list looks like this:

pronouns <- data.frame(pronounciation = c("juː","juː","juː","ju","ju","jə","jə","hɪm","hɪm","hɪm", "həm","ðɛm"), words = c("you","you","you","you","you","you","you","him","him","him","him","them"))
articles <- data.frame(pronounciation = c("ðiː","ði","ði","ðə","ðə","ði","ðə","eɪ","eɪ","æɪ","æɪ","eɪ","eɪ","eɪ","e"), words = c("the","the","the","the","the","the","the","a","a","a","a","a","a","a","a"))
numbers <- data.frame(pronounciation = c("wʌn","wʌn","wʌn","wʌn","wan","wa:n","tuː","tuː","tuː","tuː","tu","tu","tuː","tuː","θɹiː"), words = c("one","one","one","one","one","one","two","two","two","two","two","two","two","two","three"))
ls <- list(pronouns, articles, numbers)

ls[[1]]
   pronounciation words
1             juː   you
2             juː   you
3             juː   you
4              ju   you
5              ju   you
6              jə   you
7              jə   you
8             hɪm   him
9             hɪm   him
10            hɪm   him
11            həm   him
12            ðɛm  them

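From this list of data frames, I would like to use table() to extract a contingency table of $words while also selecting the most common pronunciation of each word. The desired result is in ls_out: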

pronouns_out <- data.frame(pronounciation = c("juː","hɪm","ðɛm"), words = c("you","him","them"), occurence = c(7,4,1))
articles_out <- data.frame(pronounciation = c("ði","eɪ"), words = c("the","a"), occurence = c(7,8))
numbers_out <- data.frame(pronounciation = c("wʌn","tuː","θɹiː"), words = c("one","two","three"), occurence = c(6,8,1))
ls_out <- list(pronouns_out, articles_out, numbers_out)

ls_out[[1]]
  pronounciation words occurence
1            juː   you         7
2            hɪm   him         4
3            ðɛm  them         1

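If two or more pronunciations have the same frequency (e.g. ði and ðə in ls[[2]]), one of them should be picked at random.
Any suggestions on how to achieve this are welcome.
3 Answers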

1
Using table (and lapply):

ff = function(pronounce, word) 
{
    tab = table(word, pronounce)
    data.frame(pronounciation = colnames(tab)[max.col(tab, "random")], 
               words = rownames(tab),
               occurences = unname(rowSums(tab)))
}

lapply(ls, function(x) ff(x$pronounciation, x$words))

#[[1]]
#  pronounciation words occurences
#1            hɪm   him          4
#2            ðɛm  them          1
#3            juː   you          7
#
#[[2]]
#  pronounciation words occurences
#1             eɪ     a          8
#2             ði   the          7
#
#[[3]]
#  pronounciation words occurences
#1            wʌn   one          6
#2           θɹiː three          1
#3            tuː   two          8

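Note: "If two or more pronunciations have the same frequency (e.g. ði and ðə in ls[[2]]), one of them should be picked at random." - MichaelChirico
@MichaelChirico: I missed that part; it is fixed now. - alexis_laz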
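
To make the tie-breaking concrete, here is a minimal sketch on a toy table (the data below is made up for illustration, not taken from the question's full list). It compares max.col with ties.method = "first", which always returns the leftmost maximum in a row, against "random", which picks among the tied maxima at random:

```r
# Toy contingency table: rows are words, columns are pronunciations.
tab <- table(
  word      = c("the", "the", "the", "the", "a", "a", "a"),
  pronounce = c("ði", "ði", "ðə", "ðə", "eɪ", "eɪ", "æɪ")
)

# "first" resolves ties deterministically by column order.
first_pick <- colnames(tab)[max.col(tab, ties.method = "first")]

# "random" picks among the tied maxima at random; set a seed to reproduce.
set.seed(1)
random_pick <- colnames(tab)[max.col(tab, ties.method = "random")]

picks <- data.frame(words = rownames(tab), first = first_pick, random = random_pick)
picks
```

For "a" both methods should agree on eɪ, since it is the unique maximum; for "the" the two pronunciations are tied, so the "random" column can differ between runs unless a seed is set.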

0
Using the `data.table` library:
library(data.table)

dtlist <- list(pronouns, articles, numbers)
invisible(lapply(dtlist, setDT))  # convert each data.frame to a data.table by reference

# for each data.table in dtlist, count rows per (pronounciation, words) pair
dtlistfreq1 <-
  lapply(dtlist, function(x) x[, .(freq = .N), by = .(pronounciation, words)])
# for each data.table in dtlistfreq1, keep the highest-frequency row per word
dtlistfreq2 <-
  lapply(dtlistfreq1, function(x) x[, .SD[which.max(freq)], by = .(words)])

Output:

> dtlistfreq2
[[1]]
   words pronounciation freq
1:   you            juː    3
2:   him            hɪm    4
3:  them            ðɛm    1

[[2]]
   words pronounciation freq
1:   the             ði    3
2:     a             eɪ    5

[[3]]
   words pronounciation freq
1:   one            wʌn    4
2:   two            tuː    6
3: three           θɹiː    1

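I think this is actually incorrect: it looks like the OP wants the total number of occurrences of each word, not the number of occurrences of each word/pronunciation pair. See my solution. - MichaelChirico
Also note: "If two or more pronunciations have the same frequency (e.g. ði and ðə in ls[[2]]), one of them should be picked at random." - MichaelChirico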
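
As a rough sketch of how the pair-count approach might be adapted to both comments (total count per word, random choice among tied pronunciations), consider the following. The helper name pick_top is hypothetical, and only the articles data from the question is used, to keep the example short:

```r
library(data.table)

articles <- data.table(
  pronounciation = c("ðiː", "ði", "ði", "ðə", "ðə", "ði", "ðə",
                     "eɪ", "eɪ", "æɪ", "æɪ", "eɪ", "eɪ", "eɪ", "e"),
  words = c(rep("the", 7), rep("a", 8))
)

# Count each (word, pronunciation) pair.
pair_freq <- articles[, .(freq = .N), by = .(words, pronounciation)]

# pick_top (hypothetical helper) returns the position of one maximum, chosen
# at random among ties. Indexing with sample.int() avoids the sample(x, 1)
# pitfall when x is a single number.
pick_top <- function(freq) {
  top <- which(freq == max(freq))
  top[sample.int(length(top), 1L)]
}

set.seed(42)
result <- pair_freq[, .(pronounciation = pronounciation[pick_top(freq)],
                        occurence = sum(freq)),
                    by = words]
result
```

Here sum(freq) gives the total number of rows per word, while pick_top only decides which pronunciation is reported.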

0

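Here is a solution using data.table that I believe gives the result you originally wanted, where occurrence is the total number of times each word appears rather than the count of each (word, pronunciation) pair: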

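library(data.table)

dtlist <- list(pronouns, articles, numbers)
invisible(lapply(dtlist, setDT))  # convert each data.frame to a data.table by reference

# Return the most common value of x; ties are broken uniformly at random.
common_r <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  n_top <- sum(counts == max(counts))  # number of values tied for first place
  names(counts)[sample.int(n_top, 1L)]
}

# The input column is spelled `pronounciation`; the output column is renamed.
lapply(dtlist, function(x)
  setcolorder(x[, .(occurrence = .N,
                    pronunciation = common_r(pronounciation)),
                by = words],
              c("pronunciation", "words", "occurrence")))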

Output:

[[1]]
   pronunciation words occurrence
1:           juː   you          7
2:           hɪm   him          4
3:           ðɛm  them          1

[[2]]
   pronunciation words occurrence
1:            ði   the          7
2:            eɪ     a          8

[[3]]
   pronunciation words occurrence
1:           wʌn   one          6
2:           tuː   two          8
3:          θɹiː three          1

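Note that I randomize when the most common pronunciation is not unique; if it is always unique (or you don't care which pronunciation is chosen in that case), this can be simplified:
common_r <- function(x) names(sort(table(x), decreasing = TRUE))[1]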

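If you don't want to carry around three separate lists for the different word categories, the output can be simplified further by wrapping the lapply in rbindlist: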

   pronunciation words occurrence
1:           juː   you          7
2:           hɪm   him          4
3:           ðɛm  them          1
4:            ði   the          7
5:            eɪ     a          8
6:           wʌn   one          6
7:           tuː   two          8
8:          θɹiː three          1

We could also add a category field to this new data.table indicating which category each word came from.
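
A minimal sketch of the combined table with a category field, assuming the per-category summaries are stored in a named list (the list names pronouns, articles, and numbers are chosen here for illustration; the values are copied from the output above). rbindlist's idcol argument then records which list element each row came from:

```r
library(data.table)

# Per-category summaries, copied from the output shown above.
summaries <- list(
  pronouns = data.table(pronunciation = c("juː", "hɪm", "ðɛm"),
                        words = c("you", "him", "them"),
                        occurrence = c(7L, 4L, 1L)),
  articles = data.table(pronunciation = c("ði", "eɪ"),
                        words = c("the", "a"),
                        occurrence = c(7L, 8L)),
  numbers  = data.table(pronunciation = c("wʌn", "tuː", "θɹiː"),
                        words = c("one", "two", "three"),
                        occurrence = c(6L, 8L, 1L))
)

# idcol = "category" adds a leading column holding each element's list name.
combined <- rbindlist(summaries, idcol = "category")
combined
```

If the list is unnamed, idcol falls back to the integer position of each element, so naming the list is what makes the category column readable.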
