Find the most common word in a string value

Question

3

I have data like this:

df <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8))

I want to find the most common word in each observation of variable A, where the words are separated by commas.

Every approach I have found only extracts the most common word across the whole column, for example:

table(unlist(strsplit(df$A,", "))) %>% which.max() %>% names()

What I get:

wrong_result <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8), C = c("b", "b", "b"))
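To make the cause of that result concrete, here is a minimal base R sketch (the names `pooled`, `pooled_top`, `tied`, `first_only` and `all_tied` are illustrative, not from the question). It assumes R 4.0 or later, where `data.frame()` keeps strings as character vectors. `unlist()` pools the words of all rows before counting, and `which.max()` returns only the first maximum, so ties are lost as well.

```r
df <- data.frame(
  A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"),
  B = c(3, 5, 8)
)

# unlist() merges the per-row word vectors into one vector,
# so the counts describe the whole column, not each row.
pooled <- table(unlist(strsplit(df$A, ", ")))   # a = 7, b = 8, c = 3
pooled_top <- names(which.max(pooled))          # "b", one word for the whole column

# which.max() reports only the first maximum, so a tie loses its second word.
tied <- table(c("a", "a", "b", "b"))
first_only <- names(which.max(tied))            # "a"
all_tied <- names(tied)[tied == max(tied)]      # "a" "b"
```

Counting within each element of the `strsplit()` list, and comparing against `max()` instead of calling `which.max()`, avoids both problems.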

If two words have the same frequency, both should be extracted. The result should look like this:

result <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8), C = c("a", "b", "a, b"))

- Anton

4 Answers

1

Here is another solution, using the tidyverse packages:

library(tidyverse)
df %>%
  # separate `A` into rows:
  separate_rows(A) %>%
  # for each combination of `B` and `A`...
  group_by(B, A) %>%
  # ... count the number of occurrence:
  summarise(N = n()) %>%
  # filter the maximum value(s):
  filter(N == max(N)) %>%
  # collapse the strings back together:
  summarise( 
            C = str_c(A, collapse = ',')
            ) %>%
  # select the new column `C`:
  select(C) %>%
  # bind this column back to the original `df`:
  bind_cols(., df)
# A tibble: 3 × 3
  C     A                       B
  <chr> <chr>               <dbl>
1 a     a, a, a, b, b, c, c     3
2 b     a, a, b, b, b, b, c     5
3 a,b   a, a, b, b              8
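The pipeline above groups by `B`, so it appears to rely on `B` being unique per row and already sorted, since `bind_cols()` matches rows by position. A hedged variant that does not depend on that is sketched below. The `id` column and the names `with_id`, `most_common` and `result` are introduced here for illustration only, and dplyr 1.0 or later is assumed for the `.groups` argument.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"),
  B = c(3, 5, 8)
)

# Give every row an explicit identifier (hypothetical helper column).
with_id <- df %>% mutate(id = row_number())

most_common <- with_id %>%
  # one row per word, keeping the row identifier
  separate_rows(A, sep = ", ") %>%
  # count each word within each original row
  count(id, A, name = "N") %>%
  group_by(id) %>%
  # keep the most frequent word(s), ties included
  filter(N == max(N)) %>%
  # collapse tied words back into one string
  summarise(C = toString(A), .groups = "drop")

# Join by id instead of relying on row order.
result <- with_id %>%
  left_join(most_common, by = "id") %>%
  select(-id)
```

Joining on `id` keeps `C` aligned with the original rows even when `B` contains duplicates or is not in ascending order.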

- Chris Ruehlemann

1

A base R solution:

sapply(strsplit(df$A,", "), \(x) {
  tab <- table(x)
  toString(names(tab[tab == max(tab)]))
})

# [1] "a"    "b"    "a, b"
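To attach the result to the data frame as column `C`, the same idea can be wrapped in a small helper. This is only a sketch: `most_common_words` is a hypothetical name, `vapply()` is swapped in for `sapply()` so the return type is always character, and empty or `NA` strings are not handled.

```r
# Hypothetical helper: most frequent word(s) in each element of x.
most_common_words <- function(x, sep = ", ") {
  vapply(strsplit(x, sep, fixed = TRUE), function(words) {
    counts <- table(words)
    # keep every word whose count equals the maximum, so ties survive
    toString(names(counts)[counts == max(counts)])
  }, character(1))
}

df <- data.frame(
  A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"),
  B = c(3, 5, 8)
)

df$C <- most_common_words(df$A)
df$C
# [1] "a"    "b"    "a, b"
```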

- Darren Tsai

0

Here is a base R solution using the native pipe operator `|>`, which was introduced in R 4.1.0.

df <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8))

strsplit(df$A,", ") |>
  lapply(table) |>
  lapply(\(x) names(x[x == max(x)])) |>
  sapply(toString)
#> [1] "a"    "b"    "a, b"

Created on 2022-07-23 by the reprex package (v2.0.1)

- Rui Barradas

- Maël · Accepted Answer

You can do:

library(dplyr)
library(stringr)
library(purrr)
df %>% 
  # map_chr() returns a character column; map() would produce a list column
  mutate(maxi = map_chr(str_split(A, pattern = ", "), 
                        ~ toString(names(which(table(.x) == max(table(.x)))))))

#                    A B maxi
#1 a, a, a, b, b, c, c 3    a
#2 a, a, b, b, b, b, c 5    b
#3          a, a, b, b 8 a, b