Extract 5 words before and after a specific word

Question

r

3

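How can I extract the words or the part of the sentence next to a specific word? For example:

"On June 28, Jane went to the cinema and ate popcorn"

I would like to pick "Jane" and get the window [-2, 2], meaning:

"June 28, Jane went to"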

- Ivancito

4 Answers

3

I have a shorter version using str_extract from stringr.

library(stringr)
txt <- "On June 28, Jane went to the cinema and ate popcorn"
str_extract(txt,"([^\\s]+\\s+){2}Jane(\\s+[^\\s]+){2}")

[1] "June 28, Jane went to"

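The function str_extract extracts a pattern from a string. In the regex, \\s matches a whitespace character and [^\\s] is its negation, matching any character except whitespace. The whole pattern therefore matches Jane preceded by two words and followed by two words, where a word is any run of non-whitespace characters and words are separated by whitespace.

A further advantage is that it is vectorized. If you have a vector of texts, or several matches within one text, you can use str_extract_all: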

s <- c("On June 28, Jane went to the cinema and ate popcorn. 
          The next day, Jane hiked on a trail.",
       "an indeed Jane loved it a lot")

str_extract_all(s,"([^\\s]+\\s+){2}Jane(\\s+[^\\s]+){2}")

[[1]]
[1] "June 28, Jane went to"   "next day, Jane hiked on"

[[2]]
[1] "an indeed Jane loved it"
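
The pattern above hard-codes a window of two words. As a minimal sketch, the window size can be made a parameter by assembling the pattern with paste0. The helper name window_pattern is hypothetical (it is not part of stringr), and it assumes the target contains no regex metacharacters.

```r
library(stringr)

# Hypothetical helper: build the window pattern for `n` words on each side.
# Assumes `target` contains no regex metacharacters.
window_pattern <- function(target, n) {
  paste0("([^\\s]+\\s+){", n, "}", target, "(\\s+[^\\s]+){", n, "}")
}

s <- c("On June 28, Jane went to the cinema and ate popcorn. The next day, Jane hiked on a trail.",
       "an indeed Jane loved it a lot")

# One word on each side of every occurrence of Jane
res <- str_extract_all(s, window_pattern("Jane", 1))
res[[1]]
#> [1] "28, Jane went"   "day, Jane hiked"
res[[2]]
#> [1] "indeed Jane loved"
```

Calling window_pattern("Jane", 2) reproduces the pattern used in the answer.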

- denis

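Is there a way to get only the words? No commas or anything else, just the plain words, so that I can count them afterwards. - Ivancito

You can use \\w in the regex to match word characters. str_extract_all(txt, "\\w+") will extract all the words. - denis

When I try the following code, it returns NA: str_extract_all(s,"([^\w]+\w+){2}Jane(\w+[^\w]+){2}") - Ivancito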
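
One possible way to get only the words, sketched under the assumption that a "word" is a run of \\w characters (letters, digits, underscore): match the window with \\w+ for words and \\W+ for the separators between them, then pull the word tokens out of the matched window. Note that the backslashes are doubled inside an R string.

```r
library(stringr)

txt <- "On June 28, Jane went to the cinema and ate popcorn"

# Two word tokens on each side of Jane; \\W+ covers spaces, commas, etc.
window <- str_extract(txt, "(\\w+\\W+){2}Jane(\\W+\\w+){2}")
window
#> [1] "June 28, Jane went to"

# Keep only the word tokens, dropping punctuation
words <- str_extract_all(window, "\\w+")[[1]]
words
#> [1] "June" "28"   "Jane" "went" "to"

# The tokens can then be counted, for example with table(words)
```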

2

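Here is an expanded example that handles multiple occurrences. Basically: split on whitespace, find the word, expand the indices, then build a list of results.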

s <- "On June 28, Jane went to the cinema and ate popcorn. The next day, Jane hiked on a trail."
words <- strsplit(s, '\\s+')[[1]]
inds <- grep('Jane', words)
lapply(inds, FUN = function(i) {
  paste(words[max(1, i-2):min(length(words), i+2)], collapse = ' ')
})
#> [[1]]
#> [1] "June 28, Jane went to"
#> 
#> [[2]]
#> [1] "next day, Jane hiked on"

Created on 2019-09-17 by the reprex package (v0.3.0)

- ClancyStats

-1

This should work:

stringr::str_extract(text, "(?:[^\\s]+\\s){5}Jane(?:\\s[^\\s]+){5}")
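
The {5} quantifier requires exactly five words on each side, so a sentence with fewer words around the target yields NA. As a hedged sketch (a variation on the pattern above, not the answerer's code), the range quantifier {0,5} takes up to five words and accepts fewer when the sentence is short:

```r
library(stringr)

txt <- "On June 28, Jane went to the cinema and ate popcorn"

# Exactly five words are required on each side; only three precede Jane here
strict <- str_extract(txt, "(?:[^\\s]+\\s){5}Jane(?:\\s[^\\s]+){5}")
strict
#> [1] NA

# Zero to five words on each side: takes whatever is available
lenient <- str_extract(txt, "(?:[^\\s]+\\s+){0,5}Jane(?:\\s+[^\\s]+){0,5}")
lenient
#> [1] "On June 28, Jane went to the cinema and"
```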

- Souvik Das

- AndS. · Accepted Answer

We can write a function to help. This makes the approach more flexible.

library(tidyverse)

txt <- "On June 28, Jane went to the cinema and ate popcorn"

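grab_text <- function(text, target, before, after){
  # Split the sentence into whitespace-separated tokens (only once)
  words <- str_split(text, "\\s+")[[1]]
  # Position of the first token that contains the target
  pos <- which(grepl(target, words))[1]
  # Clamp the window so it stays inside the sentence
  start <- max(1, pos - before)
  end <- min(length(words), pos + after)

  paste(words[start:end], collapse = " ")
}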

grab_text(text = txt, target = "Jane", before = 2, after  = 2)
#> [1] "June 28, Jane went to"

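First we split the sentence, then locate the target, then take the words before and after it (as many as specified in the function call), and finally paste the pieces back together into a sentence.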