Counting letters in R

Question

3

I have several pages of text and want to find the start and end positions of a certain word that appears in the text:

<body> I need to find the position of a **certain** word from a lot of text.</body>

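For example, here the word "certain" (not counting the **) starts at position 34 and ends at position 40. Digits and punctuation marks should be counted as well.

How can I do this in R? The text is in XML format.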

- ElinaJ

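There are many efficient answers, but I am afraid the text is too long to fit into the vector A or the object x... Is there a workaround other than splitting the text? - ElinaJ

How long is your text, and roughly how many times does the target word occur? - lawyeR

One of the texts is 625 words long, more than 6000 characters including spaces. When I cut it down to about 4000 characters including spaces, it fits into the vector/object. The search keywords are very sparse, so the word usually occurs only once in a text. - ElinaJ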

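Can you clarify? If you parse the XML document, extract the text strings, and then create a character vector, why would the answers fail to find the start and end positions of the target word? None of these strings looks very large. - lawyeR

This question reminds me of the example given in the discussion of the XY problem. I am pretty sure that is the situation we have here. - Roland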

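I have extracted the text from the XML (for now simply by copy and paste), and as long as the text is not too long I can get the start and end positions (at least I suspect that is the reason, because it works once I delete the final part of the text). But when I try to use the full text, RStudio prints a plus sign (+) as the last line, as if it were waiting for some argument to continue the function... On the other hand, when I cut the text down to roughly 4000 letters/spaces, RStudio does create the vector x, and the last line shows > instead of +. So what is the maximum number of characters in a vector? - ElinaJ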
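A possible explanation, offered as an assumption rather than something confirmed in this thread: the R documentation notes that command lines entered at the console are limited to about 4095 bytes, so a very long pasted string literal may be cut off, leaving the quote unclosed and the console waiting at the `+` prompt. A character string itself can hold far more than that. The sketch below reads the text from a file instead of pasting it; the filler text and the temporary file are made up for illustration.

```r
# Hypothetical stand-in for the extracted text: well over 6000 characters.
filler <- paste(rep("lorem ipsum dolor sit amet", 250), collapse = " ")
long_text <- paste(filler, "a certain word", filler)

# Save it to a temporary file, as if the text had been exported from the XML.
path <- tempfile(fileext = ".txt")
writeLines(long_text, path)

# Read the text from the file rather than pasting it at the prompt.
x <- paste(readLines(path), collapse = " ")
nchar(x)  # number of characters held in the single string x

# Start position(s) of the word in the long string.
starts <- unlist(gregexpr("certain", x, fixed = TRUE))
starts
```

Reading from a file sidesteps any console input limit, so splitting the text should not be necessary.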

5 Answers

4

The stringi package has a very useful function for this:

x <- "I need to find the position of a certain word from a lot of certain text, which needs a certain text processing function."

> stringi::stri_locate_all_regex(str = x, "certain") # list of start and end locations for matches
[[1]]
     start end
[1,]    34  40
[2,]    61  67
[3,]    89  95

- lawyeR

1

I guess stri_locate_all_fixed might be a bit faster. - akrun

@akrun, thanks. I rarely think about speed, but your point is well taken. - lawyeR
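The fixed-pattern variant suggested in the comment can be sketched as follows. It assumes the stringi package is installed and reuses the example string from the answer. No timing claim is made here, only that the pattern is matched literally rather than as a regular expression.

```r
library(stringi)

x <- "I need to find the position of a certain word from a lot of certain text, which needs a certain text processing function."

# Literal (non-regex) search; the result is a list with one matrix per input string.
loc <- stri_locate_all_fixed(x, "certain")[[1]]
loc  # one row per match, with columns "start" and "end"
```

A literal search also avoids surprises when the search term contains regex metacharacters such as `.` or `(`.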

3

You can use the cwhmisc package. Put your text into a character vector first.

library(cwhmisc)

A <- "I need to find the position of a certain word from a lot of text"

cpos(A, "certain")

- Ruthger Righart

2

You can use this:

> regexpr("a","sjnasd")
[1] 4
attr(,"match.length")
[1] 1
attr(,"useBytes")
[1] TRUE

However, this only finds the first occurrence of the substring within the larger string.

- Lalit Sachan
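Since the question asks for the end position as well, here is a minimal sketch of deriving it from the `match.length` attribute that `regexpr` returns. The variable names are illustrative only.

```r
x <- "I need to find the position of a certain word from a lot of text."

# regexpr() reports only the first match; fixed = TRUE treats the pattern literally.
m <- regexpr("certain", x, fixed = TRUE)

start <- as.integer(m)                       # -1 would mean "no match"
end   <- start + attr(m, "match.length") - 1L

c(start = start, end = end)                  # start 34, end 40
substring(x, start, end)                     # recovers the matched word
```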

1

You can also use the str_locate function from stringr. Note that it is just a wrapper around base::regexpr, but with a more memorable name :-)

> require(stringr)
> x <- "I need to find the position of a certain word from a lot of text."
> str_locate(x, "certain")
     start end
[1,]    34  40

- Peter Diakumis
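`str_locate` reports only the first match. When the word may occur more than once, stringr's `str_locate_all` can be used instead. A small sketch, assuming stringr is installed and using the longer example string from the other answers:

```r
library(stringr)

x <- "I need to find the position of a certain word from a lot of certain text, which needs a certain text processing function."

# fixed() requests a literal match; the result is a list with one matrix per input string.
loc <- str_locate_all(x, fixed("certain"))[[1]]
loc  # one row per match, with columns "start" and "end"
```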

Content provided by Stack Overflow.

- Roland · Accepted Answer

Use the gregexpr function:

x <- "I need to find the position of a certain word from a lot of certain text,
which needs a certain text processing function."
gregexpr("certain", x, fixed = TRUE)
#[[1]]
#[1] 34 61 89
#attr(,"match.length")
#[1] 7 7 7
#attr(,"useBytes")
#[1] TRUE
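`gregexpr` returns start positions plus match lengths, so the end positions asked for in the question still need one more step. A minimal base R sketch (the names `starts`, `ends`, and `positions` are illustrative):

```r
x <- "I need to find the position of a certain word from a lot of certain text, which needs a certain text processing function."

# One list element per input string; take the first because x has length 1.
m <- gregexpr("certain", x, fixed = TRUE)[[1]]

starts <- as.integer(m)                          # a single -1 would mean "no match"
ends   <- starts + attr(m, "match.length") - 1L  # inclusive end positions

positions <- cbind(start = starts, end = ends)
positions
# rows: (34, 40), (61, 67), (89, 95)
```

This needs no extra packages and gives the same start/end pairs as the stringi output shown earlier in the thread.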