I have several pages of text and want to find the start and end positions of a particular word that occurs in the text:
<body> I need to find the position of a **certain** word from a lot of text.</body>
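For example, here the word certain (not counting the **) starts at position 34 and ends at position 40. Digits and punctuation are counted as well.
How can I do this in R? The text is in XML format.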
Use the gregexpr function:
x <- "I need to find the position of a certain word from a lot of certain text,
which needs a certain text processing function."
gregexpr("certain", x, fixed = TRUE)
#[[1]]
#[1] 34 61 89
#attr(,"match.length")
#[1] 7 7 7
#attr(,"useBytes")
#[1] TRUE
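gregexpr reports only the start positions and the match lengths, so the end positions asked for in the question still have to be computed as start + match.length - 1. A minimal sketch of that step, assuming a literal (non-regex) search; the helper name locate_all is made up for illustration and is not part of base R:

```r
# Hypothetical helper (not from the original answer): turns the output of
# gregexpr() into a matrix with one row per match and columns start/end.
locate_all <- function(text, word) {
  m <- gregexpr(word, text, fixed = TRUE)[[1]]
  if (m[1] == -1) {
    # gregexpr() signals "no match" with -1; return an empty 0 x 2 matrix
    return(matrix(integer(0), ncol = 2,
                  dimnames = list(NULL, c("start", "end"))))
  }
  start <- as.integer(m)                        # drop the attributes
  end <- start + attr(m, "match.length") - 1L   # inclusive end position
  cbind(start = start, end = end)
}

x <- "I need to find the position of a certain word from a lot of text."
locate_all(x, "certain")
```

For the sentence from the question this should give a single row with start 34 and end 40.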
The stringi package has a very useful function for this:
x <- "I need to find the position of a certain word from a lot of certain text, which needs a certain text processing function."
> stringi::stri_locate_all_regex(str = x, "certain") # list of start and end locations for matches
[[1]]
start end
[1,] 34 40
[2,] 61 67
[3,] 89 95
stri_locate_all_fixed may be a little faster. - akrun

library(cwhmisc)
A<-("I need to find the position of a certain word from a lot of text")
cpos(A, "certain")
You can use this:
> regexpr("a","sjnasd")
[1] 4
attr(,"match.length")
[1] 1
attr(,"useBytes")
[1] TRUE
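Note that regexpr returns only the first match, and that a plain search also matches inside longer words. If only the whole word should count, one option is a regular expression with word boundaries. A small sketch, assuming perl = TRUE so that \b is interpreted as a word boundary; the sentence y is invented for illustration:

```r
# Illustrative sentence (not from the original post): "certain" also
# occurs inside "uncertain".
y <- "It is uncertain whether a certain word appears."

# A literal search hits the substring inside "uncertain" first.
plain <- regexpr("certain", y, fixed = TRUE)

# Word boundaries restrict the match to the standalone word.
whole <- regexpr("\\bcertain\\b", y, perl = TRUE)

start <- as.integer(whole)
end <- start + attr(whole, "match.length") - 1L

as.integer(plain)      # 9
c(start, end)          # 27 33
substr(y, start, end)  # "certain"
```

The same pattern can be passed to gregexpr when every occurrence is needed rather than just the first.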
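The str_locate function from stringr. Note that in older stringr releases this was just a wrapper around base::regexpr (newer releases are built on stringi), but the name is easier to remember :-)
> require(stringr)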
> x <- "I need to find the position of a certain word from a lot of text."
> str_locate(x, "certain")
start end
[1,] 34 40
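Since the question says the text arrives as XML, the positions above only hold once the markup has been removed; counted against the raw string, the tag and the leading space shift everything. A rough sketch in base R, assuming a single-line document with one <body> element as in the question. Stripping tags with a regular expression is a shortcut chosen here for brevity; a real XML parser (for example the xml2 package) would be more robust:

```r
# Assumption: the whole document is one string with a single <body> element.
doc <- "<body> I need to find the position of a certain word from a lot of text.</body>"

# Keep only the text between <body> and </body>, dropping surrounding
# whitespace. Note that "." does not match newlines here, so multi-line
# documents would need different handling.
body <- sub("^.*<body>\\s*(.*?)\\s*</body>.*$", "\\1", doc, perl = TRUE)

m <- regexpr("certain", body, fixed = TRUE)
start <- as.integer(m)
end <- start + attr(m, "match.length") - 1L
c(start = start, end = end)
```

With the leading space trimmed, this should reproduce the positions 34 and 40 given in the question.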