Extracting hashtags from tweets

Question

r statistics analytics hashtag sentiment-analysis

3

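I'm trying to do sentiment analysis and have run into a small problem. I'm using a lexicon that contains hashtags along with other junk values (shown below). Each entry also has an associated weight. I want to extract only the hashtags and their corresponding weights into a new data frame. Is there an easy way to do this? I tried regmatches, but it returns its output as a list, which messes things up. Input: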

            V1    V2
1    #fabulous 7.526
2   #excellent 7.247
3      superb 7.199
4  #perfection 7.099
5    #terrific 6.922
6 #magnificent 6.672

Output:

            V1    V2
1    #fabulous 7.526
2   #excellent 7.247
3  #perfection 7.099
4    #terrific 6.922
5 #magnificent 6.672

- Kushal Bhola

2 Answers

0

This code should work and returns the desired output as a data.frame.

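 # Rebuild the sample lexicon; weights are stored as numbers rather than strings
 Input <- data.frame(V1 = c("#fabulous", "#excellent", "superb", "#perfection", "#terrific", "#magnificent"), V2 = c(7.526, 7.247, 7.199, 7.099, 6.922, 6.672))
 # Keep only the rows whose first character is "#"
 extractHashtags <- Input[substr(Input$V1, 1, 1) == "#", ]
 View(extractHashtags)  # opens a viewer; use print() in a non-interactive session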

- NiketD

- plannapus · Accepted Answer

If you only want to select the hashtag entries, a simple regular expression will do: ^# (meaning "anything that starts with #"):

> input[grepl("^#",input[,1]),]
            V1    V2
1    #fabulous 7.526
2   #excellent 7.247
4  #perfection 7.099
5    #terrific 6.922
6 #magnificent 6.672
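
A self-contained sketch of that filter, for anyone who wants to run it end to end. The data frame name `lexicon` is an assumption (the question does not show how the data was loaded), and the values are copied from the question. `startsWith()` is shown as a regex-free alternative available in base R 3.3.0 and later. Resetting the row names is optional; it only renumbers the rows 1 to 5 as in the desired output.

```r
# Hypothetical reconstruction of the lexicon from the question.
lexicon <- data.frame(
  V1 = c("#fabulous", "#excellent", "superb",
         "#perfection", "#terrific", "#magnificent"),
  V2 = c(7.526, 7.247, 7.199, 7.099, 6.922, 6.672),
  stringsAsFactors = FALSE
)

# Keep only the rows whose term starts with "#".
hashtags <- lexicon[grepl("^#", lexicon$V1), ]

# Same filter without a regular expression (base R >= 3.3.0).
hashtags2 <- lexicon[startsWith(lexicon$V1, "#"), ]

# Optional: renumber the rows so they run 1..n instead of keeping gaps.
rownames(hashtags) <- NULL
print(hashtags)
```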

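Beyond your original data, the regular expression #[[:alnum:]]+ (meaning "a hash sign followed by one or more alphanumeric characters") should help you grab hashtags from raw tweet text: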

> tweets <- c("New R job: Statistical and Methodological Consultant at the Center for Open Science http://www.r-users.com/jobs/statistical-methodological-consultant-center-open-science/ … #rstats #jobs","New R job: Research Engineer/Applied Researcher at eBay http://www.r-users.com/jobs/research-engineerapplied-researcher-ebay/ … #rstats #jobs")
> match <- regmatches(tweets,gregexpr("#[[:alnum:]]+",tweets))
> match
[[1]]
[1] "#rstats" "#jobs"  

[[2]]
[1] "#rstats" "#jobs"  
> unlist(match)
[1] "#rstats" "#jobs"   "#rstats" "#jobs"
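
Since the question mentions that the list output of regmatches gets in the way, here is one possible way to flatten it into a data frame while remembering which tweet each hashtag came from. This is a sketch, not part of the original answer: the tweets are made up for illustration, and the names `tags` and `tag_table` are hypothetical.

```r
# Made-up tweets for illustration; the second one has no hashtag.
tweets <- c("Loving the new release #rstats #dataviz",
            "No tags here",
            "Hiring an analyst #jobs")

# One list element per tweet, each holding that tweet's hashtags.
tags <- regmatches(tweets, gregexpr("#[[:alnum:]]+", tweets))

# lengths() counts the hashtags per tweet, so rep() repeats each tweet
# index once per hashtag; tweets without hashtags contribute no rows.
tag_table <- data.frame(
  tweet   = rep(seq_along(tweets), lengths(tags)),
  hashtag = unlist(tags),
  stringsAsFactors = FALSE
)
print(tag_table)
```

Note that [[:alnum:]] does not match an underscore, so a tag such as #data_science would be captured only as #data; widening the character class to [[:alnum:]_] is one way around that if such tags matter.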