Extracting hashtags from multiple tweets in R

4

I urgently need a way to extract hashtags from a collection of tweets in R. For example:

[[1]]
[1] "RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle"

[[2]]
[1] "BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012"

[[3]]
[1] "BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech"

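How should I parse this to get a list of the hashtag words from all the tweets? A previous solution only showed the hashtags from the first tweet, and the code produced these error messages: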
> string <-"MonicaSarkar: RT @saultracey: Sun kissed #olmpicrings at #towerbridge #london2012   @ Tower Bridge http://t.co/wgIutHUl"
> 
> [[2]]
Error: unexpected '[[' in "[["
> [1] "ccrews467: RT @BBCNews: England manager Roy Hodgson calls #London2012 a \"wake-up call\": footballers and fans should emulate spirit of #Olympics http://t.co/wLD2VA1K" 
Error: unexpected '[' in "["
> hashtag.regex <- perl("(?<=^|\\s)#\\S+")
> hashtags <- str_extract_all(string, hashtag.regex)
> print(hashtags)
[[1]]
[1] "#olmpicrings" "#towerbridge" "#london2012" 

1
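If you post your earlier code, we can probably show you how to loop over or recursively scan all the yourdata[[1:n]][1] elements. - Carl Witthoft
Just a heads-up: using a vector inside double brackets results in an "attempt to select more than one element" error :) - Sacha Epskamp
If an answer solved your problem, please accept it. If not, please comment under that answer and explain why. - Sacha Epskamp
@SachaEpskamp -- Yes, I was in too much of a hurry, trying to describe the range of data the OP was probably looking to search. Sorry. - Carl Witthoft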
2 Answers

9
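Using regmatches and gregexpr, this gives you a list with the hashtags of each tweet, assuming a hashtag has the form # followed by any number of letters or digits (I'm not very familiar with Twitter):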
foo <- c("RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle","BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012","BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech")

regmatches(foo,gregexpr("#(\\d|\\w)+",foo))

Returns:

[[1]]
[1] "#London2012"       "#MullingarShuffle"

[[2]]
[1] "#london2012"

[[3]]
[1] "#Olympics"   "#NBC"        "#london2012" "#tech"  
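The question's data is printed as a list rather than a character vector, so the same call can be applied after flattening it. A minimal sketch, assuming each list element holds a single tweet string; the names `tweets`, `tweet_text`, `tags_per_tweet` and `all_tags` are made up for illustration, the tweets are shortened, and the pattern is simplified to `#\\w+` because `\\w` already matches digits:

```r
# Hypothetical input: a list of length-one character vectors, shaped like
# the printed output in the question (tweet text shortened).
tweets <- list(
  "BPOInsight: RT @atos: Atos completes delivery http://t.co/Modkyo2R #london2012",
  "BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers. #london2012 #tech"
)

# Flatten the list into a plain character vector, one element per tweet.
tweet_text <- unlist(tweets)

# One character vector of hashtags per tweet.
tags_per_tweet <- regmatches(tweet_text, gregexpr("#\\w+", tweet_text))

# All hashtags across tweets, lower-cased and de-duplicated.
all_tags <- unique(tolower(unlist(tags_per_tweet)))
print(all_tags)
```

`tags_per_tweet` keeps the per-tweet grouping, while `all_tags` is a single vector, which is probably closer to the "list of hashtag words from all the tweets" the question asks for.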

3
How about using `strsplit` and `grep` (with `x` holding the tweets):
```R
> lapply(strsplit(x, ' '), function(w) grep('#', w, value=TRUE))
[[1]]
[1] "#London2012"       "#MullingarShuffle"

[[2]]
[1] "#london2012"

[[3]]
[1] "#Olympics"   "#NBC,"       "#london2012" "#tech"
```

I can't think of a way to return multiple results per string without splitting it first, but I'm sure there is one!
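Note that splitting on spaces keeps attached punctuation, which is why "#NBC," shows up with its comma above. A minimal sketch of one way to clean that up, assuming a hashtag is a word that starts with # and that trailing punctuation is never part of the tag; the sample vector and the name `tags` are made up for illustration, and `^#` is stricter than the `'#'` pattern used above:

```r
# Hypothetical sample input: one tweet with tags, one without.
x <- c(
  "BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers. #london2012 #tech",
  "No tags here"
)

# Split each tweet into words on single spaces.
words <- strsplit(x, " ", fixed = TRUE)

tags <- lapply(words, function(w) {
  hits <- grep("^#", w, value = TRUE)   # keep words that start with '#'
  sub("[[:punct:]]+$", "", hits)        # drop trailing punctuation such as ","
})
print(tags)
```

A tweet with no hashtags yields `character(0)`, so the result still has one element per tweet.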

