Twitter sentiment analysis in R using the German SentiWS word lists

4
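I want to run sentiment analysis on German tweets. The code I use works fine for English, but when I load the German word lists, every score comes out as zero. My guess is that this is caused by the different structure of the word lists, so I need to know how to adapt my code to the structure of the German lists. Could someone take a look at the two lists? English word list, German word list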
    # load the wordlists
    pos.words = scan("~/positive-words.txt",what='character', comment.char=';')
    neg.words = scan("~/negative-words.txt",what='character', comment.char=';')

    # bring in the sentiment analysis algorithm
    # we got a vector of sentences. plyr will handle a list or a vector as an "l"
    # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
    score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
    {
      require(plyr)
      require(stringr)
      scores = laply(sentences, function(sentence, pos.words, neg.words)
      {
        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        # and convert to lower case:
        sentence = tolower(sentence)
        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)
        # compare our words to the dictionaries of positive & negative terms
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)
        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)
        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)
        return(score)
      },
      pos.words, neg.words, .progress=.progress)
      scores.df = data.frame(score=scores, text=sentences)
      return(scores.df)
    }

    # and to see if it works, there should be a score...either in German or in English
    sample = c("ich liebe dich. du bist wunderbar","I hate you. Die!");sample
    test.sample = score.sentiment(sample, pos.words, neg.words);test.sample
2 Answers

3
This might work for you:

readAndflattenSentiWS <- function(filename) { 
  words = readLines(filename, encoding="UTF-8")
  words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)
  words <- unlist(strsplit(words, ","))
  words <- tolower(words)
  return(words)
}
# combine the English lists with the flattened German SentiWS lists
pos.words <- c(scan("positive-words.txt",what='character', comment.char=';', quiet=TRUE), 
               readAndflattenSentiWS("SentiWS_v1.8c_Positive.txt"))
neg.words <- c(scan("negative-words.txt",what='character', comment.char=';', quiet=TRUE), 
               readAndflattenSentiWS("SentiWS_v1.8c_Negative.txt"))

score.sentiment = function(sentences, pos.words, neg.words, .progress='none') {
  # ... see OP ...
}

sample <- c("ich liebe dich. du bist wunderbar",
            "Ich hasse dich, geh sterben!", 
            "i love you. you are wonderful.",
            "i hate you, die.")
(test.sample <- score.sentiment(sample, 
                                pos.words, 
                                neg.words))
#   score                              text
# 1     2 ich liebe dich. du bist wunderbar
# 2    -2      ich hasse dich, geh sterben!
# 3     2    i love you. you are wonderful.
# 4    -2                  i hate you, die.
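
To make the regular expression in `readAndflattenSentiWS` easier to follow, here is a small trace of its three steps on a single entry. The sample line is taken from the SentiWS excerpt quoted elsewhere in this thread; that its fields are separated by tabs is an assumption based on the format described there. The variable names are only for illustration.

```r
# One SentiWS entry; the fields are assumed to be tab-separated
line <- "Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue"

# Step 1: replace "|TAG<tab>weight<tab>" with a comma, so the base form
# and its inflected forms end up in one comma-separated string
step1 <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", line)

# Step 2: split that string into individual words
step2 <- unlist(strsplit(step1, ","))

# Step 3: lower-case the words, because score.sentiment() lower-cases
# the tweets before matching
step3 <- tolower(step2)
print(step3)
```

The lower-casing step matters for German in particular: nouns are capitalized in the word list, while the scoring function compares against lower-cased tweet text.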

2
For the German lists, the files are named SentiWS_v1.8c_Negative.txt and SentiWS_v1.8c_Positive.txt. However, the way you load the lists only works for the English version:
pos.words = scan("~/positive-words.txt",what='character', comment.char=';')
neg.words = scan("~/negative-words.txt",what='character', comment.char=';')

Beyond that, the lists also differ in format.
The German version looks like this:
 Abbau|NN   -0.058  Abbaus,Abbaues,Abbauen,Abbaue  
 Abbruch|NN -0.0048 Abbruches,Abbrüche,Abbruchs,Abbrüchen  
 Abdankung|NN   -0.0048 Abdankungen
 Abdämpfung|NN  -0.0048 Abdämpfungen  
 Abfall|NN  -0.0048 Abfalles,Abfälle,Abfalls,Abfällen  
 Abfuhr|NN  -0.3367 Abfuhren  

The English version:

charismatic
charitable
charm
charming
charmingly
chaste
cheaper
cheapest

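The German version follows this format: word|NN\tnumber <similar words comma separated>\n
The English version follows this format: word\n
The header of each file is also different, so you may want to skip it (in the English list it appears to be a block of descriptive text, not tweets or words from tweets).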
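
As a rough illustration of why the scores come out as zero, the sketch below feeds one German entry to `scan()` with the same arguments used for the English list. The entry is passed in through `text=` instead of a file, and the tab separators are an assumption about the SentiWS file layout.

```r
# One SentiWS entry, fields assumed to be tab-separated
sentiws_line <- "Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue"

# scan() splits on whitespace only, so the tag, the weight, and the
# comma-joined inflections each survive as a single token
tokens <- scan(text = sentiws_line, what = 'character',
               comment.char = ';', quiet = TRUE)
print(tokens)

# A lower-cased tweet word never equals any of these tokens,
# so match() returns NA and the word contributes nothing to the score
print(match("abbau", tokens))
```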

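The solution is either to bring both files into the same format and then proceed as before, or to make your code able to read both kinds of data.
Since your program already works for the English version, I suggest changing the format of the German list: replace every space or comma with \n, then remove every |NN and every number.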
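
One way to do that conversion is sketched below, assuming the SentiWS fields are tab-separated and that the part-of-speech tag may be something other than NN. The function name `convert_sentiws` and the sample entries are hypothetical; the sketch writes a temporary file in place of the real SentiWS download so that it runs on its own.

```r
# Hypothetical helper: rewrite a SentiWS file as one lower-case word per line
convert_sentiws <- function(infile, outfile) {
  lines <- readLines(infile, encoding = "UTF-8")
  fields <- strsplit(lines, "\t", fixed = TRUE)
  words <- unlist(lapply(fields, function(f) {
    # field 1 is "word|TAG": drop the tag; field 2 (the weight) is ignored
    base <- sub("\\|.*$", "", f[1])
    # field 3, if present, holds the comma-separated inflected forms
    inflections <- if (length(f) >= 3) {
      strsplit(f[3], ",", fixed = TRUE)[[1]]
    } else {
      character(0)
    }
    c(base, inflections)
  }))
  words <- unique(tolower(trimws(words)))
  words <- words[nzchar(words)]
  writeLines(words, outfile)
  invisible(words)
}

# Stand-in for SentiWS_v1.8c_Negative.txt with two sample entries
src <- tempfile(fileext = ".txt")
writeLines(c("Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue",
             "Abdankung|NN\t-0.0048\tAbdankungen"), src)

dst <- tempfile(fileext = ".txt")
convert_sentiws(src, dst)

# The converted file can now be loaded exactly like the English list
neg.words.de <- scan(dst, what = 'character', comment.char = ';', quiet = TRUE)
print(neg.words.de)
```

After this step the German words are in the same one-word-per-line layout as the English list, so the original scoring function can be used unchanged. Note that this discards the SentiWS weights, which the original scoring function does not use anyway.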

