ValueError: empty vocabulary; perhaps the documents only contain stop words


I'm using the scikit library for the first time and I'm getting this error:

ValueError: empty vocabulary; perhaps the documents only contain stop words
  File "C:\Users\A605563\Desktop\velibProjetPreso\TraitementTwitterDico.py", line 33, in <module>
    X_train_counts = count_vect.fit_transform(FileTweets)
  File "C:\Python27\Lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Python27\Lib\site-packages\sklearn\feature_extraction\text.py", line 751, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only contain stop words")

But I don't understand why this is happening.
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy
import unicodedata
import nltk


TweetsFile = open('tweets2015-08-13.csv', 'r+')
f2 = open('analyzer.txt', 'a')
print TweetsFile.readline()
count_vect = CountVectorizer(strip_accents='ascii')
FileTweets =  TweetsFile.read()
FileTweets = FileTweets.decode('latin1')
FileTweets = unicodedata.normalize('NFKD', FileTweets).encode('ascii','ignore')
print FileTweets
for line in TweetsFile:
    f2.write(line.replace('\n', ' '))
TweetsFile = f2
print type(FileTweets)
X_train_counts = count_vect.fit_transform(FileTweets)
print X_train_counts.shape
TweetsFile.close()

My data is raw tweets:

11/8/2015 @ Paris Marriott Champs Elysees Hotel "
2015-08-11 21:27:15,"I'm at Paris Marriott Hotel Champs-Elysees in Paris, FR <https://t.co/gAFspVw6FC>"
2015-08-11 21:24:08,"I'm at Four Seasons Hotel George V in Paris, Ile-de-France <https://t.co/dtPALvziWy>"
2015-08-11 21:22:11,    . @ Avenue des Champs-Elysees <https://t.co/8b7U05OAxG>
2015-08-11 20:54:18,Her pistol go @ Raspoutine Paris (Official) <https://t.co/le9l3dtdgM>
2015-08-11 20:50:14,"Desde Paris, con amor. @ Avenue des Champs-Elysees <https://t.co/R68JV3NT1z>"

Does anyone know what's going on here?


I'm not familiar with this library, but shouldn't you be passing the file or some other argument to CountVectorizer(strip_accents='ascii')? - SuperBiasedMan
My guess is that FileTweets is empty when you run count_vect.fit_transform(FileTweets). Can you show what FileTweets looks like? - Harpal
When I print FileTweets I get the following: 11/8/2015 @ Paris Marriott Champs Elysees Hotel " 2015-08-11 21:27:15,"I'm at Paris Marriott Hotel Champs-Elysees in Paris, FR https://t.co/gAFspVw6FC" 2015-08-11 21:24:08,"I'm at Four Seasons Hotel George V in Paris, Ile-de-France https://t.co/dtPALvziWy" 2015-08-11 21:22:11,. @ Avenue des Champs-Elysees https://t.co/8b7U05OAxG 2015-08-11 20:54:18,Her pistol go @ Raspoutine Paris (Official) https://t.co/le9l3dtdgM 2015-08-11 20:50:14,"Desde Paris, con amor. @ Avenue des Champs-Elysees https://t.co/R68JV3NT1z" That's a short excerpt. - Honolulu
Hmm, the punctuation may be the problem. Try removing all the ' and " characters. I just ran your output and it worked fine for me, though I did have to remove all the quotes first. - Harpal
I don't see any data at that link. - Harpal
2 Answers


Here is a simpler solution:

from sklearn.feature_extraction.text import CountVectorizer

x = open('bad_words_train.txt', 'r+')
count_vect = CountVectorizer()  # iterating the open file yields one line per document
X_train = count_vect.fit_transform(x)
print(X_train)
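
The reason this works, and the reason the original code fails: fit_transform expects an iterable of documents. An open file is iterated line by line, so each line becomes a document, whereas a single string is iterated character by character; the default token pattern only keeps tokens of two or more word characters, so every one-character "document" tokenizes to nothing and the vocabulary comes out empty. A minimal sketch of the difference (the example strings are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I'm at Paris Marriott", "Her pistol go"]  # an iterable of documents
print(CountVectorizer().fit_transform(docs).shape)  # one row per document

try:
    # a single string is iterated character by character, so every
    # "document" is one letter and tokenizes to nothing
    CountVectorizer().fit_transform("I'm at Paris Marriott")
except ValueError as e:
    print(e)  # empty vocabulary; perhaps the documents only contain stop words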

I found a solution:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
import unicodedata
import nltk
from StringIO import StringIO  # Python 2 module; io.StringIO would require unicode input


TweetsFile = open('tweets2015-08-13.csv', 'r+')
yourResult = [line.split(',') for line in TweetsFile.readlines()]
count_vect = CountVectorizer(input="file")
# input="file" makes the vectorizer call .read() on each document,
# so each line is re-joined and wrapped in a file-like StringIO object
docs_new = [StringIO(','.join(x)) for x in yourResult]
X_train_counts = count_vect.fit_transform(docs_new)
vocab = count_vect.get_feature_names()
print X_train_counts.shape
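
The StringIO wrapping is only needed because of input="file"; with the default input="content", the lines can be passed as plain strings. A roughly equivalent sketch, assuming the same CSV file:

from sklearn.feature_extraction.text import CountVectorizer

TweetsFile = open('tweets2015-08-13.csv', 'r+')
count_vect = CountVectorizer()  # default input="content": documents are plain strings
X_train_counts = count_vect.fit_transform(TweetsFile.readlines())
print X_train_counts.shape
TweetsFile.close()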
