I am trying to run w2v on this data sample:
Statement Label
Says the Annies List political group supports third-trimester abortions on demand. FALSE
When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration. TRUE
"Hillary Clinton agrees with John McCain ""by voting to give George Bush the benefit of the doubt on Iran.""" TRUE
Health care reform legislation is likely to mandate free sex change surgeries. FALSE
The economic turnaround started at the end of my term. TRUE
The Chicago Bears have had more starting quarterbacks in the last 10 years than the total number of tenured (UW) faculty fired during the last two decades. TRUE
Jim Dunnam has not lived in the district he represents for years now. FALSE
using the code provided in this GitHub folder (FeatureSelection.py):
https://github.com/nishitpatel01/Fake_News_Detection
I would like to include word2vec features in a Naive Bayes model. First, I took X and y and used train_test_split:
X = df['Statement']
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
dataset = pd.concat([X_train, y_train], axis=1)
This is the code I am currently using:
#Using Word2Vec
with open("glove.6B.50d.txt", "rb") as lines:
    w2v = {line.split()[0]: np.array(map(float, line.split()[1:]))
           for line in lines}
training_sentences = DataPrep.train_news['Statement']
model = gensim.models.Word2Vec(training_sentences, size=100) # x be tokenized text
w2v = dict(zip(model.wv.index2word, model.wv.syn0))
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(word2vec.itervalues().next())
    def fit(self, X, y): # what are X and y?
        return self
    def transform(self, X): # should it be training_sentences?
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
"""
class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = len(word2vec.itervalues().next())
    def fit(self, X, y):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        # if a word was never seen - it must be at least as infrequent
        # as any of the known words - so the default idf is the max of
        # known idf's
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])
        return self
    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] * self.word2weight[w]
                     for w in words if w in self.word2vec] or
                    [np.zeros(self.dim)], axis=0)
            for words in X
        ])
"""
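Two details in the listing above are Python 2 idioms that behave differently under the Python 3.7 shown in the traceback: `np.array(map(float, ...))` wraps a lazy `map` object instead of building a numeric vector, and `dict.itervalues()` no longer exists. Below is a minimal sketch of a Python 3 version of the loading step, assuming a GloVe-style text format (one word followed by space-separated floats per line). The function name `load_text_vectors` and the two-word in-memory sample are made up for illustration.

```python
import io

import numpy as np


def load_text_vectors(lines):
    """Parse GloVe-style text lines ("word v1 v2 ...") into {word: vector}."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        # A list comprehension materialises the floats; a bare map() would not.
        vectors[word] = np.array([float(v) for v in parts[1:]], dtype=np.float32)
    return vectors


# Hypothetical two-word sample standing in for glove.6B.50d.txt.
sample = io.StringIO("coal 0.1 0.2 0.3\ngas 0.4 0.5 0.6\n")
w2v = load_text_vectors(sample)

# Python 3 replacement for word2vec.itervalues().next()
dim = len(next(iter(w2v.values())))
print(dim)
```

For the real file, the same function could be called as `load_text_vectors(open("glove.6B.50d.txt", encoding="utf-8"))`, reading in text mode so that the keys are `str` and not `bytes`.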
In classifier.py, I am running:
nb_pipeline = Pipeline([
('NBCV',FeaturesSelection.w2v),
('nb_clf',MultinomialNB())])
However, this did not work, and I got the following error:
TypeError Traceback (most recent call last)
<ipython-input-14-07045943a69c> in <module>
2 nb_pipeline = Pipeline([
3 ('NBCV',FeaturesSelection.w2v),
----> 4 ('nb_clf',MultinomialNB())])
/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
71 FutureWarning)
72 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73 return f(**kwargs)
74 return inner_f
75
/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in __init__(self, steps, memory, verbose)
112 self.memory = memory
113 self.verbose = verbose
--> 114 self._validate_steps()
115
116 def get_params(self, deep=True):
/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _validate_steps(self)
160 "transformers and implement fit and transform "
161 "or be the string 'passthrough' "
--> 162 "'%s' (type %s) doesn't" % (t, type(t)))
163
164 # We allow last estimator to be None as an identity transformation
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '{' ': array([-0.17019527, 0.32363772, -0.0770281 , -0.0278154 , -0.05182227, ....
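The message says that every step before the last one must expose `fit` and `transform`. `FeaturesSelection.w2v` is a plain dict (the error prints its contents), so it has neither. The sketch below is a rough imitation of that check, not scikit-learn's actual implementation; `looks_like_transformer` and `TinyVectorizer` are illustrative names.

```python
def looks_like_transformer(step):
    """Roughly what a pipeline requires of an intermediate step."""
    if isinstance(step, str):
        return step == "passthrough"
    has_fit = hasattr(step, "fit") or hasattr(step, "fit_transform")
    return has_fit and hasattr(step, "transform")


class TinyVectorizer:
    """Hypothetical wrapper that gives the dict a fit/transform interface."""

    def __init__(self, word2vec):
        self.word2vec = word2vec

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


w2v = {" ": [-0.17, 0.32]}  # a dict like the one printed in the error message
print(looks_like_transformer(w2v))                  # False
print(looks_like_transformer(TinyVectorizer(w2v)))  # True
```

In other words, the pipeline step should be an instance of a vectorizer class built around the dict, not the dict itself.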
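I am using all of the programs in that folder, so the code should be reproducible if you use them as well.
It would be great if you could explain how to fix this and what other changes the code needs. My goal is to compare models (Naive Bayes, Random Forest, and so on) using BoW, TF-IDF, and Word2Vec.
Update:
Following the answer below (from Ismail), I updated the code as follows: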
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec, size=100):
        self.word2vec = word2vec
        self.dim = size
and
#building Linear SVM classifier
svm_pipeline = Pipeline([
('svmCV',FeaturesSelection_W2V.MeanEmbeddingVectorizer(FeaturesSelection_W2V.w2v)),
('svm_clf',svm.LinearSVC())
])
svm_pipeline.fit(DataPrep.train_news['Statement'], DataPrep.train_news['Label'])
predicted_svm = svm_pipeline.predict(DataPrep.test_news['Statement'])
np.mean(predicted_svm == DataPrep.test_news['Label'])
However, I am still getting errors.
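One thing worth checking at this point: the key printed in the traceback above is `' '`, a single space, which suggests the vocabulary is made of characters and not words. That is what happens when raw strings are iterated instead of token lists, and both `Word2Vec(training_sentences, ...)` and the `for w in words` loop in `transform` iterate their input this way. Here is a small sketch of the difference, assuming plain whitespace tokenization (a real pipeline would likely use a proper tokenizer):

```python
statements = [
    "The economic turnaround started at the end of my term.",
    "Jim Dunnam has not lived in the district he represents for years now.",
]

# Iterating a raw string yields single characters, including spaces.
chars = [w for w in statements[0]][:5]
print(chars)  # ['T', 'h', 'e', ' ', 'e']


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple stand-in."""
    return text.lower().split()


# Iterating a token list yields words.
tokenized = [tokenize(s) for s in statements]
print(tokenized[0][:3])  # ['the', 'economic', 'turnaround']
```

Passing `tokenized` (lists of tokens) to both the embedding model and the vectorizer keeps the dictionary keys and the lookups at the word level.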
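There is too much code here. Please consider using [ask] and [reprex]. Also, your sklearn transformer should inherit from BaseEstimator and TransformerMixin to run in a sklearn pipeline, but I don't know whether that is enough to make your program execute, because I don't know how to run it. - Sergey Bushmanov
Every step of a Pipeline must implement fit and transform. Instead, you are passing a dict. But that is the least of your problems, because the code that defines w2v makes no sense to begin with. First, you load pre-trained vectors into a dictionary. Then you define a Word2Vec model. Next, you create a dictionary by zipping the words from the GloVe file with the model's untrained vectors. (.-.) - janluke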
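Putting the two comments together, here is a minimal sketch of a vectorizer that inherits from `BaseEstimator` and `TransformerMixin` and is passed to the `Pipeline` as an instance, not as the dict. The toy `word_vectors` dict, the four training statements, and the whitespace tokenization are assumptions for illustration; they are not taken from the repository.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


class MeanEmbeddingVectorizer(BaseEstimator, TransformerMixin):
    """Represent each statement by the average of its word vectors."""

    def __init__(self, word2vec, dim):
        self.word2vec = word2vec
        self.dim = dim

    def fit(self, X, y=None):
        return self  # nothing to learn; the vectors are supplied

    def transform(self, X):
        rows = []
        for text in X:
            vecs = [self.word2vec[w] for w in text.lower().split()
                    if w in self.word2vec]
            # Fall back to a zero vector when no word is known.
            rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(self.dim))
        return np.array(rows)


# Toy 2-dimensional vectors (made up), standing in for trained embeddings.
word_vectors = {
    "coal": np.array([1.0, 0.0]), "gas": np.array([0.9, 0.1]),
    "bears": np.array([0.0, 1.0]), "quarterbacks": np.array([0.1, 0.9]),
}
X_train = ["coal gas", "gas coal coal", "bears quarterbacks", "quarterbacks bears"]
y_train = ["TRUE", "TRUE", "FALSE", "FALSE"]

svm_pipeline = Pipeline([
    ("w2v", MeanEmbeddingVectorizer(word_vectors, dim=2)),
    ("svm_clf", LinearSVC()),
])
svm_pipeline.fit(X_train, y_train)

features = svm_pipeline.named_steps["w2v"].transform(["coal unknown", "nothing here"])
print(features.shape)
```

Averaged embeddings can contain negative values, which `MultinomialNB` rejects, so for the Naive Bayes comparison a variant such as `GaussianNB` may be a better fit for these features; the sketch uses `LinearSVC`, as in the update.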