如何使用权重在逻辑回归中获取特征重要性？

Question

如何使用权重在逻辑回归中获取特征重要性？

machine-learningscikit-learnlogistic-regressionsklearn-pandas

5

我有一个评论数据集，其中有正面/负面的类标签。我正在将逻辑回归应用于该评论数据集。首先，我将其转换为词袋模型。这里sorted_data ['Text']是评论，而final_counts是一个稀疏矩阵。

count_vect = CountVectorizer() 
final_counts = count_vect.fit_transform(sorted_data['Text'].values)
standardized_data = StandardScaler(with_mean=False).fit_transform(final_counts)

将数据集拆分为训练集和测试集。

X_1, X_test, y_1, y_test = cross_validation.train_test_split(final_counts, labels, test_size=0.3, random_state=0)
X_tr, X_cv, y_tr, y_cv = cross_validation.train_test_split(X_1, y_1, test_size=0.3)

我将按以下方式应用逻辑回归算法。

optimal_lambda = 0.001000
log_reg_optimal = LogisticRegression(C=optimal_lambda)

# fitting the model
log_reg_optimal.fit(X_tr, y_tr)

# predict the response
pred = log_reg_optimal.predict(X_test)

# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the Logistic Regression for C = %f is %f%%' % (optimal_lambda, acc))

我的权重是

weights = log_reg_optimal.coef_ .   #<class 'numpy.ndarray'>

array([[-0.23729528, -0.16050616, -0.1382504 , ...,  0.27291847,
         0.35857267,  0.41756443]])
(1, 38178) #shape of weights

我希望获取特征重要性，即具有高权重的前100个特征。有人可以告诉我如何获取它们吗？

- merkle

为什么不对“weights”的绝对值进行排序，然后保留前100个呢？ - seralouk

是的，我还需要前100个具有高权重的单词。 - merkle

2个回答

3

如果您正在使用逻辑回归模型，那么可以使用递归特征消除（RFE）方法从预测器列表中选择重要特征并过滤掉冗余特征。该功能在scikit-learn库中提供。您可以参考以下链接获取详细信息: https://machinelearningmastery.com/feature-selection-machine-learning-python/ 该方法将根据重要性对特征进行排名，您可以选择需要进一步分析的前n个特征。

- Gaurav Sitaula

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- seralouk · Accepted Answer

研究线性分类模型中给定特征/参数的"影响"或"重要性"的一种方法是考虑系数的大小。

这是最基本的方法。其他找到特征重要性或参数影响的技术可能会提供更多的见解，例如使用p值、自举分数、各种"判别指数"等。

在此，您已标准化数据，因此请直接使用此数据:

weights = log_reg_optimal.coef_
abs_weights = np.abs(weights)

print(abs_weights)

如果您查看原始的weights，那么负系数意味着相应特征的较高值将更多地推向负类分类。

编辑1

示例显示如何获取特征名称：

import numpy as np

#features names
names_of_variables =np.array(['a','b','c','d'])

#create random weights and get the magnitude
weights = np.random.rand(4)
abs_weights = np.abs(weights)

#get the sorting indices
sorted_index = np.argsort(abs_weights)[::-1]

#check if the sorting indices are correct
print(abs_weights[sorted_index])

#get the index of the top-2 features
top_2 = sorted_index[:2]

#get the names of the top 2 most important features
print(names_of_variables[top_2])