Splitting coefficients into arrays that work for multiclass

I use this function to plot the best and worst features (coef) for each label.
 def plot_coefficients(classifier, feature_names, top_features=20):
     coef = classifier.coef_.ravel()
     for i in np.split(coef,6): 
        top_positive_coefficients = np.argsort(i)[-top_features:]
        top_negative_coefficients = np.argsort(i)[:top_features]
        top_coefficients = np.hstack([top_negative_coefficients, top_positive_coefficients])
     # create plot
     plt.figure(figsize=(15, 5))
     colors = ["red" if c < 0 else "blue" for c in i[top_coefficients]]
     plt.bar(np.arange(2 * top_features), i[top_coefficients], color=colors)
     feature_names = np.array(feature_names)
     plt.xticks(np.arange(1, 1 + 2 * top_features), feature_names[top_coefficients], rotation=60, ha="right")
     plt.show()

Applying it to sklearn.LinearSVC:

if (name == "LinearSVC"):   
    print(clf.coef_)
    print(clf.intercept_)
    plot_coefficients(clf, cv.get_feature_names())

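The CountVectorizer output has shape (15258, 26728). This is a multiclass decision problem with 6 labels. Using .ravel returns a flat array of length 6*26728=160368, which means every index above 26728 is out of bounds for axis 1. Here are the top and bottom indices for one label: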

i[ 0. 0. 0.07465654 ... -0.02112607  0. -0.13656274]
Top [39336 35593 29445 29715 36418 28631 28332 40843 34760 35887 48455 27753
 33291 54136 36067 33961 34644 38816 36407 35781]

i[ 0. 0. 0.07465654 ... -0.02112607  0. -0.13656274]
Bot [39397 40215 34521 39392 34586 32206 36526 42766 48373 31783 35404 30296
 33165 29964 50325 53620 34805 32596 34807 40895]

The first entry in the "top" list has index 39336. This corresponds to vocabulary entry 39336-26728=12608. What do I need to change in the code to make this work?

X_train = sparse.hstack([training_sentences,entities1train,predictionstraining_entity1,entities2train,predictionstraining_entity2,graphpath_training,graphpathlength_training])
y_train = DFTrain["R"]


X_test = sparse.hstack([testing_sentences,entities1test,predictionstest_entity1,entities2test,predictionstest_entity2,graphpath_testing,graphpathlength_testing])
y_test = DFTest["R"]

Dimensions: (15258, 26728) (15258, 26728) (0, 0) 1 ... (15257, 0) 1 (15258, 26728) (0, 0) 1 ... (15257, 0) 1 (15258, 26728) (15258L, 1L)

File "TwoFeat.py", line 708, in plot_coefficients
colors = ["red" if c < 0 else "blue" for c in i[top_coefficients]]
MemoryError

Have you tried index % 26728? - Ernest S Kirubakaran
Simply adding "top_coefficients = top_coefficients%26728" did it. Haha, thanks. - Mi.
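
Editor's note: the modulo fix in the comment above only maps indices correctly when every block stacked before the target column has the same width (26728 here). A more general approach is to record each block's starting column and look the index up. The block names and widths below are hypothetical stand-ins, not the asker's real data; this is a minimal sketch, not the asker's code.

```python
import numpy as np

# Hypothetical widths of the blocks passed to sparse.hstack, in order.
block_names = ["sentences", "entities1", "pred_entity1"]
block_widths = [4, 4, 1]

# Column at which each block starts in the stacked matrix: [0, 4, 8]
starts = np.concatenate([[0], np.cumsum(block_widths)[:-1]])
total_width = int(np.sum(block_widths))


def locate(column):
    """Map a column of the stacked matrix to (block name, index inside that block)."""
    if not 0 <= column < total_width:
        raise IndexError(f"column {column} is outside the stacked matrix")
    # Last block whose start is <= column
    block = int(np.searchsorted(starts, column, side="right")) - 1
    return block_names[block], int(column - starts[block])


print(locate(5))
```

With equal-width blocks the local index returned here equals `column % width`, which is why the modulo shortcut can work in that special case.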
1 Answer

First, do you really have to use ravel()?

LinearSVC (or in fact any other classifier that has coef_) outputs coef_ in the following form:

coef_ : array, shape = [n_features] if n_classes == 2 else [n_classes, n_features]

    Weights assigned to the features (coefficients in the primal problem).

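So this matrix has as many rows as there are classes and as many columns as there are features. For each class, you just need to access the right row. The order of the classes is given by the classifier.classes_ attribute.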
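
To make the relationship between the flattened array and the 2-D coef_ concrete, here is a minimal sketch. The small array `coef` is a hypothetical stand-in for a fitted classifier's coef_ (3 classes, 5 features); the values are made up for illustration.

```python
import numpy as np

# Stand-in for classifier.coef_ (hypothetical values):
# 3 classes (rows) by 5 features (columns).
coef = np.array([
    [ 0.2, -0.1,  0.0,  0.7, -0.4],
    [-0.3,  0.5,  0.1,  0.0,  0.2],
    [ 0.0,  0.0, -0.6,  0.3,  0.9],
])
n_classes, n_features = coef.shape

flat = coef.ravel()                # length n_classes * n_features
flat_index = int(np.argmax(flat))  # position in the flattened array

# Recover (class row, feature column) from the flat position.
row, col = np.unravel_index(flat_index, coef.shape)
assert (row, col) == divmod(flat_index, n_features)

# Working row by row keeps every index inside [0, n_features),
# so it can be used directly to index the feature names.
top_per_class = np.argsort(coef, axis=1)[:, -2:]
print(top_per_class)
```

Indexing row by row avoids the out-of-range feature indices that appear when argsort is applied to the flattened array.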

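Second, the indentation of your code is wrong. The plotting code should sit inside the for loop so that a plot is drawn for each class. Currently it is outside the loop's scope, so only the last class gets plotted.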

With these two things corrected, here is a reproducible example that plots the top and bottom features for each class.

def plot_coefficients(classifier, feature_names, top_features=20):

    # Access the coefficients from classifier
    coef = classifier.coef_

    # Access the classes
    classes = classifier.classes_

    # Iterate the loop for number of classes
    for i in range(len(classes)):


        print(classes[i])

        # Access the row containing the coefficients for this class
        class_coef = coef[i]


        # Below this, I have just replaced 'i' in your code with 'class_coef'
        # Pass this to get top and bottom features
        top_positive_coefficients = np.argsort(class_coef)[-top_features:]
        top_negative_coefficients = np.argsort(class_coef)[:top_features]

        # Concatenate the above two 
        top_coefficients = np.hstack([top_negative_coefficients, 
                                      top_positive_coefficients])
        # create plot
        plt.figure(figsize=(10, 3))

        colors = ["red" if c < 0 else "blue" for c in class_coef[top_coefficients]]
        plt.bar(np.arange(2 * top_features), class_coef[top_coefficients], color=colors)
        feature_names = np.array(feature_names)

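        # Tick positions start at 0 so each label sits under its bar
        # (your code started at 1, which shifted the labels).
        # There must be exactly one tick per label: 2 * top_features of them.
        plt.xticks(np.arange(2 * top_features),
                   feature_names[top_coefficients], rotation=60, ha="right")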
        plt.show()

Now you can use this method however you like:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space']

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)
vectorizer = CountVectorizer()




# Just to replace classes from integers to their actual labels, 
# you can use anything as you like in y
y = []
mapping_dict = dict(enumerate(dataset.target_names))
for i in dataset.target:
    y.append(mapping_dict[i])

# Learn the words from data
X = vectorizer.fit_transform(dataset.data)

clf = LinearSVC(random_state=42)
clf.fit(X, y)

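# get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0, call vectorizer.get_feature_names() instead
plot_coefficients(clf, vectorizer.get_feature_names_out())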

Output of the above code (one bar chart per class, each preceded by the printed class name; the plot images are not reproduced here):
'alt.atheism' 'comp.graphics' 'sci.space' 'talk.religion.misc'
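
If plotting is not possible (for example, while debugging index errors), the same per-row selection can return feature names as plain lists instead. This is a minimal sketch: `top_features_per_class` is a hypothetical helper, not part of scikit-learn, and the arrays in the usage lines are made-up stand-ins for coef_, classes_ and the vocabulary.

```python
import numpy as np


def top_features_per_class(coef, classes, feature_names, k=3):
    """Return {class: (k most negative names, k most positive names)}."""
    coef = np.atleast_2d(coef)
    names = np.asarray(feature_names)
    # Binary models store a single row; its positive side belongs to classes[1].
    if coef.shape[0] == 1 and len(classes) == 2:
        labels = [classes[1]]
    else:
        labels = list(classes)
    result = {}
    for label, row in zip(labels, coef):
        order = np.argsort(row)  # ascending: most negative first
        result[label] = (names[order[:k]].tolist(), names[order[-k:]].tolist())
    return result


# Hypothetical stand-ins for clf.coef_, clf.classes_ and the feature names.
demo_coef = np.array([[0.5, -0.2, 0.1, 0.0],
                      [-0.4, 0.3, 0.0, 0.9]])
demo = top_features_per_class(demo_coef, ["a", "b"], ["w0", "w1", "w2", "w3"], k=1)
print(demo)
```

With a fitted model, the call would be `top_features_per_class(clf.coef_, clf.classes_, feature_names)`.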

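This is very helpful. I can't run it, though, because if I remove .ravel I get either IndexError: index 36205 is out of bounds for axis 1 with size 26728, or a plain MemoryError - Mi.
@ThelMi I don't think using ravel() or not will make the memory error go away. As for the index error, please post the full error stack trace and the complete code. - Vivek Kumar
I was at least able to run the program with .ravel(). I edited my post, although I'm not sure it changes anything. The full code would be over 1000 lines, most of which has nothing to do with this problem. - Mi.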
