Using scikit-learn to classify into multiple categories

82
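I want to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of every algorithm I tried returns only one match.
For example, I have this piece of text:
"Theaters in New York compared to those in London"

I have trained the algorithm to pick a place for every text snippet I feed it.

In the example above I would like it to return New York and London, but it only returns New York.

Is it possible to return multiple results with scikit-learn? Or even to return the label with the next highest probability?

Thanks for your help.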

--- Update

I tried using OneVsRestClassifier, but I still get only one option back per piece of text. Here is the sample code I am using:

y_train = ('New York','London')


train_set = ("new york nyc big apple", "london uk great britain")
vocab = {'new york' :0,'nyc':1,'big apple':2,'london' : 3, 'uk': 4, 'great britain' : 5}
count = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=2),vocabulary=vocab)
test_set = ('nice day in nyc','london town','hello welcome to the big apple. enjoy it here and london too')

X_vectorized = count.transform(train_set).todense()
smatrix2  = count.transform(test_set).todense()


base_clf = MultinomialNB(alpha=1)

clf = OneVsRestClassifier(base_clf).fit(X_vectorized, y_train)
Y_pred = clf.predict(smatrix2)
print Y_pred

Result: ['New York' 'London' 'London']

5 Answers

111
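What you want is called multi-label classification, and scikit-learn can do it. See here: http://scikit-learn.org/dev/modules/multiclass.html. I'm not sure what is going wrong in your example; my version of sklearn apparently does not have WordNGramAnalyzer. Perhaps it is a matter of using more training examples or trying a different classifier? Note, though, that the multi-label classifier expects the target to be a list of tuples/lists of labels. The following works for me: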
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train = [[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1],[1],[0,1],[0,1]]
X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'hello welcome to new york. enjoy it here and london too'])   
target_names = ['New York', 'London']

classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_n=1,max_n=2)),  # newer scikit-learn releases use ngram_range=(1, 2) instead
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
for item, labels in zip(X_test, predicted):
    print '%s => %s' % (item, ', '.join(target_names[x] for x in labels))

For me, this produces the following output:

nice day in nyc => New York
welcome to london => London
hello welcome to new york. enjoy it here and london too => New York, London

1
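I tried removing the last two training examples, the ones that combine both city names, and I get: hello welcome to new york. enjoy it here and london too => New York. It no longer returns two labels. For me, it only returns two labels when I train on combinations of the two cities. Am I missing something? Thanks again for all your help. - CodeMonkeyB
1
This is just a toy dataset; I wouldn't draw too many conclusions from it. Have you tried this procedure on your real data? - mwv
4
@CodeMonkeyB: you should really accept this answer; it is correct from a programming point of view. Whether it works in practice depends on your data, not on the code. - Fred Foo
3
Has anyone else had problems with min_n and max_n? I needed to change them to ngram_range=(1,2) to get it to work. - emmagras
1
This approach no longer works: ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead. - sariii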

61

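Edit: Updated for Python 3 and scikit-learn 0.18.1, using MultiLabelBinarizer as suggested.

I've been working on this as well, and made a slight enhancement to mwv's excellent answer that may be useful. It takes text labels as input rather than binary labels and encodes them with MultiLabelBinarizer.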

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = mlb.inverse_transform(predicted)

for item, labels in zip(X_test, all_labels):
    print('{0} => {1}'.format(item, ', '.join(labels)))

This gives me the following output:

nice day in nyc => new york
welcome to london => london
london is rainy => london
it is raining in britian => london
it is raining in britian and the big apple => new york
it is raining in britian and nyc => london, new york
hello welcome to new york. enjoy it here and london too => london, new york
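
The question also asks whether the next most likely label can be returned. A minimal sketch of one way to approach that, assuming a pipeline of the same kind as the one above: OneVsRestClassifier exposes decision_function, which returns one score per label, so the labels can be ranked rather than only thresholded. The shortened training set and the helper name rank_labels are illustrative assumptions, not part of the answer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Shortened, illustrative version of the toy training data
X_train = ["new york is a hell of a town",
           "nyc is nice",
           "the big apple is great",
           "london is in the uk",
           "it rains a lot in london",
           "london hosts the british museum",
           "new york is great and so is london"]
y_train_text = [["new york"], ["new york"], ["new york"],
                ["london"], ["london"], ["london"],
                ["new york", "london"]]

# Encode the label lists as a binary indicator matrix
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

# TfidfVectorizer combines CountVectorizer and TfidfTransformer
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(X_train, Y)


def rank_labels(text):
    """Return (label, score) pairs sorted from highest to lowest score."""
    scores = model.decision_function([text])[0]  # one score per label
    order = np.argsort(scores)[::-1]             # indices, best first
    return [(mlb.classes_[i], float(scores[i])) for i in order]


ranked = rank_labels("nice day in nyc")
for label, score in ranked:
    print("{0}: {1:.3f}".format(label, score))
```

With LinearSVC these are signed margins rather than probabilities. As far as I understand the one-vs-rest implementation, predict assigns the labels whose score is above zero, so the ranking still shows the runner-up when predict returns a single label.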

13
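LabelBinarizer is outdated. Use lb = preprocessing.MultiLabelBinarizer() instead. - Roman
2
According to the scikit-learn documentation, One-Vs-All is supported by all linear models except sklearn.svm.SVC, and multilabel is supported by decision trees, random forests, and nearest neighbors. So I wouldn't use LinearSVC() for this type of task (that is, multilabel classification, which I assume is what you want). - PeterB
2
It is worth noting that the One-Vs-All that @mindstorm mentions corresponds to the scikit-learn class "OneVsRestClassifier" (note "Rest", not "All"). This scikit-learn help page clarifies it. - lucid_dreamer
1
As @mindstorm mentions, the documentation on this page does indeed state: "One-Vs-All: all linear models except sklearn.svm.SVC". However, another multilabel example in the scikit-learn documentation contains the line classif = OneVsRestClassifier(SVC(kernel='linear')). I'm confused. - lucid_dreamer
1
Is there a way to make it not assign any label? For example, if I add the test example "I want cookies", it labels it as both "New York" and "London". - Omar Meky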

8

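I ran into this as well. The problem for me was that my y_train was a sequence of strings rather than a sequence of sequences of strings. Apparently, OneVsRestClassifier decides whether to do multi-class or multi-label classification based on the format of the input labels. So change: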

y_train = ('New York','London')

to

y_train = (['New York'],['London'])

Apparently this behavior is going away in the future, because it breaks if all the labels are the same: https://github.com/scikit-learn/scikit-learn/pull/1987
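
To see which kind of problem scikit-learn infers from a given label format, the helper sklearn.utils.multiclass.type_of_target can be queried directly. This is a small sketch for recent scikit-learn releases, where multi-label targets are passed as a binary indicator matrix rather than as a sequence of sequences; the example arrays are made up for illustration.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# A flat sequence of labels: one label per sample
y_flat = ['New York', 'London', 'New York']

# A binary indicator matrix: one column per label, several 1s allowed per row
y_indicator = np.array([[1, 0],
                        [0, 1],
                        [1, 1]])

flat_kind = type_of_target(y_flat)
indicator_kind = type_of_target(y_indicator)

print(flat_kind)
print(indicator_kind)
```

Only the indicator matrix is reported as a multi-label target, which is the case in which OneVsRestClassifier can return more than one label per sample.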


8

Change this line to make it work in newer versions of Python:

# lb = preprocessing.LabelBinarizer()
lb = preprocessing.MultiLabelBinarizer()
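
As a rough illustration of why the swap matters, the sketch below contrasts the two encoders: LabelBinarizer expects exactly one label per sample, while MultiLabelBinarizer accepts a collection of labels per sample. The label values (including 'paris') are made up for the example.

```python
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

# One label per sample: suitable for LabelBinarizer
single = ['new york', 'london', 'paris']
lb = LabelBinarizer()
one_hot = lb.fit_transform(single)

# A list of labels per sample: needs MultiLabelBinarizer
multi = [['new york'], ['london'], ['new york', 'london']]
mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(multi)

print(lb.classes_)   # classes are stored in sorted order
print(one_hot)       # exactly one 1 per row
print(mlb.classes_)
print(indicator)     # a row may contain several 1s

# Map the indicator rows back to tuples of label names
print(mlb.inverse_transform(indicator))
```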

2

Here are some multi-class classification examples:

Example 1:

import numpy as np
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

arr2d = np.array([1, 2, 3,4,5,6,7,8,9,10,11,12,13,14,1])
transfomed_label = encoder.fit_transform(arr2d)
print(transfomed_label)

The output is:

[[1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0]]

Example 2:

import numpy as np
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

arr2d = np.array(['Leopard','Lion','Tiger', 'Lion'])
transfomed_label = encoder.fit_transform(arr2d)
print(transfomed_label)

The output is:

[[1 0 0]
 [0 1 0]
 [0 0 1]
 [0 1 0]]
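
Two follow-up details may be useful here, shown as a short sketch rather than as part of the answer: the encoding can be reversed with inverse_transform, and when only two classes are present LabelBinarizer produces a single column instead of two. The two-class list is an assumed example.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Three classes: one column per class, reversible with inverse_transform
encoder = LabelBinarizer()
labels = np.array(['Leopard', 'Lion', 'Tiger', 'Lion'])
encoded = encoder.fit_transform(labels)
decoded = encoder.inverse_transform(encoded)
print(decoded)

# Two classes only: the result has a single 0/1 column
binary_encoded = LabelBinarizer().fit_transform(['Lion', 'Tiger', 'Lion'])
print(binary_encoded.shape)
print(binary_encoded)
```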
