Scikit-learn AffinityPropagation MemoryError


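I think I already know the answer, but there are plenty of people smarter and more experienced than I am, so I thought I'd ask.

When I try to fit my hash_matrix (<class 'scipy.sparse.csr.csr_matrix'>) to AffinityPropagation, I get a MemoryError. It fails with only 10,000 samples, which is small compared with my actual dataset.

My problem: I like the results AffinityPropagation gives me on smaller datasets, but it's useless to me unless I can apply it to my large dataset.

My question: is it unrealistic to fit hundreds of thousands of items to AffinityPropagation on a standard laptop?

What I've learned:

  1. AffinityPropagation does not support partial_fit or incremental learning.
  2. The main drawback of AffinityPropagation is its time complexity.
  3. Affinity Propagation [is] most appropriate for small to medium sized datasets.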

The error thrown:

Traceback (most recent call last):
  File "C:/Users/my.name/Documents/my files/Programs/clustering_test/SOexample.py", line 68, in <module>
    aff.fit(hash_matrix)
  File "C:\Python34\lib\site-packages\sklearn\cluster\affinity_propagation_.py", line 301, in fit
    copy=self.copy, verbose=self.verbose, return_n_iter=True)
  File "C:\Python34\lib\site-packages\sklearn\cluster\affinity_propagation_.py", line 105, in affinity_propagation
    S += ((np.finfo(np.double).eps * S + np.finfo(np.double).tiny * 100) *
MemoryError

Full working code example:
import pandas as pd
import numpy as np

from nltk.stem import PorterStemmer

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn import cluster

data = ['10 news headlines', '3 current events in the news today',
 '5 day break in new york', '7 breaking news', '7 news breaking news',
 '7 news headlines', '7 news online', '7 news today', 'america current news',
 'america new york time', 'america news latest', 'america news online',
 'america news paper', 'america news today', 'america recent news',
 'american news channel', 'american news channels', 'any news today',
 'article about new york', 'article about new york city',
 'article in newspaper today', 'article news today', 'article today news',
 'articles about new york', 'articles on new york', 'articles usa',
 'best news channel', 'best news homepage', 'best newspaper websites',
 'big news stories', 'big news stories of 2013', 'break in new york',
 'break news today', 'break to new york', 'breaking cnn news',
 'breaking entertainment news', 'breaking global news', 'breaking headlines',
 'breaking international news', 'breaking international news today',
 'breaking latest news', 'breaking nation news', 'breaking new cnn',
 'breaking new for today', 'breaking new of today', 'breaking new today',
 'breaking news', 'breaking news america today',
 'breaking news and top stories', 'breaking news around the world',
 'breaking news around the world today', 'breaking news brooklyn',
 'breaking news cnn', 'breaking news cnn alerts', 'breaking news cnn live',
 'world important news today', 'world latest breaking news',
 'world latest news', 'world latest news headlines', 'world latest news today',
 'world latest news update', 'world latest news updates', 'world new headlines',
 'world new now', 'world new today', 'world news', 'world news articles',
 'world news articles today', 'world news breaking',
 'world news breaking headlines', 'world news cnn today',
 'world news current events', 'world news events', 'world news for this week',
 'world news for today', 'world news headline', 'world news headlines',
 'world news headlines daily nation', 'world news headlines today live',
 'world news highlights', 'world news latest headlines', 'world news now',
 'world news now cnn', 'world news recent', 'world news report',
 'world news sites', 'world news sources', 'world news stories',
 'world news today', 'world news today 2014', 'world news today cnn',
 'world news today headlines', 'world news today live',
 'world news today video', 'world news update', 'world news update today',
 'world news updates', 'world news updates today', 'world news video',
 'world news videos', 'world news website', 'world news websites',
 'world newspaper', 'world newspaper articles', 'world newspaper online',
 'world recent news', 'world times news', 'world today news', 'world top news',
 'world top news today', 'world updated news', 'world wide latest news',
 'world wide news today', 'worlds news', 'worlds news headlines',
 'worlds news today', 'worldwide breaking news', 'worldwide news today',
 'www.headline news today', 'www.headlines news', 'www.news headlines today',
 'www.news today.in', 'www.today news paper.com', 'www.todays news headlines',
 'www.todays news headlines.com', 'www.todays news.com',
 'www.world latest news']

#data = pd.read_csv('myfile.csv')['SomeColumn'].drop_duplicates().reset_index(drop=True).to_frame()[:10000]
#data.columns = ['Keyword']
#data = data['Keyword'].tolist()

stemmer = PorterStemmer()

# stem() is used here because recent NLTK releases no longer provide stem_word()
stemmed_data = [stemmer.stem(word) for word in data]

hasher = HashingVectorizer(stop_words='english', ngram_range=(1,2), analyzer='word')
hash_matrix = hasher.transform(stemmed_data)

aff = cluster.AffinityPropagation()
aff.fit(hash_matrix)

df = pd.DataFrame({'Keyword': data, 'Cluster': aff.labels_.tolist()})
grouped = df.groupby('Cluster').apply(lambda frame: frame['Keyword']).reset_index(1, drop=True).to_frame('Keyword')

I don't know why anyone would downvote this question. It's a good question and it's clearly asked. - Jarad
1 Answer

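Affinity Propagation needs quadratic memory to store the full pairwise distance (similarity) matrix.
So with 10,000 samples and double-precision floats, you need 10,000 × 10,000 × 8 = 800,000,000 bytes, or about 0.8 GB. If this matrix has to be copied at some point, you need at least 1.6 GB of RAM (not counting the input data and any overhead).
If you want to work with "hundreds of thousands" of samples, that is at least 10 times as many samples and therefore at least 100 times as much memory, i.e. 80 to 160 GB of RAM.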
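The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the helper name `dense_matrix_bytes` and its parameters are made up for illustration and are not part of scikit-learn, and it assumes a dense n × n matrix of 8-byte doubles and ignores the input data and any other overhead.

```python
def dense_matrix_bytes(n_samples, bytes_per_value=8, copies=1):
    """Bytes needed to hold `copies` dense n_samples x n_samples matrices.

    Hypothetical helper for illustration; assumes 8-byte doubles by default.
    """
    return copies * n_samples ** 2 * bytes_per_value


for n in (10000, 100000):
    one_matrix = dense_matrix_bytes(n)            # the matrix itself
    with_copy = dense_matrix_bytes(n, copies=2)   # the matrix plus one copy
    print("{:,} samples: {:.1f} GB for one matrix, {:.1f} GB with one copy".format(
        n, one_matrix / 1e9, with_copy / 1e9))
```

Because the cost grows with the square of the sample count, going from 10,000 to 100,000 samples multiplies the memory by 100, not by 10.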

Does this mean we can't use Affinity Propagation even on a dataset with only 10,000 rows, or is there some workaround? - Ketan Sahu
