Please review my Python code for performance.

I'm working on an information retrieval task. I built a simple search engine. The inverted index is a Python dictionary object, serialized (pickled, in Python terminology) to a file. The size of this file is just 6.5 MB.
So my code merely has to unpickle it, search it for the query terms, and rank the matching documents by their TF-IDF scores. Doesn't sound like a big deal, right?
It started running 30 minutes ago and is still running. The private bytes and virtual size of the pythonw.exe process running my 100-line Python script are 88 MB and 168 MB respectively.
When I try it with a smaller index, it's fast. Is it Python or my code? Why is it so slow?
stopwords = ['a' , 'a\'s' , 'able' , 'about' , 'above' , 'according' , 'accordingly' , 'across' , 'actually' , 'after' , 'afterwards' , 'again' , 'against' , 'ain\'t' , 'all' , 'allow' , 'allows' , 'almost' , 'alone' , 'along' , 'already' , 'also' , 'although' , 'always' , 'am' , 'among' , 'amongst' , 'an' , 'and' , 'another' , 'any' , 'anybody' , 'anyhow' , 'anyone' , 'anything' , 'anyway' , 'anyways' , 'anywhere' , 'apart' , 'appear' , 'appreciate' , 'appropriate' , 'are' , 'aren\'t' , 'around' , 'as' , 'aside' , 'ask' , 'asking' , 'associated' , 'at' , 'available' , 'away' , 'awfully' , 'b' , 'be' , 'became' , 'because' , 'become' , 'becomes' , 'becoming' , 'been' , 'before' , 'beforehand' , 'behind' , 'being' , 'believe' , 'below' , 'beside' , 'besides' , 'best' , 'better' , 'between' , 'beyond' , 'both' , 'brief' , 'but' , 'by' , 'c' , 'c\'mon' , 'c\'s' , 'came' , 'can' , 'can\'t' , 'cannot' , 'cant' , 'cause' , 'causes' , 'certain' , 'certainly' , 'changes' , 'clearly' , 'co' , 'com' , 'come' , 'comes' , 'concerning' , 'consequently' , 'consider' , 'considering' , 'contain' , 'containing' , 'contains' , 'corresponding' , 'could' , 'couldn\'t' , 'course' , 'currently' , 'd' , 'definitely' , 'described' , 'despite' , 'did' , 'didn\'t' , 'different' , 'do' , 'does' , 'doesn\'t' , 'doing' , 'don\'t' , 'done' , 'down' , 'downwards' , 'during' , 'e' , 'each' , 'edu' , 'eg' , 'eight' , 'either' , 'else' , 'elsewhere' , 'enough' , 'entirely' , 'especially' , 'et' , 'etc' , 'even' , 'ever' , 'every' , 'everybody' , 'everyone' , 'everything' , 'everywhere' , 'ex' , 'exactly' , 'example' , 'except' , 'f' , 'far' , 'few' , 'fifth' , 'first' , 'five' , 'followed' , 'following' , 'follows' , 'for' , 'former' , 'formerly' , 'forth' , 'four' , 'from' , 'further' , 'furthermore' , 'g' , 'get' , 'gets' , 'getting' , 'given' , 'gives' , 'go' , 'goes' , 'going' , 'gone' , 'got' , 'gotten' , 'greetings' , 'h' , 'had' , 'hadn\'t' , 'happens' , 'hardly' , 'has' , 'hasn\'t' , 'have' , 'haven\'t' , 'having' , 'he' , 'he\'s' , 'hello' , 'help' , 'hence' , 'her' , 'here' , 'here\'s' , 'hereafter' , 'hereby' , 'herein' , 'hereupon' , 'hers' , 'herself' , 'hi' , 'him' , 'himself' , 'his' , 'hither' , 'hopefully' , 'how' , 'howbeit' , 'however' , 'i' , 'i\'d' , 'i\'ll' , 'i\'m' , 'i\'ve' , 'ie' , 'if' , 'ignored' , 'immediate' , 'in' , 'inasmuch' , 'inc' , 'indeed' , 'indicate' , 'indicated' , 'indicates' , 'inner' , 'insofar' , 'instead' , 'into' , 'inward' , 'is' , 'isn\'t' , 'it' , 'it\'d' , 'it\'ll' , 'it\'s' , 'its' , 'itself' , 'j' , 'just' , 'k' , 'keep' , 'keeps' , 'kept' , 'know' , 'knows' , 'known' , 'l' , 'last' , 'lately' , 'later' , 'latter' , 'latterly' , 'least' , 'less' , 'lest' , 'let' , 'let\'s' , 'like' , 'liked' , 'likely' , 'little' , 'look' , 'looking' , 'looks' , 'ltd' , 'm' , 'mainly' , 'many' , 'may' , 'maybe' , 'me' , 'mean' , 'meanwhile' , 'merely' , 'might' , 'more' , 'moreover' , 'most' , 'mostly' , 'much' , 'must' , 'my' , 'myself' , 'n' , 'name' , 'namely' , 'nd' , 'near' , 'nearly' , 'necessary' , 'need' , 'needs' , 'neither' , 'never' , 'nevertheless' , 'new' , 'next' , 'nine' , 'no' , 'nobody' , 'non' , 'none' , 'noone' , 'nor' , 'normally' , 'not' , 'nothing' , 'novel' , 'now' , 'nowhere' , 'o' , 'obviously' , 'of' , 'off' , 'often' , 'oh' , 'ok' , 'okay' , 'old' , 'on' , 'once' , 'one' , 'ones' , 'only' , 'onto' , 'or' , 'other' , 'others' , 'otherwise' , 'ought' , 'our' , 'ours' , 'ourselves' , 'out' , 'outside' , 'over' , 'overall' , 'own' , 'p' , 'particular' , 
'particularly' , 'per' , 'perhaps' , 'placed' , 'please' , 'plus' , 'possible' , 'presumably' , 'probably' , 'provides' , 'q' , 'que' , 'quite' , 'qv' , 'r' , 'rather' , 'rd' , 're' , 'really' , 'reasonably' , 'regarding' , 'regardless' , 'regards' , 'relatively' , 'respectively' , 'right' , 's' , 'said' , 'same' , 'saw' , 'say' , 'saying' , 'says' , 'second' , 'secondly' , 'see' , 'seeing' , 'seem' , 'seemed' , 'seeming' , 'seems' , 'seen' , 'self' , 'selves' , 'sensible' , 'sent' , 'serious' , 'seriously' , 'seven' , 'several' , 'shall' , 'she' , 'should' , 'shouldn\'t' , 'since' , 'six' , 'so' , 'some' , 'somebody' , 'somehow' , 'someone' , 'something' , 'sometime' , 'sometimes' , 'somewhat' , 'somewhere' , 'soon' , 'sorry' , 'specified' , 'specify' , 'specifying' , 'still' , 'sub' , 'such' , 'sup' , 'sure' , 't' , 't\'s' , 'take' , 'taken' , 'tell' , 'tends' , 'th' , 'than' , 'thank' , 'thanks' , 'thanx' , 'that' , 'that\'s' , 'thats' , 'the' , 'their' , 'theirs' , 'them' , 'themselves' , 'then' , 'thence' , 'there' , 'there\'s' , 'thereafter' , 'thereby' , 'therefore' , 'therein' , 'theres' , 'thereupon' , 'these' , 'they' , 'they\'d' , 'they\'ll' , 'they\'re' , 'they\'ve' , 'think' , 'third' , 'this' , 'thorough' , 'thoroughly' , 'those' , 'though' , 'three' , 'through' , 'throughout' , 'thru' , 'thus' , 'to' , 'together' , 'too' , 'took' , 'toward' , 'towards' , 'tried' , 'tries' , 'truly' , 'try' , 'trying' , 'twice' , 'two' , 'u' , 'un' , 'under' , 'unfortunately' , 'unless' , 'unlikely' , 'until' , 'unto' , 'up' , 'upon' , 'us' , 'use' , 'used' , 'useful' , 'uses' , 'using' , 'usually' , 'uucp' , 'v' , 'value' , 'various' , 'very' , 'via' , 'viz' , 'vs' , 'w' , 'want' , 'wants' , 'was' , 'wasn\'t' , 'way' , 'we' , 'we\'d' , 'we\'ll' , 'we\'re' , 'we\'ve' , 'welcome' , 'well' , 'went' , 'were' , 'weren\'t' , 'what' , 'what\'s' , 'whatever' , 'when' , 'whence' , 'whenever' , 'where' , 'where\'s' , 'whereafter' , 'whereas' , 'whereby' , 'wherein' , 'whereupon' , 'wherever' , 'whether' , 'which' , 'while' , 'whither' , 'who' , 'who\'s' , 'whoever' , 'whole' , 'whom' , 'whose' , 'why' , 'will' , 'willing' , 'wish' , 'with' , 'within' , 'without' , 'won\'t' , 'wonder' , 'would' , 'would' , 'wouldn\'t' , 'x' , 'y' , 'yes' , 'yet' , 'you' , 'you\'d' , 'you\'ll' , 'you\'re' , 'you\'ve' , 'your' , 'yours' , 'yourself' , 'yourselves' , 'z' , 'zero']
import PorterStemmer
import math
import pickle

def TF(term,doc):
    #Term Frequency: No. of times `term` occurred in `doc`
    global InvertedIndex
    idx = InvertedIndex[term].index(doc)
    count = 0
    while (idx < len(InvertedIndex[term])) and InvertedIndex[term][idx] == doc:
        count= count+1
        idx = idx+1
    return count

def DF(term):
    #Document Frequency: No. of documents containing `term`
    global InvertedIndex
    return len(set(InvertedIndex[term]))

def avgTF(term, doc):
    global docs
    TFs = []
    for term in docs[doc]:
        TFs.append(TF(term,doc))
    return sum(TFs)/len(TFs)

def maxTF(term, doc):
    global docs
    TFs = []    
    for term in docs[doc]:
        TFs.append(TF(term,doc))
    return max(TFs)

def getValues4Term(term, doc):
    TermFrequency = {}
    TermFrequency['natural'] = TF(term,doc)
    TermFrequency['log'] = 1+math.log( TF(term,doc) )
    TermFrequency['aug'] = 0.5+float(0.5*TF(term,doc)/maxTF(term,doc))
    TermFrequency['bool'] = 1 if TF(term,doc)>0 else 0
    TermFrequency['log_avg'] = float(1+math.log( TF(term,doc) ))/(1+math.log( avgTF(term,doc) ))

    DocumentFrequency = {}
    DocumentFrequency['no'] = 1
    DocumentFrequency['idf'] = math.log( len(docs)/DF(term) )
    DocumentFrequency['probIDF'] = max( [0, math.log( float(len(docs)-DF(term))/DF(term) )] )
    return [TermFrequency, DocumentFrequency]

def Cosine(resultDocVector, qVector, doc):
    #`doc` parameter is the document number corresponding to resultDocVector
    global qterms,docs
    # Defining Cosine similarity : cos(a) = A.B/|A||B|

    dotProduct = 0
    commonTerms_q_d = set(qterms).intersection(docs[doc]) #commonTerms in both query & document
    for cmnTerm in commonTerms_q_d:
       dotProduct =  dotProduct + resultDocVector[docs[doc].index(cmnTerm)] * qVector[qterms.index(cmnTerm)]

    resultSquares = []
    for k in resultDocVector:
        resultSquares.append(k*k)

    qSquares = []
    for k in qVector:
        qSquares.append(k*k)

    denominator = math.sqrt(sum(resultSquares)) * math.sqrt(sum(qSquares))
    return dotProduct/denominator

def load():
    #load index from a file
    global InvertedIndex, docIDs, docs
    PICKLE_InvertedIndex_FILE = open("InvertedIndex.db", 'rb')
    InvertedIndex = pickle.load(PICKLE_InvertedIndex_FILE)
    PICKLE_InvertedIndex_FILE.close()

    PICKLE_docIDs_FILE = open("docIDs.db", 'rb')
    docIDs = pickle.load(PICKLE_docIDs_FILE)
    PICKLE_docIDs_FILE.close()

    PICKLE_docs_FILE = open("docs.db", 'rb')
    docs = pickle.load(PICKLE_docs_FILE)
    PICKLE_docs_FILE.close()
########################
docs = []
docIDs = []
InvertedIndex = {}
load()

stemmer = PorterStemmer.PorterStemmer()
#<getting results for a query
query = 'Antarctica exploration'
qwords = query.strip().split()
qterms = []
qterms1 = []
for qword in qwords:
    qword = qword.lower()
    if qword in stopwords:
        continue
    qterm = stemmer.stem(qword,0,len(qword)-1) 
    qterms1.append(qterm)
qterms = list(set(qterms1))


#getting posting lists for each qterm & merging them
prev = set()
i = 0
for qterm in qterms:
    if InvertedIndex.has_key(qterm):
        if i == 0:
            prev = set(InvertedIndex[qterm])
            i = i+1
            continue
        prev = prev.intersection(set(InvertedIndex[qterm]))

results = list(prev)
#</getting results for a query

#We've got the results. Now let's rank them using Cosine similarity.
i = 0
docComponents = []
for doc in results:
        docComponents.append([])

i = 0    
for doc in results:
    for term in docs[doc]:
        vals = getValues4Term(term,doc)#[TermFrequency, DocumentFrequency]
        docComponents[i].append(vals)
    i = i+1
#Normalization = {}

# forming vectors for each document in the result
i = 0 #document iterator
j = 0 #term iterator
resultDocVectors = []#contains document vector for each result.

for doc in results:
        resultDocVectors.append([])

for i in range(0,len(results)):
    for j in range(0,len(docs[doc])):
        tf = docComponents[i][j][0]['natural']#0:TermFrequency
        idf = docComponents[i][j][1]['idf'] #1:DocumentFrequency        
        resultDocVectors[i].append(tf*idf)

#forming vector for query
qVector = []
qTF = []
qDF = []
for qterm in qterms:
    count = 0
    idx = qterms1.index(qterm)
    while idx < len(qterms1) and qterms1[idx] == qterm:
        count= count+1
        idx = idx+1
    qTF.append(count)
qVector = qTF    


#computing Cosine similarities of all resultDocVectors w.r.t. qVector
i = 0
CosineVals = []
for resultDocVector in resultDocVectors:
    doc = results[i]
    CosineVals.append(Cosine(resultDocVector, qVector, doc))
    i = i+1

#ranking as per Cosine Similarities
#this is not "perfect" sorting, as it may not give 100% correct results when multiple docs have the same cosine similarity.
CosineValsCopy = CosineVals
CosineVals.sort()
sortedCosineVals = CosineVals
CosineVals = CosineValsCopy
rankedResults = []
for cval in sortedCosineVals:    
    rankedResults.append(results[CosineVals.index(cval)])
rankedResults.reverse()

#<Evaluation of the system:>

#parsing qrels.txt & getting relevances
# qrels.txt contains columns of the form:
#       qid  iter  docno  rel
#2nd column `iter` can be ignored.
relevances = {}
fh = open("qrels.txt")
lines = fh.readlines()
for line in lines:
    cols = line.strip().split()
    if relevances.has_key(cols[0]):#queryID
        relevances[cols[0]].append(cols[2])#docID
    else:
        relevances[cols[0]] = [cols[2]]
fh.close()

#precision = no. of relevant docs retrieved/total no. of docs retrieved
no_of_relevant_docs_retrieved = set(rankedResults).intersection( set(relevances[queryID]) )
Precision = no_of_relevant_docs_retrieved/len(rankedResults)

#recall = no. of relevant docs retrieved/ total no. of relevant docs
Recall = no_of_relevant_docs_retrieved/len(relevances[queryID])

Post your code. How do we know it's your code if you don't post it? - jrockway
From a brief survey of your code, it's clear you don't know how to write Python ;) - aaronasterling
You need to narrow your test case down to something reasonably concise and self-contained that reproduces the problem. Do you expect people to figure out which parts of the code are irrelevant on their own, and to find a way to set up data sets just to be able to run it? Do the basic legwork before asking for help. - Glenn Maynard
"My code sucks, why can't Python make it run faster?" - Falmarri
Why, in God's name, do you have a Cosine function? Even C programmers with a "reinvent the wheel" attitude know to #include <math.h> - Rafe Kettler
5 Answers

18

It's definitely a problem with your code, but since you've chosen to hide it we can't help any further. From the very scant information you do provide, I can tell you that unpickling a dict (the right way) is much faster, and indexing into it (assuming that's what you mean by "searching the query") is blazingly fast. From this data I deduce that the cause of your slowdown must be something else you do, or do wrong, in your code.

Edit: now that you've posted your code, I notice at a glance that a lot of nontrivial code runs at module top level. That's really horrible practice, and detrimental to performance: put all your nontrivial code into a function, and call that function -- that alone will make the code tens of percent faster, at zero cost in complexity. I've mentioned that crucial fact at least 20 times in my Stack Overflow posts, not to mention books like Python in a Nutshell -- surely, if you care about performance, you can't blithely ignore such easily available and widely spread information?!
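For instance, the shape of that change (a minimal sketch; run_query is a hypothetical wrapper for everything that currently runs at module level after load()):

def run_query(query):
    # move all of the module-level code below load() in here;
    # names such as qterms, results and CosineVals then become
    # fast function locals instead of slow module globals
    pass

if __name__ == '__main__':
    load()
    run_query('Antarctica exploration')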

Time-wasters that are more easily fixed:

import pickle

use cPickle instead (assuming you're on Python 2.6 or 2.7; if you're on 3.1 instead, there may be other causes of the performance problem -- I don't know how well optimized 3.1 is at this point compared to 2.6 and 2.7).
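For example (assuming Python 2, where cPickle is a drop-in C implementation of the pickle interface):

try:
    import cPickle as pickle   # C-coded version, much faster
except ImportError:
    import pickle              # pure-Python fallback

The rest of the code can keep calling pickle.load() unchanged.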

all the global statements are useless, except the one in load (not a huge performance hit, but redundant and useless code should be eliminated on principle). You only need a global when you want to bind a module-global variable from inside a function, and load is the only function that does so.
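A quick illustration of that rule (a minimal sketch, not taken from the question's code):

counter = 0

def bump():
    global counter     # required: this function rebinds the module-global name
    counter = counter + 1

def show():
    print counter      # no global needed: the name is only read, never rebound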

Further changes:

Now for more important things: the values in InvertedIndex appear to be lists of documents, so to know how many times a document appears in one of them you have to loop over the whole list. Why not make each value a dict mapping each document to its number of occurrences instead? No more loops (and no loops where you currently take len(set(...)) of a value of InvertedIndex -- just the len of the equivalent dict, which also saves the set(...) operation that implicitly performs a loop of its own). This is a big-O optimization, not "just" the roughly-20% speedups I've mentioned so far -- that is, it's the more important stuff, as I said. With the right data structures and algorithms, lots of minor inefficiencies can become relatively unimportant; with the wrong ones, no amount of clever micro-tuning can save the code's performance as input sizes grow. ;-)
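A minimal sketch of that data-structure change (assuming, as the question's TF implies, that InvertedIndex currently maps each term to a list of document numbers, repeated once per occurrence):

# Before: InvertedIndex[term] == [3, 3, 3, 7, 12, 12]
# After:  InvertedIndex[term] == {3: 3, 7: 1, 12: 2}
def build_counts(old_index):
    new_index = {}
    for term, doc_list in old_index.items():
        counts = {}
        for doc in doc_list:
            counts[doc] = counts.get(doc, 0) + 1
        new_index[term] = counts
    return new_index

def TF(term, doc):
    return InvertedIndex[term].get(doc, 0)   # O(1) lookup instead of a list scan

def DF(term):
    return len(InvertedIndex[term])          # O(1) instead of len(set(...))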

And more: you compute things over and over, "from scratch" each time -- e.g., look at how often you call TF(term, doc) for any given term and doc (and each call has the crucial inefficiency I just explained). As a quick fix for that huge inefficiency, use memoization -- e.g., with the memoized decorator that can be found here.
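A minimal version of that idea (much simpler than the linked recipe; it assumes hashable positional arguments only):

def memoized(fn):
    cache = {}
    def wrapper(*args):
        if args not in cache:
            cache[args] = fn(*args)
        return cache[args]
    return wrapper

@memoized
def TF(term, doc):
    # equivalent to the question's while-loop, since a doc's
    # entries in the posting list are contiguous
    return InvertedIndex[term].count(doc)

After the first call, every repeated TF(term, doc) costs only a dict lookup; the same decorator can go on DF, maxTF and avgTF.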

OK, it's getting late over here and I'd better go to sleep -- I hope some or all of the above suggestions are of some use to you!


@Alex: Wouldn't it be better anyway to eliminate every possible use of globals? I have no working experience with Python, but I see this a lot in PHP too, and I clearly remember that my boss would have chopped my head off if he had seen me using globals (even necessary ones). - erikbstack
@erikb, yes, removing the globals entirely would indeed be better, but that takes more refactoring work than what I've pointed out so far, and it's mostly an advantage for code maintainability rather than performance (unless all of the globals can become locals of a single function and be passed as arguments to every other function that needs them -- but that requires even more code rework for a tiny performance gain, and in this answer I'm focusing on performance, since that's the OP's key issue right now). - Alex Martelli
@Alex: why does function-level code execute faster than module-level code? Forgive me, I'm not familiar with Python's inner workings. - Rafe Kettler
@Rafe Kettler: module-level variables are stored in a Python dict, while function-level variables are compiled internally into C-style array indexing. Dict lookup is slightly slower than array indexing, although both are O(1). - Lie Ryan
@Alex: +1 Thanks a lot, all of this is useful to me. I never knew that writing code at global scope had performance implications. I thought functions were just for proper organization and avoiding code duplication, like OOP. - claws
@claws - that's what they're for. But it means that badly organized code performs badly in Python, just as badly indented code doesn't work at all. - Omnifarious

5
Alex has given you good suggestions about algorithmic changes. I'm just going to address writing fast Python code. You should do both. If you only adopted my changes, you would still (according to what Alex said) have a broken program, but I'm not up for understanding your whole program right now, and I'm in the mood for micro-optimizing. Even if you end up throwing away a lot of these functions, comparing a slow implementation to a fast one will help you write fast implementations of new functions.
Take the following function:
def TF(term,doc):
    #Term Frequency: No. of times `term` occurred in `doc`
    global InvertedIndex
    idx = InvertedIndex[term].index(doc)
    count = 0
    while (idx < len(InvertedIndex[term])) and InvertedIndex[term][idx] == doc:
        count= count+1
        idx = idx+1
    return count

Rewrite it as:

def TF(term, doc):
    idx = InvertedIndex[term].index(doc)        
    return next(i + 1 for i, item in enumerate(InvertedIndex[term][idx:])
                if item != doc)

# Above struck out because the count method does the same thing and there was a bug
# in the implementation anyways.
InvertedIndex[term].count(doc)

This creates a generator expression that yields the ordered set of indexes, past the first index of doc, whose items are not equal to doc. You just evaluate the first element with the next function, and that becomes your count.
Some syntax (and the functions it relies on) that you'll definitely want to look up in the docs:
  • generator expressions (like list comprehensions, but better (unless you need a list ;))
  • list comprehensions
Last but not least, the most important (IMNAHAIPSBO) Python module: itertools. Here's another of your functions:
def maxTF(term, doc):
    global docs
    TFs = []    
    for term in docs[doc]:
        TFs.append(TF(term,doc))
    return max(TFs)

You can rewrite this with a generator expression:
def maxTF(term, doc):
    return max(TF(term, doc) for term in docs[doc])

Generator expressions are usually almost twice as fast as for loops.
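Claims like this are easy to check on your own machine with the standard timeit module; for example (a sketch -- absolute numbers will vary):

import timeit

loop_version = '''
result = []
for i in range(1000):
    result.append(i * i)
total = sum(result)
'''
genexp_version = 'total = sum(i * i for i in range(1000))'

print timeit.timeit(loop_version, number=1000)    # explicit loop + append
print timeit.timeit(genexp_version, number=1000)  # generator expression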
Finally, here's your Cosine function:
def Cosine(resultDocVector, qVector, doc):
    #`doc` parameter is the document number corresponding to resultDocVector
    global qterms,docs
    # Defining Cosine similarity : cos(a) = A.B/|A||B|

    dotProduct = 0
    commonTerms_q_d = set(qterms).intersection(docs[doc]) #commonTerms in both query & document
    for cmnTerm in commonTerms_q_d:
       dotProduct =  dotProduct + resultDocVector[docs[doc].index(cmnTerm)] * qVector[qterms.index(cmnTerm)]

    resultSquares = []
    for k in resultDocVector:
        resultSquares.append(k*k)

    qSquares = []
    for k in qVector:
        qSquares.append(k*k)

Let's rewrite it as:

def Cosine(resultDocVector, qVector, doc):
    doc = docs[doc]
    commonTerms_q_d = set(qterms).intersection(doc)
    dotProduct = sum(resultDocVector[doc.index(cmnTerm)] *qVector[qterms.index(cmnTerm)]
                     for cmnTerm in commonTerms_q_d)

    denominator = sum(k**2 for k in resultDocVector)
    denominator *= sum(k**2 for k in qVector)
    denominator = math.sqrt(denominator) 

    return dotProduct/denominator

Here, we've gotten rid of all the for loops. Code of the form

lst = []
for item in other_lst:
    lst.append(somefunc(item))

is the slowest possible way to build a list. First off, for/while loops are slow to begin with, and appending to a list is slow as well, so you have the worst of both worlds. A good attitude is to code as if loops were taxed (performance-wise): only pay the price of a loop when there's just no way to do the job with map or a comprehension, or when it makes your code more readable and you know it's not a bottleneck. Once you get used to comprehensions, they're very readable.
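Concretely, the three styles side by side (somefunc and other_lst are placeholders, not names from the question):

lst = [somefunc(item) for item in other_lst]   # comprehension: fastest and clearest
lst = map(somefunc, other_lst)                 # map: fine when you already have a function
lst = []                                       # explicit loop + append: the slow way
for item in other_lst:
    lst.append(somefunc(item))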


I think it's worth mentioning that even small changes to this function can improve the code and might have some teaching effect: for example replacing idx = idx+1 with idx += 1, replacing the while with a meaningful for loop, and, last but not least, merging them into an easy-to-understand one-liner the way you did. - erikbstack
+1 Thanks for this information; honestly, I knew none of it before. I learned Python from online tutorials, and I'm a C and assembly programmer, so my coding style follows from that. About remarks like "the slowest possible way to build a list" and "generator expressions are usually almost twice as fast as for loops" -- how do you know these things? I mean, is there a book that says which constructs are faster than others? If so, please recommend one. - claws
@erikb: does replacing a while with a meaningful for loop also bring a performance improvement? :O - claws
@claws: yes, basic programming skills will certainly improve a program's average performance. My suggestions hold for any language, C or higher-level; for assembly I don't know, since I never used it. As for your book question: almost any good programming book can help you. For example, look here: http://stackoverflow.com/questions/194812/list-of-freely-available-programming-books - erikbstack

4

These are smaller optimizations, in a similar spirit to @aaronasterling's above. Still, I think these observations are worth considering.

Use the right data types

stopwords should be a set. You can't search a list repeatedly and expect it to be fast.

Use sets more. They're iterable just like lists, but when you have to search them, they're much faster than lists.


List comprehensions

resultSquares = [k*k for k in resultDocVector]
qSquares = [k*k for k in qVector]
TFs = [TF(term,doc) for term in docs[doc]]

Generators

Turn this:

for qword in qwords:
    qword = qword.lower()
    if qword in stopwords:
        continue
    qterm = stemmer.stem(qword,0,len(qword)-1) 
    qterms1.append(qterm)
qterms = list(set(qterms1))

into this:

qworditer = (qword.lower() for qword in qwords if qword not in stopwords)
qtermiter = (stemmer.stem(qword,0,len(qword)-1) for qword in qworditer)
qterms1 = set([qterm for qterm in qtermiter])

Use generators and reduce():

Turn this:

prev = set()
i = 0
for qterm in qterms:
    if InvertedIndex.has_key(qterm):
        if i == 0:
            prev = set(InvertedIndex[qterm])
            i = i+1
            continue
        prev = prev.intersection(set(InvertedIndex[qterm]))

results = list(prev)

into this:

qtermiter = (set(InvertedIndex[qterm]) for qterm in qterms if qterm in InvertedIndex)
results = reduce(set.intersection, qtermiter)
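One caveat with this version: reduce raises a TypeError on an empty sequence with no initial value, so if none of the query terms occurs in the index you need a guard, e.g. (a small extension of the sketch above):

qtermsets = [set(InvertedIndex[qterm]) for qterm in qterms if qterm in InvertedIndex]
results = list(reduce(set.intersection, qtermsets)) if qtermsets else []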

Use list comprehensions

Don't write this:

i = 0
docComponents = []
for doc in results:
        docComponents.append([])

i = 0    
for doc in results:
    for term in docs[doc]:
        vals = getValues4Term(term,doc)#[TermFrequency, DocumentFrequency]
        docComponents[i].append(vals)
    i = i+1

Write this:

docComponents = [[getValues4Term(term, doc) for term in docs[doc]] for doc in results]

This code makes no sense:
for doc in results:
        resultDocVectors.append([])

for i in range(0,len(results)):
    for j in range(0,len(docs[doc])):
        tf = docComponents[i][j][0]['natural']#0:TermFrequency
        idf = docComponents[i][j][1]['idf'] #1:DocumentFrequency        
        resultDocVectors[i].append(tf*idf)

len(docs[doc]) depends on doc, and the value of doc is whatever it last was in the loop for doc in results.
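A version that does what was presumably intended (a sketch; it assumes docComponents is built per document as above, so docComponents[i] holds the [TermFrequency, DocumentFrequency] pairs for the terms of results[i]):

resultDocVectors = []
for i in range(len(results)):
    tfidf = [comp[0]['natural'] * comp[1]['idf']   # tf * idf for each term
             for comp in docComponents[i]]
    resultDocVectors.append(tfidf)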


Use collections.defaultdict

Don't do this:

relevances = {}
fh = open("qrels.txt")
lines = fh.readlines()
for line in lines:
    cols = line.strip().split()
    if relevances.has_key(cols[0]):#queryID
        relevances[cols[0]].append(cols[2])#docID
    else:
        relevances[cols[0]] = [cols[2]]

Assuming your file really has three fields on every line, write this:
from collections import defaultdict
relevances = defaultdict(list)
with open("qrels.txt") as fh:
    lineiter = (line.strip().split() for line in fh)
    for queryID, _, docID in lineiter:
        relevances[queryID].append(docID)

And, like everyone else says, memoize your calculations.


Update 2010-10-21 for stopwords.

from datetime import datetime

stopwords = ['a' , 'a\'s' , 'able' , 'about' , 'above' , 'according' , 'accordingly' , 'across' , 'actually' , 'after' , 'afterwards' , 'again' , 'against' , 'ain\'t' , 'all' , 'allow' , 'allows' , 'almost' , 'alone' , 'along' , 'already' , 'also' , 'although' , 'always' , 'am' , 'among' , 'amongst' , 'an' , 'and' , 'another' , 'any' , 'anybody' , 'anyhow' , 'anyone' , 'anything' , 'anyway' , 'anyways' , 'anywhere' , 'apart' , 'appear' , 'appreciate' , 'appropriate' , 'are' , 'aren\'t' , 'around' , 'as' , 'aside' , 'ask' , 'asking' , 'associated' , 'at' , 'available' , 'away' , 'awfully' , 'b' , 'be' , 'became' , 'because' , 'become' , 'becomes' , 'becoming' , 'been' , 'before' , 'beforehand' , 'behind' , 'being' , 'believe' , 'below' , 'beside' , 'besides' , 'best' , 'better' , 'between' , 'beyond' , 'both' , 'brief' , 'but' , 'by' , 'c' , 'c\'mon' , 'c\'s' , 'came' , 'can' , 'can\'t' , 'cannot' , 'cant' , 'cause' , 'causes' , 'certain' , 'certainly' , 'changes' , 'clearly' , 'co' , 'com' , 'come' , 'comes' , 'concerning' , 'consequently' , 'consider' , 'considering' , 'contain' , 'containing' , 'contains' , 'corresponding' , 'could' , 'couldn\'t' , 'course' , 'currently' , 'd' , 'definitely' , 'described' , 'despite' , 'did' , 'didn\'t' , 'different' , 'do' , 'does' , 'doesn\'t' , 'doing' , 'don\'t' , 'done' , 'down' , 'downwards' , 'during' , 'e' , 'each' , 'edu' , 'eg' , 'eight' , 'either' , 'else' , 'elsewhere' , 'enough' , 'entirely' , 'especially' , 'et' , 'etc' , 'even' , 'ever' , 'every' , 'everybody' , 'everyone' , 'everything' , 'everywhere' , 'ex' , 'exactly' , 'example' , 'except' , 'f' , 'far' , 'few' , 'fifth' , 'first' , 'five' , 'followed' , 'following' , 'follows' , 'for' , 'former' , 'formerly' , 'forth' , 'four' , 'from' , 'further' , 'furthermore' , 'g' , 'get' , 'gets' , 'getting' , 'given' , 'gives' , 'go' , 'goes' , 'going' , 'gone' , 'got' , 'gotten' , 'greetings' , 'h' , 'had' , 'hadn\'t' , 'happens' , 'hardly' , 'has' , 'hasn\'t' , 'have' , 'haven\'t' , 'having' , 'he' , 'he\'s' , 'hello' , 'help' , 'hence' , 'her' , 'here' , 'here\'s' , 'hereafter' , 'hereby' , 'herein' , 'hereupon' , 'hers' , 'herself' , 'hi' , 'him' , 'himself' , 'his' , 'hither' , 'hopefully' , 'how' , 'howbeit' , 'however' , 'i' , 'i\'d' , 'i\'ll' , 'i\'m' , 'i\'ve' , 'ie' , 'if' , 'ignored' , 'immediate' , 'in' , 'inasmuch' , 'inc' , 'indeed' , 'indicate' , 'indicated' , 'indicates' , 'inner' , 'insofar' , 'instead' , 'into' , 'inward' , 'is' , 'isn\'t' , 'it' , 'it\'d' , 'it\'ll' , 'it\'s' , 'its' , 'itself' , 'j' , 'just' , 'k' , 'keep' , 'keeps' , 'kept' , 'know' , 'knows' , 'known' , 'l' , 'last' , 'lately' , 'later' , 'latter' , 'latterly' , 'least' , 'less' , 'lest' , 'let' , 'let\'s' , 'like' , 'liked' , 'likely' , 'little' , 'look' , 'looking' , 'looks' , 'ltd' , 'm' , 'mainly' , 'many' , 'may' , 'maybe' , 'me' , 'mean' , 'meanwhile' , 'merely' , 'might' , 'more' , 'moreover' , 'most' , 'mostly' , 'much' , 'must' , 'my' , 'myself' , 'n' , 'name' , 'namely' , 'nd' , 'near' , 'nearly' , 'necessary' , 'need' , 'needs' , 'neither' , 'never' , 'nevertheless' , 'new' , 'next' , 'nine' , 'no' , 'nobody' , 'non' , 'none' , 'noone' , 'nor' , 'normally' , 'not' , 'nothing' , 'novel' , 'now' , 'nowhere' , 'o' , 'obviously' , 'of' , 'off' , 'often' , 'oh' , 'ok' , 'okay' , 'old' , 'on' , 'once' , 'one' , 'ones' , 'only' , 'onto' , 'or' , 'other' , 'others' , 'otherwise' , 'ought' , 'our' , 'ours' , 'ourselves' , 'out' , 'outside' , 'over' , 'overall' , 'own' , 'p' , 'particular' , 
'particularly' , 'per' , 'perhaps' , 'placed' , 'please' , 'plus' , 'possible' , 'presumably' , 'probably' , 'provides' , 'q' , 'que' , 'quite' , 'qv' , 'r' , 'rather' , 'rd' , 're' , 'really' , 'reasonably' , 'regarding' , 'regardless' , 'regards' , 'relatively' , 'respectively' , 'right' , 's' , 'said' , 'same' , 'saw' , 'say' , 'saying' , 'says' , 'second' , 'secondly' , 'see' , 'seeing' , 'seem' , 'seemed' , 'seeming' , 'seems' , 'seen' , 'self' , 'selves' , 'sensible' , 'sent' , 'serious' , 'seriously' , 'seven' , 'several' , 'shall' , 'she' , 'should' , 'shouldn\'t' , 'since' , 'six' , 'so' , 'some' , 'somebody' , 'somehow' , 'someone' , 'something' , 'sometime' , 'sometimes' , 'somewhat' , 'somewhere' , 'soon' , 'sorry' , 'specified' , 'specify' , 'specifying' , 'still' , 'sub' , 'such' , 'sup' , 'sure' , 't' , 't\'s' , 'take' , 'taken' , 'tell' , 'tends' , 'th' , 'than' , 'thank' , 'thanks' , 'thanx' , 'that' , 'that\'s' , 'thats' , 'the' , 'their' , 'theirs' , 'them' , 'themselves' , 'then' , 'thence' , 'there' , 'there\'s' , 'thereafter' , 'thereby' , 'therefore' , 'therein' , 'theres' , 'thereupon' , 'these' , 'they' , 'they\'d' , 'they\'ll' , 'they\'re' , 'they\'ve' , 'think' , 'third' , 'this' , 'thorough' , 'thoroughly' , 'those' , 'though' , 'three' , 'through' , 'throughout' , 'thru' , 'thus' , 'to' , 'together' , 'too' , 'took' , 'toward' , 'towards' , 'tried' , 'tries' , 'truly' , 'try' , 'trying' , 'twice' , 'two' , 'u' , 'un' , 'under' , 'unfortunately' , 'unless' , 'unlikely' , 'until' , 'unto' , 'up' , 'upon' , 'us' , 'use' , 'used' , 'useful' , 'uses' , 'using' , 'usually' , 'uucp' , 'v' , 'value' , 'various' , 'very' , 'via' , 'viz' , 'vs' , 'w' , 'want' , 'wants' , 'was' , 'wasn\'t' , 'way' , 'we' , 'we\'d' , 'we\'ll' , 'we\'re' , 'we\'ve' , 'welcome' , 'well' , 'went' , 'were' , 'weren\'t' , 'what' , 'what\'s' , 'whatever' , 'when' , 'whence' , 'whenever' , 'where' , 'where\'s' , 'whereafter' , 'whereas' , 'whereby' , 'wherein' , 'whereupon' , 'wherever' , 'whether' , 'which' , 'while' , 'whither' , 'who' , 'who\'s' , 'whoever' , 'whole' , 'whom' , 'whose' , 'why' , 'will' , 'willing' , 'wish' , 'with' , 'within' , 'without' , 'won\'t' , 'wonder' , 'would' , 'would' , 'wouldn\'t' , 'x' , 'y' , 'yes' , 'yet' , 'you' , 'you\'d' , 'you\'ll' , 'you\'re' , 'you\'ve' , 'your' , 'yours' , 'yourself' , 'yourselves' , 'z' , 'zero']
print len(stopwords)
dictfile = '/usr/share/dict/american-english-huge'
with open(dictfile) as f:
    words = [line.strip() for line in f]

print len(words)

s = datetime.now()
total = sum(1 for word in words if word in stopwords)
e = datetime.now()
elapsed = e - s
print elapsed, total

s = datetime.now()
stopwords_set = set(stopwords)
total = sum(1 for word in words if word in stopwords_set)
e = datetime.now()
elapsed = e - s
print elapsed, total

I get results like these:
# Using list
>>> print elapsed, total
0:00:06.902529 542

# Using set
>>> print elapsed, total
0:00:00.050676 542

The same results, but one runs almost 140 times faster. Granted, you probably don't have that many words to compare against your stopwords, and six seconds is negligible next to a 30+ minute runtime. But it does underline how much using the appropriate data structure can speed up your code.

2
It's good that you posted your code when people asked for it, but unless they run it and profile it, the best they can do is guess. I can guess too, but even when guesses are "good" or "educated", they are not a good way to find performance problems.
I would rather point you to a technique for finding out exactly where the problem is. That works better than guessing or asking others to guess. Once you've found out for yourself precisely where the problem lies, you can decide whether memoization or anything else is the way to fix it.
Usually there is more than one problem. If you repeat the process of finding and removing performance problems, you will approach a true optimum.
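One standard way to do that in Python (a sketch; it assumes the top-level code has been moved into a main() function, as the first answer suggests) is the built-in cProfile module:

import cProfile
import pstats

cProfile.run('main()', 'search.prof')             # run the program under the profiler
stats = pstats.Stats('search.prof')
stats.sort_stats('cumulative').print_stats(10)    # the ten most expensive call paths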

1
Does Python cache function results? I don't think so. If not, then running a looping function like TF(term, doc) several times is probably a bad idea, since you could already get a huge speedup by putting the result into a variable. Optimizing getValues4Term() that way helps even more, since this loop:
for doc in results:
    for term in docs[doc]:
        vals = getValues4Term(term,doc)

is probably already the biggest speed problem you face.


Good point! Can you tell me what I should look at for caching? - claws
Alex already described how to use memoization, which works well for Python. A solution that works in every language is to store the result of TF(term, doc) in a variable at the start of the function, and then just use that variable instead of calling TF(term, doc). - erikbstack
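For instance, here is a sketch of the question's getValues4Term() with the repeated calls hoisted into locals (same behavior, apart from a float() added to avoid Python 2 integer division in the idf formula):

def getValues4Term(term, doc):
    tf = TF(term, doc)       # previously computed five times
    mtf = maxTF(term, doc)
    atf = avgTF(term, doc)
    df = DF(term)
    TermFrequency = {
        'natural': tf,
        'log': 1 + math.log(tf),
        'aug': 0.5 + 0.5 * tf / mtf,
        'bool': 1 if tf > 0 else 0,
        'log_avg': (1 + math.log(tf)) / (1 + math.log(atf)),
    }
    DocumentFrequency = {
        'no': 1,
        'idf': math.log(float(len(docs)) / df),   # float() avoids integer division
        'probIDF': max(0, math.log(float(len(docs) - df) / df)),
    }
    return [TermFrequency, DocumentFrequency]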
