如何优化我的PageRank计算？

Question

如何优化我的PageRank计算？

3

在书籍集体智慧编程中，我找到了以下用于计算PageRank的函数：

def calculatepagerank(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute("drop table if exists pagerank")
    self.con.execute("create table pagerank(urlid primary key,score)")
    self.con.execute("create index prankidx on pagerank(urlid)")

    # initialize every url with a PageRank of 1.0
    self.con.execute("insert into pagerank select rowid,1.0 from urllist")
    self.dbcommit()

    for i in range(iterations):
        print "Iteration %d" % i
        for (urlid,) in self.con.execute("select rowid from urllist"):
            pr=0.15

            # Loop through all the pages that link to this one
            for (linker,) in self.con.execute("select distinct fromid from link where toid=%d" % urlid):
                # Get the PageRank of the linker
                linkingpr=self.con.execute("select score from pagerank where urlid=%d" % linker).fetchone()[0]

                # Get the total number of links from the linker
                linkingcount=self.con.execute("select count(*) from link where fromid=%d" % linker).fetchone()[0]

                pr+=0.85*(linkingpr/linkingcount)

            self.con.execute("update pagerank set score=%f where urlid=%d" % (pr,urlid))
        self.dbcommit()

然而，由于每次迭代中的所有SQL查询，这个函数非常缓慢。

>>> import cProfile
>>> cProfile.run("crawler.calculatepagerank()")
         2262510 function calls in 136.006 CPU seconds

   Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000  136.006  136.006 <string>:1(<module>)
     1   20.826   20.826  136.006  136.006 searchengine.py:179(calculatepagerank)
    21    0.000    0.000    0.528    0.025 searchengine.py:27(dbcommit)
    21    0.528    0.025    0.528    0.025 {method 'commit' of 'sqlite3.Connecti
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler
1339864  112.602    0.000  112.602    0.000 {method 'execute' of 'sqlite3.Connec 
922600    2.050    0.000    2.050    0.000 {method 'fetchone' of 'sqlite3.Cursor' 
     1    0.000    0.000    0.000    0.000 {range}

所以我优化了这个函数，得出了这个结果：

def calculatepagerank2(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute("drop table if exists pagerank")
    self.con.execute("create table pagerank(urlid primary key,score)")
    self.con.execute("create index prankidx on pagerank(urlid)")

    # initialize every url with a PageRank of 1.0
    self.con.execute("insert into pagerank select rowid,1.0 from urllist")
    self.dbcommit()

    inlinks={}
    numoutlinks={}
    pagerank={}

    for (urlid,) in self.con.execute("select rowid from urllist"):
        inlinks[urlid]=[]
        numoutlinks[urlid]=0
        # Initialize pagerank vector with 1.0
        pagerank[urlid]=1.0
        # Loop through all the pages that link to this one
        for (inlink,) in self.con.execute("select distinct fromid from link where toid=%d" % urlid):
            inlinks[urlid].append(inlink)
            # get number of outgoing links from a page        
            numoutlinks[urlid]=self.con.execute("select count(*) from link where fromid=%d" % urlid).fetchone()[0]            

    for i in range(iterations):
        print "Iteration %d" % i

        for urlid in pagerank:
            pr=0.15
            for link in inlinks[urlid]:
                linkpr=pagerank[link]
                linkcount=numoutlinks[link]
                pr+=0.85*(linkpr/linkcount)
            pagerank[urlid]=pr
    for urlid in pagerank:
        self.con.execute("update pagerank set score=%f where urlid=%d" % (pagerank[urlid],urlid))
    self.dbcommit()

这个函数比起每次迭代都进行不必要的SQL查询，速度快了很多（但是会使用更多内存来存储所有临时字典）。

>>> cProfile.run("crawler.calculatepagerank2()")
     90070 function calls in 3.527 CPU seconds
Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.004    0.004    3.527    3.527 <string>:1(<module>)
     1    1.154    1.154    3.523    3.523 searchengine.py:207(calculatepagerank2
     2    0.000    0.000    0.058    0.029 searchengine.py:27(dbcommit)
 23065    0.013    0.000    0.013    0.000 {method 'append' of 'list' objects}
     2    0.058    0.029    0.058    0.029 {method 'commit' of 'sqlite3.Connectio
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler
 43932    2.261    0.000    2.261    0.000 {method 'execute' of 'sqlite3.Connecti
 23065    0.037    0.000    0.037    0.000 {method 'fetchone' of 'sqlite3.Cursor'
     1    0.000    0.000    0.000    0.000 {range}

但是有没有可能进一步减少SQL查询的数量，以加快函数的速度呢？更新: 修复了calculatepagerank2()中的缩进问题。

- asmaier

4个回答

1

我相信大部分时间都花在了这些SQL查询上：

for (urlid,) in self.con.execute("select rowid from urllist"):
    ...
    for (inlink,) in self.con.execute("select distinct fromid from link where toid=%d" % urlid):
        ...
        numoutlinks[urlid]=self.con.execute("select count(*) from link where fromid=%d" % urlid).fetchone()[0]

假设您有足够的内存，您可以将此减少为仅两个查询：

SELECT fromid,toid FROM link WHERE toid IN (SELECT rowid FROM urllist)
和
SELECT fromid,count(*) FROM link WHERE fromid IN (SELECT rowid FROM urllist) GROUP BY fromid

然后，您可以循环遍历结果并构建inlinks、numoutlinks和pagerank。

您还可以受益于使用collections.defaultdict：

import collections
import itertools
def constant_factory(value):
    return itertools.repeat(value).next

以下代码将使inlinks变为一个字典的集合。使用集合非常适合，因为您只想要不同的URL。

inlinks=collections.defaultdict(set)

这使得pagerank成为一个默认值为1.0的字典：

pagerank=collections.defaultdict(constant_factory(1.0))

使用collections.defaultdict的优点是您无需预先初始化字典。

因此，综合起来，我建议的做法应该像这样：

import collections
def constant_factory(value):
    return itertools.repeat(value).next
def calculatepagerank2(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute("DROP TABLE IF EXISTS pagerank")
    self.con.execute("CREATE TABLE pagerank(urlid primary key,score)")
    self.con.execute("CREATE INDEX prankidx ON pagerank(urlid)")

    # initialize every url with a PageRank of 1.0
    self.con.execute("INSERT INTO pagerank SELECT rowid,1.0 FROM urllist")
    self.dbcommit()

    inlinks=collections.defaultdict(set)

    sql='''SELECT fromid,toid FROM link WHERE toid IN (SELECT rowid FROM urllist)'''
    for f,t in self.con.execute(sql):
        inlinks[t].add(f)

    numoutlinks={}
    sql='''SELECT fromid,count(*) FROM link WHERE fromid IN (SELECT rowid FROM urllist) GROUP BY fromid'''
    for f,c in self.con.execute(sql):
        numoutlinks[f]=c

    pagerank=collections.defaultdict(constant_factory(1.0))
    for i in range(iterations):
        print "Iteration %d" % i
        for urlid in inlinks:
            pr=0.15
            for link in inlinks[urlid]:
                linkpr=pagerank[link]
                linkcount=numoutlinks[link]
                pr+=0.85*(linkpr/linkcount)
            pagerank[urlid]=pr
    sql="UPDATE pagerank SET score=? WHERE urlid=?"
    args=((pagerank[urlid],urlid) for urlid in pagerank)
    self.con.executemany(sql, args)
    self.dbcommit()

- unutbu

0

我正在回答自己的问题，因为最终发现所有答案的混合对我来说效果最好：

    def calculatepagerank4(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute("drop table if exists pagerank")
    self.con.execute("create table pagerank(urlid primary key,score)")
    self.con.execute("create index prankidx on pagerank(urlid)")

    # initialize every url with a PageRank of 1.0
    self.con.execute("insert into pagerank select rowid,1.0 from urllist")
    self.dbcommit()

    inlinks={}
    numoutlinks={}
    pagerank={}

    for (urlid,) in self.con.execute("select rowid from urllist"):
        inlinks[urlid]=[]
        numoutlinks[urlid]=0
        # Initialize pagerank vector with 1.0
        pagerank[urlid]=1.0

    for src,dest in self.con.execute("select distinct fromid, toid from link"):
        inlinks[dest].append(src)
        numoutlinks[src]+=1          

    for i in range(iterations):
        print "Iteration %d" % i

        for urlid in pagerank:
            pr=0.15
            for link in inlinks[urlid]:
                linkpr=pagerank[link]
                linkcount=numoutlinks[link]
                pr+=0.85*(linkpr/linkcount)
            pagerank[urlid]=pr

    args=((pagerank[urlid],urlid) for urlid in pagerank)
    self.con.executemany("update pagerank set score=? where urlid=?" , args)
    self.dbcommit()

所以我按照allyourcode的建议替换了前两个循环，但是还使用了˜unutbu的解决方案中的executemany()。但与˜unutbu不同的是，我使用了生成器表达式来传递参数，以避免浪费太多内存，尽管使用列表推导式会稍微快一些。最终，这个程序比书中建议的程序快了100倍：

>>> cProfile.run("crawler.calculatepagerank4()")
     33512 function calls in 1.377 CPU seconds
Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.004    0.004    1.377    1.377 <string>:1(<module>)
     2    0.000    0.000    0.073    0.036 searchengine.py:27(dbcommit)
     1    0.693    0.693    1.373    1.373 searchengine.py:286(calculatepagerank4
 10432    0.011    0.000    0.011    0.000 searchengine.py:321(<genexpr>)
 23065    0.009    0.000    0.009    0.000 {method 'append' of 'list' objects}
     2    0.073    0.036    0.073    0.036 {method 'commit' of 'sqlite3.Connectio
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler
     6    0.379    0.063    0.379    0.063 {method 'execute' of 'sqlite3.Connecti
     1    0.209    0.209    0.220    0.220 {method 'executemany' of 'sqlite3.Conn
     1    0.000    0.000    0.000    0.000 {range}

需要注意以下几个问题：

如果您在构建SQL语句时使用字符串格式化的方式，而不是使用一个占位符“?”，那么您会丢失精度（例如，使用“？”得到的结果为2.9796095721920315，而使用“%f”则为2.9796100000000001）。
在默认的PageRank算法中，从一个页面到另一个页面的重复链接被视为单个链接。然而书中的解决方案并没有考虑此问题。
整本书中的算法存在缺陷：因为在每次迭代中，PageRank分数并未存储在第二个表中。这意味着，迭代结果取决于循环遍历的页面顺序，这可能会在多次迭代后大大改变结果。要解决这个问题，可以使用一个额外的表/字典来存储下一次迭代的PageRank，或者使用完全不同的算法，比如Power Iteration。

- asmaier

0

你的RAM足够容纳稀疏矩阵(fromid, toid)吗？这将允许进行大规模优化（具有大规模算法更改）。至少，将你现在在最内层循环中使用select count(*)缓存的(fromid, numlinks)存储在内存中应该会有所帮助（我想象中，如果你正在处理N个URL，则该缓存占用O(N)的空间，更有可能适合内存）。

- Alex Martelli

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- allyourcode · Accepted Answer

如果您有一个非常大的数据库（例如，记录数约为WWW中的页面数），则使用类似于书中建议的方式使用数据库是有意义的，因为您不可能将所有数据都存储在内存中。

如果您的数据集足够小，则可以通过不进行太多查询来改进第二个版本。尝试使用以下内容替换第一个循环：

for urlid, in self.con.execute('select rowid from urllist'):
    inlinks[urlid] = []
    numoutlinks[urlid] = 0
    pagerank[urlid] = 1.0

for src, dest in self.con.execute('select fromid, toid from link'):
    inlinks[dest].append(src)
    numoutlinks[src] += 1

这个版本只执行2次查询，而不是O(n^2)次查询。