Efficient lookups across large multiple lists in Python


I am trying to put together an example of using Python to manipulate a large database built from CSV tables.

I would like to find a way to simulate efficient indexed lookups on tables that are spread across several list() objects.

The following example takes 24 seconds on a 3.2 GHz Core i5.

#!/usr/bin/env python
import csv
MAINDIR = "../"
pf = open(MAINDIR + "atp_players.csv")
players = [p for p in csv.reader(pf)]
rf = open(MAINDIR + "atp_rankings_current.csv")
rankings = [r for r in csv.reader(rf)]
for i in rankings[:10]:
    player = filter(lambda x: x[0]==i[2],players)[0]
    print "%s(%s),(%s) Points: %s"%(player[2],player[5],player[3],i[3])
This is run against this data set.
I am hoping there is a more efficient, more Pythonic way of doing this.

Estimating 5000 rows per file, your computer is handling 5000 * 47 / 24 = about 10,000 rows per second. I know you want more speed, but maybe the data set is simply too large. - EMBLEM
Why not use pandas? A video on how to use pandas can be found here. - Pedro Salgado (see the pandas sketch after the comments)
@PadraicCunningham, I believe the discussion we had below makes his logic clear (at least for rankings): he builds the list so that he can iterate over only a subset of the rankings. (Your approach is better.) - abcd
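Following up on the pandas suggestion, a minimal sketch (not from the original thread) might look like the following, assuming the CSV files have no header row and the column positions used in the question's code:

import pandas as pd

# header=None gives integer column labels matching the positions used in
# the question: player id is column 0 in the players file and column 2
# in the rankings file
players = pd.read_csv("atp_players.csv", header=None)
rankings = pd.read_csv("atp_rankings_current.csv", header=None)

# merge performs a hash join, avoiding a linear scan per ranking row
merged = rankings.head(10).merge(players, left_on=2, right_on=0)
print(merged)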
3 Answers

You can use itertools.islice instead of reading in all the rows, together with itertools.ifilter:
import csv
from itertools import islice, ifilter

MAINDIR = "../"
with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    # only get the first ten rows using islice
    for i in islice(rankings, None, 10):
        # ifilter won't create a list, it yields values on the fly
        player = next(ifilter(lambda x: x[0] == i[2], players), "")

Not sure exactly what filter(lambda x: x[0]==i[2], players)[0] is meant to be doing; you seem to be searching the whole players list every time and keeping only the first element. It may pay to sort the list once on the first element and use binary search, or to build a dict with the first element as the key and the row as the value, then simply do lookups.

import csv
from itertools import islice, ifilter
from collections import OrderedDict

MAINDIR = "../"
with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = OrderedDict((row[0], row) for row in csv.reader(pf))
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        # now constant-time work getting the row, as opposed to O(n)
        player = players.get(i[2])

You will have to decide what to use as a default value, if one is needed at all.

If you have repeated elements at the start of the rows but only want to keep the first occurrence:

with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        if key in players:
            continue
        players[key] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        player = players.get(i[2])

Output:

Djokovic(SRB),(R) Points: 11360
Federer(SUI),(R) Points: 9625
Nadal(ESP),(L) Points: 6585
Wawrinka(SUI),(R) Points: 5120
Nishikori(JPN),(R) Points: 5025
Murray(GBR),(R) Points: 4675
Berdych(CZE),(R) Points: 4600
Raonic(CAN),(R) Points: 4440
Cilic(CRO),(R) Points: 4150
Ferrer(ESP),(R) Points: 4045
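As an aside, the binary-search route mentioned above is not shown in this answer; a minimal sketch of it with the standard bisect module might look like the following, assuming the ids compare consistently as strings:

import csv
from bisect import bisect_left
from itertools import islice

MAINDIR = "../"
with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    # sort once on the id column so the list can be binary searched
    players = sorted(csv.reader(pf), key=lambda row: row[0])
    keys = [row[0] for row in players]
    for i in islice(csv.reader(rf), None, 10):
        # bisect_left returns an insertion point, so verify the exact match
        pos = bisect_left(keys, i[2])
        player = players[pos] if pos < len(keys) and keys[pos] == i[2] else None

This trades an O(n log n) sort up front for an O(log n) lookup per ranking row; the dict approach below is still simpler and gives constant-time lookups.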

Timing the code for ten players, ifilter is the fastest, but we will see the dict win when we increase the number of rankings, and see just how badly your own code scales:

In [33]: %%timeit
MAINDIR = "tennis_atp-master/"
pf = open("/tennis_atp-master/atp_players.csv")
players = [p for p in csv.reader(pf)]
rf = open("/tennis_atp-master/atp_rankings_current.csv")
rankings = [r for r in csv.reader(rf)]
for i in rankings[:10]:
    player = filter(lambda x: x[0]==i[2],players)[0]
   ....: 
10 loops, best of 3: 123 ms per loop

In [34]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    # only get the first ten rows using islice
    for i in islice(rankings, None, 10):
        # ifilter won't create a list, it yields values on the fly
        player = next(ifilter(lambda x: x[0] == i[2], players), "")
   ....: 
10 loops, best of 3: 43.6 ms per loop

In [35]: %%timeit                           
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        if key in players:
            continue
        players[row[0]] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        player = players.get(i[2])
        pass
   ....: 
10 loops, best of 3: 50.7 ms per loop

Now with 100 players you can see the dict is as fast as it was for 10; the cost of building the dict has been offset by the constant-time lookups:

In [38]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    # only get the first hundred rows using islice
    for i in islice(rankings, None, 100):
        # ifilter won't create a list, it yields values on the fly
        player = next(ifilter(lambda x: x[0] == i[2], players),"")
   ....: 
10 loops, best of 3: 120 ms per loop

In [39]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        if key in players:
            continue
        players[row[0]] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 100):
        player = players.get(i[2])
        pass
   ....: 
10 loops, best of 3: 50.7 ms per loop

In [40]: %%timeit
MAINDIR = "tennis_atp-master/"
pf = open("/tennis_atp-master/atp_players.csv")
players = [p for p in csv.reader(pf)]
rf = open("/tennis_atp-master/atp_rankings_current.csv")
rankings = [r for r in csv.reader(rf)]
for i in rankings[:100]:
    player = filter(lambda x: x[0]==i[2],players)[0]
   ....: 
1 loops, best of 3: 806 ms per loop

For 250 players:

# your code
1 loops, best of 3: 1.86 s per loop

# dict
10 loops, best of 3: 50.7 ms per loop

# ifilter
10 loops, best of 3: 483 ms per loop

A final test, looping over the whole of the rankings:
# your code

1 loops, best of 3: 2min 40s per loop

# dict 
10 loops, best of 3: 67 ms per loop

# ifilter
1 loops, best of 3: 1min 3s per loop

As we loop over more of the rankings, the dict option completely outstrips the others in runtime efficiency and scales extremely well.


I was going to suggest enumerating the reader for the iteration and then breaking out -- this is better. - jedwards (see the sketch after the comments)
Why store all of rankings in memory when you are iterating over only a small subset of it? - abcd
@dbliss, where do I store all the rankings? - Padraic Cunningham
@dbliss, I don't think you really understand what a csv.reader object is. - Padraic Cunningham
@dbliss csv.reader returns an iterator over the file, so it only consumes data from the file as needed, similar to lazy evaluation in functional and related languages. When you say "csv.reader returns something, and not all of what it returns is needed", that implies an assumption that it fetches the data up front, which it does not. The actual reader object itself is very simple and has nothing to do with the data beyond iterating over it on demand. - ely
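An easy way to see this laziness for yourself: csv.reader accepts any iterable of strings, so feeding it a generator that reports each line it parses shows that nothing is prefetched. The generator below is made up for illustration:

import csv

def lines():
    # a made-up stand-in for a file; csv.reader accepts any iterable of strings
    for text in ["a,1", "b,2", "c,3"]:
        print("parsing %r" % text)  # runs only when the reader advances
        yield text

reader = csv.reader(lines())
next(reader)  # prints "parsing 'a,1'" only -- no rows are prefetched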
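For reference, a sketch of the enumerate-and-break pattern jedwards mentions above, using the file path from the question:

import csv

with open("../atp_rankings_current.csv") as rf:
    for n, row in enumerate(csv.reader(rf)):
        if n == 10:
            break  # stop consuming the file after the first ten rows
        # ... process row here ...

islice expresses the same intent more directly, which is why the answer above prefers it.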

Consider putting your data into an SQLite database. This meets your requirement of using only Python, since SQLite is built into the standard Python library and supported by (nearly) every Python interpreter. SQLite is a database library that lets you work with your data using SQL syntax, and it provides features such as indexes and foreign-key relationships.
If you need to run more than one query against the data, doing some precomputation (i.e. indexing and normalizing the data) is the most sensible route.
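As a rough illustration of this suggestion, the sketch below loads the two CSV files into an in-memory SQLite database and joins them. The table and column names are made up, and it assumes six columns in the players file and the column positions used in the question:

import csv
import sqlite3

MAINDIR = "../"
conn = sqlite3.connect(":memory:")  # use a file path instead to persist it
conn.execute("CREATE TABLE players (id TEXT PRIMARY KEY, c1, name, hand, c4, country)")
conn.execute("CREATE TABLE rankings (c0, c1, player_id TEXT, points)")

with open(MAINDIR + "atp_players.csv") as pf:
    conn.executemany("INSERT OR IGNORE INTO players VALUES (?, ?, ?, ?, ?, ?)",
                     (row[:6] for row in csv.reader(pf)))
with open(MAINDIR + "atp_rankings_current.csv") as rf:
    conn.executemany("INSERT INTO rankings VALUES (?, ?, ?, ?)",
                     (row[:4] for row in csv.reader(rf)))

# the PRIMARY KEY gives players an index, so each joined lookup is fast
for name, hand, country, points in conn.execute(
        """SELECT p.name, p.hand, p.country, r.points
           FROM rankings r JOIN players p ON p.id = r.player_id
           LIMIT 10"""):
    print("%s(%s),(%s) Points: %s" % (name, country, hand, points))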


This code doesn't take long to run as posted, so I assume you are actually iterating over more than 10 rankings; it does take a long time when I iterate over all of them. If that is the effect you are after, then a dictionary will shorten the search time: for a little overhead setting up the dictionary, you can search it very quickly. Here is my modified version of your for loop:

# build an index: player id -> position of that player's row
play_dict = {}
for index, player in enumerate(players):
    play_dict[player[0]] = index
for i in rankings[:10]:
    # constant-time lookup instead of scanning the whole list
    player = players[play_dict[i[2]]]

With this code, all of the rankings are processed virtually instantly.

