You can use itertools.islice instead of reading all the lines, together with itertools.ifilter:
import csv
from itertools import islice, ifilter
MAINDIR = "../"
with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        player = next(ifilter(lambda x: x[0] == i[2], players), "")
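The snippet above targets Python 2. On Python 3, itertools.ifilter no longer exists and the built-in filter is already lazy, so the same idea might look roughly like the sketch below. The in-memory CSV text and its column layout are made-up stand-ins for the two files, used only so the example runs on its own.

```python
import csv
import io
from itertools import islice

# Hypothetical sample data standing in for atp_players.csv and
# atp_rankings_current.csv; the real files are not reproduced here.
players_csv = io.StringIO(
    "104925,Novak,Djokovic,R,19870522,SRB\n"
    "103819,Roger,Federer,R,19810808,SUI\n"
)
rankings_csv = io.StringIO(
    "20150406,1,104925,11360\n"
    "20150406,2,103819,9625\n"
    "20150406,3,999999,6585\n"
)

players = list(csv.reader(players_csv))
rankings = csv.reader(rankings_csv)

matches = []
for ranking in islice(rankings, 10):  # read at most the first 10 ranking rows
    # filter() is lazy on Python 3, so next() stops at the first matching row
    # and falls back to "" when no player has that id.
    player = next(filter(lambda row: row[0] == ranking[2], players), "")
    matches.append(player)
```

Each lookup is still a linear scan of players; it just stops early at the first hit instead of building the full filtered list.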
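Not sure what filter(lambda x: x[0]==i[2],players)[0] is doing; you seem to be searching the whole players list on every iteration and keeping only the first element. It might pay to sort the list once with the first element as the key and use a bisection search, or to build a dict with the first element as the key and the row as the value, and then simply do lookups.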
import csv
from itertools import islice, ifilter
from collections import OrderedDict
MAINDIR = "../"
with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = OrderedDict((row[0], row) for row in csv.reader(pf))
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        player = players.get(i[2])
You will need to decide what default value to use, if any.
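For comparison, the sort-once-and-bisect alternative could be sketched as follows. The sample rows and the find_player helper are hypothetical names introduced for illustration, and the keys are compared as strings, exactly as csv.reader would return them.

```python
import bisect

# Hypothetical sample rows; in the real code these would come from csv.reader.
players = [
    ["104925", "Novak", "Djokovic"],
    ["103819", "Roger", "Federer"],
    ["104745", "Rafael", "Nadal"],
]

# Sort once by the first column and keep a parallel list of keys to bisect on.
sorted_players = sorted(players, key=lambda row: row[0])
keys = [row[0] for row in sorted_players]


def find_player(player_id, default=None):
    """Binary-search the sorted keys and return the matching row or default."""
    idx = bisect.bisect_left(keys, player_id)
    if idx < len(keys) and keys[idx] == player_id:
        return sorted_players[idx]
    return default


federer = find_player("103819")
missing = find_player("000000", default="")
```

Each lookup takes O(log n) comparisons after the one-off sort. The dict approach is usually simpler and faster for exact-match lookups, so this is mainly of interest when the data has to stay sorted anyway.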
If you have repeated elements at the start of each row but only want to return the first occurrence:
with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        if key in players:
            continue
        players[key] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        player = players.get(i[2])
Output:
Djokovic(SRB),(R) Points: 11360
Federer(SUI),(R) Points: 9625
Nadal(ESP),(L) Points: 6585
Wawrinka(SUI),(R) Points: 5120
Nishikori(JPN),(R) Points: 5025
Murray(GBR),(R) Points: 4675
Berdych(CZE),(R) Points: 4600
Raonic(CAN),(R) Points: 4440
Cilic(CRO),(R) Points: 4150
Ferrer(ESP),(R) Points: 4045
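As a side note, the keep-the-first-occurrence loop can be written a little more compactly with dict.setdefault, which only stores a value when the key is not already present. This is a sketch on made-up in-memory rows (the duplicate row is invented for the example), not a change to the timings discussed here.

```python
import csv
import io

# Hypothetical sample data with a deliberately duplicated first column.
players_csv = io.StringIO(
    "104925,Novak,Djokovic\n"
    "103819,Roger,Federer\n"
    "104925,Duplicate,Row\n"
)

players = {}
for row in csv.reader(players_csv):
    # setdefault leaves an existing entry untouched, so the first row
    # seen for each key is the one that is kept.
    players.setdefault(row[0], row)

first = players.get("104925")
```

The behaviour is the same as the explicit `if key in players: continue` version; which one reads better is a matter of taste.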
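Timing the code for ten players shows ifilter to be the fastest, but as we increase the number of rankings we will see the dict win, and just how badly your code scales: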
In [33]: %%timeit
MAINDIR = "tennis_atp-master/"
pf = open("/tennis_atp-master/atp_players.csv")
players = [p for p in csv.reader(pf)]
rf = open("/tennis_atp-master/atp_rankings_current.csv")
rankings = [r for r in csv.reader(rf)]
for i in rankings[:10]:
    player = filter(lambda x: x[0]==i[2],players)[0]
   ....:
10 loops, best of 3: 123 ms per loop
In [34]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        player = next(ifilter(lambda x: x[0] == i[2], players), "")
   ....:
10 loops, best of 3: 43.6 ms per loop
In [35]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        if key in players:
            continue
        players[row[0]] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        player = players.get(i[2])
        pass
   ....:
10 loops, best of 3: 50.7 ms per loop
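Now with 100 players you will see the dict is just as fast as it was for 10. The one-off cost of building the dict is offset by constant-time lookups, whereas the other two approaches scan the players list again for every ranking: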
In [38]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 100):
        player = next(ifilter(lambda x: x[0] == i[2], players), "")
   ....:
10 loops, best of 3: 120 ms per loop
In [39]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        if key in players:
            continue
        players[row[0]] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 100):
        player = players.get(i[2])
        pass
   ....:
10 loops, best of 3: 50.7 ms per loop
In [40]: %%timeit
MAINDIR = "tennis_atp-master/"
pf = open("/tennis_atp-master/atp_players.csv")
players = [p for p in csv.reader(pf)]
rf = open("/tennis_atp-master/atp_rankings_current.csv")
rankings = [r for r in csv.reader(rf)]
for i in rankings[:100]:
    player = filter(lambda x: x[0]==i[2],players)[0]
   ....:
1 loops, best of 3: 806 ms per loop
For 250 players:
# your code
1 loops, best of 3: 1.86 s per loop
# dict
10 loops, best of 3: 50.7 ms per loop
# ifilter
10 loops, best of 3: 483 ms per loop
The final test, looping over the whole rankings file:
# your code
1 loops, best of 3: 2min 40s per loop
# dict
10 loops, best of 3: 67 ms per loop
# ifilter
1 loops, best of 3: 1min 3s per loop
As we loop over more of the rankings, you can see that the dict option is by far the most efficient at runtime and scales extremely well.
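If you want to check the scaling behaviour yourself without the ATP files, a rough Python 3 harness along these lines may help. The data is synthetic, the list sizes are arbitrary assumptions, and the function names are made up for the example, so the absolute numbers will not match the timings above. Only the trend is of interest.

```python
import timeit

# Synthetic stand-ins for the players and rankings rows (hypothetical sizes).
players = [[str(i), "first%d" % i, "last%d" % i] for i in range(5000)]
rankings = [["20150406", str(rank), str(5000 - rank)] for rank in range(1, 201)]


def linear_scan(n):
    """Look up the first n rankings by scanning the players list each time."""
    return [
        next(filter(lambda row: row[0] == r[2], players), "")
        for r in rankings[:n]
    ]


def dict_lookup(n):
    """Build the dict once (first occurrence wins), then do O(1) lookups."""
    index = {}
    for row in players:
        index.setdefault(row[0], row)
    return [index.get(r[2], "") for r in rankings[:n]]


for n in (10, 100, 200):
    scan_time = timeit.timeit(lambda: linear_scan(n), number=1)
    dict_time = timeit.timeit(lambda: dict_lookup(n), number=1)
    print("n=%d  scan: %.4fs  dict: %.4fs" % (n, scan_time, dict_time))
```

The scan time should grow roughly in proportion to n, while the dict time stays dominated by the fixed cost of building the index.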