I have tried looking through other answers, but I am still not sure about the right way to do this. I have a number of very large .csv files (each possibly a gigabyte or more). I want to first get their column labels, since they are not all the same, and then extract some of those columns according to criteria the user chooses. Before starting the extraction part, I ran a simple test to see what the fastest way to parse these files is. Here is my code:
import csv
import mmap
import time

def mmapUsage():
    start = time.time()
    with open("csvSample.csv", "r+b") as f:
        # memory-map the file; size 0 means the whole file
        mapInput = mmap.mmap(f.fileno(), 0)
        # read content via standard file methods
        L = list()
        for s in iter(mapInput.readline, ""):
            L.append(s)
        print "List length: ", len(L)
        #print "Sample element: ", L[1]
        mapInput.close()
    end = time.time()
    print "Time for completion", end - start

def fileopenUsage():
    start = time.time()
    fileInput = open("csvSample.csv")
    M = list()
    for s in fileInput:
        M.append(s)
    print "List length: ", len(M)
    #print "Sample element: ", M[1]
    fileInput.close()
    end = time.time()
    print "Time for completion", end - start

def readAsCsv():
    X = list()
    start = time.time()
    spamReader = csv.reader(open('csvSample.csv', 'rb'))
    for row in spamReader:
        X.append(row)
    print "List length: ", len(X)
    #print "Sample element: ", X[1]
    end = time.time()
    print "Time for completion", end - start
My results were:
=======================
Populating list from Mmap
List length: 1181220
Time for completion 0.592000007629
=======================
Populating list from Fileopen
List length: 1181220
Time for completion 0.833999872208
=======================
Populating list by csv library
List length: 1181220
Time for completion 5.06700015068
It seems that the csv library most people use is considerably slower than the other approaches. Maybe it will prove faster later, once I actually start extracting data from the csv files, but I can't be sure of that yet. Any suggestions or tips before I start implementing this? Thanks a lot!
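For reference, here is a rough sketch of the extraction step I am planning. The function name, the wantedColumns list, and the condition callable are only placeholders for whatever the user ends up choosing, not final code:

    import csv

    def extractColumns(filename, wantedColumns, condition):
        # Read the header row first, then keep only the requested columns
        # for rows that satisfy the user-supplied condition.
        with open(filename, 'rb') as f:
            reader = csv.DictReader(f)
            print "Columns in %s: %s" % (filename, reader.fieldnames)
            extracted = []
            for row in reader:
                if condition(row):
                    extracted.append([row[col] for col in wantedColumns])
        return extracted

    # Example (placeholder criteria): keep 'id' and 'value' where 'value' is non-empty
    # rows = extractColumns('csvSample.csv', ['id', 'value'], lambda r: r['value'])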
You should use the timeit module for benchmarks like this. - nfirvine
Use [] to create an empty list, rather than list(). - nfirvine
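Following up on the timeit suggestion, a minimal sketch of how the benchmark could be run. It assumes the three functions above are importable from a module, hypothetically named csvbench here:

    import timeit

    # Time each parser over 3 repetitions; the module name 'csvbench'
    # is an assumption about how the code above would be packaged.
    for func in ("mmapUsage", "fileopenUsage", "readAsCsv"):
        t = timeit.timeit("%s()" % func,
                          setup="from csvbench import %s" % func,
                          number=3)
        print "%s: %.3f seconds for 3 runs" % (func, t)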