I tried both approaches on a small test file of just 10 lines - parsing the whole file and slicing off the last N rows, versus loading all the lines but parsing only the last N:
In [1025]: timeit np.genfromtxt('stack38704949.txt',delimiter=',')[-5:]
1000 loops, best of 3: 741 µs per loop
In [1026]: %%timeit
      ...: with open('stack38704949.txt','rb') as f:
      ...:     lines = f.readlines()
      ...: np.genfromtxt(lines[-5:],delimiter=',')
1000 loops, best of 3: 378 µs per loop
This was marked as a duplicate of Efficiently Read last 'n' rows of CSV into DataFrame. The accepted answer there uses

from collections import deque

and collects the last N lines in that structure. It also uses StringIO to feed those lines to the parser, which is unnecessary complexity: genfromtxt takes input from anything that feeds it lines, so a list of lines is perfectly fine.
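As a self-contained sketch of that point (the data here is made up for illustration), genfromtxt will parse a plain list of lines directly:

```python
import numpy as np

# Lines as readlines() would return them (text mode here for simplicity)
lines = ['1,2,3\n', '4,5,6\n', '7,8,9\n', '10,11,12\n']

# Feed only the last 2 lines straight to the parser - no StringIO needed
arr = np.genfromtxt(lines[-2:], delimiter=',')
print(arr.shape)  # (2, 3)
```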
In [1031]: %%timeit
      ...: with open('stack38704949.txt','rb') as f:
      ...:     lines = deque(f,5)
      ...: np.genfromtxt(lines,delimiter=',')
1000 loops, best of 3: 382 µs per loop
This is essentially the same as readlines plus slicing. deque may have an advantage when the file is very large and keeping all of the lines becomes expensive. I don't think it saves any file-reading time, though; every line still has to be read one at a time.
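That memory point can be sketched as follows: a deque with a maxlen retains only the last N lines even though every line is still read once (in-memory data stands in for a large file here):

```python
from collections import deque
import io

# Simulate a 1000-line file; only the last 3 lines are held at any moment
f = io.StringIO('\n'.join(f'{i},{i*2}' for i in range(1000)) + '\n')
last3 = deque(f, maxlen=3)
print(list(last3))  # ['997,1994\n', '998,1996\n', '999,1998\n']
```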
The row_count plus skip_header approach times slower; it requires reading the file twice, and skip_header still has to read the lines one by one.
In [1046]: %%timeit
      ...: with open('stack38704949.txt',"r") as f:
      ...:     reader = csv.reader(f, delimiter=',')
      ...:     data = list(reader)
      ...:     row_count = len(data)
      ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')
The slowest run took 5.96 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 760 µs per loop
To get the row count we don't need csv.reader, though dropping it doesn't seem to save much time:
In [1048]: %%timeit
      ...: with open('stack38704949.txt',"r") as f:
      ...:     lines = f.readlines()
      ...: row_count = len(lines)
      ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')
1000 loops, best of 3: 736 µs per loop
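If you only need the count for skip_header, a generator expression avoids even holding the lines in memory - a sketch with in-memory data (the two-pass cost remains, so this only helps with memory, not time):

```python
import io

data = '1,2\n3,4\n5,6\n7,8\n9,10\n'

# First pass: count rows without storing them
with io.StringIO(data) as f:
    row_count = sum(1 for _ in f)

print(row_count)  # 5
# A second pass would then call genfromtxt with skip_header=row_count-n
```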