Reading txt files with multithreading in Python

Question

python multithreading text-files

28

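I'm trying to read files in Python (scan their lines and look for terms) and write out the results, say, a counter for each term. I need to do this for a large number of files (more than 3000). Can this be done with multithreading? If so, how?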

So the scenario is:

1. Read each file and scan its lines.
2. Write the counters for all the files I've read to the same output file.

The second question is whether this would improve the read/write speed.

Hope that's clear enough. Thanks, Ron.

- Ron D.

2 Answers

1

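Yes, it should be possible to do this in parallel.

However, achieving parallelism with multithreading in Python is difficult. For that reason, multiprocessing is the better default choice for doing things in parallel.

It's hard to say what kind of speedup you can expect. It depends on how much of the workload can be done in parallel (the more the better) and how much has to be done serially (the less the better).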
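
The trade-off between the parallel and the serial share of the work is usually written down as Amdahl's law. The answer does not name it, so treat the following as an illustrative sketch only; the function name and the sample fractions are made up for the example.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on the speedup when only part of the job runs in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time; the parallel part is split across workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == '__main__':
    # Hypothetical parallel fractions, just to show the trend.
    for fraction in (0.5, 0.9, 0.99):
        for workers in (2, 4, 8):
            print(fraction, workers, round(amdahl_speedup(fraction, workers), 2))
```

If only half of the work can run in parallel, two workers give at most about a 1.33x speedup, and adding more workers helps less and less.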

- NPE

1

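However, achieving parallelism with multithreading in Python is fairly difficult. This is because the CPython interpreter has a global interpreter lock (GIL) that allows only one thread to execute Python bytecode at a time. So even if you run Python code in several threads, only one of them is actually executing at any given moment. - Tagar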
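
The GIL limits bytecode execution, but CPython releases it while a thread is blocked on file I/O, so threads can still overlap the reads themselves. A minimal sketch of that, assuming the standard-library concurrent.futures module; the helper names, the worker count, and the temporary demo files are all hypothetical:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def count_lines(path):
    """Count the lines of one file; the blocking reads release the GIL."""
    with open(path, 'r') as inp:
        return path, sum(1 for _ in inp)


def count_lines_threaded(paths, max_workers=8):
    """Run count_lines over all paths on a thread pool (max_workers is an arbitrary choice)."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return dict(executor.map(count_lines, paths))


if __name__ == '__main__':
    # Hypothetical demo data: three small temporary files.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            path = os.path.join(tmp, 'file%d.txt' % i)
            with open(path, 'w') as out:
                out.write('line\n' * (i + 1))
            paths.append(path)
        print(count_lines_threaded(paths))
```

The per-line scanning and counting still runs under the GIL, so this helps only as far as the job is waiting on the disk rather than on the CPU.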

- Austin Marshall · Accepted Answer

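I agree with @aix, multiprocessing is definitely the way to go. Regardless, you will be I/O bound: you can only read so fast, no matter how many parallel processes you have running. But some speedup is easy to get.

Consider the following (input/ is a directory that contains several .txt files from Project Gutenberg).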

import os
import time
from multiprocessing import Pool

def process_file(name):
    ''' Process one file: count number of lines and words '''
    linecount = 0
    wordcount = 0
    with open(name, 'r') as inp:
        for line in inp:
            linecount += 1
            # split() without an argument splits on any run of whitespace,
            # so repeated spaces and the trailing newline add no empty "words"
            wordcount += len(line.split())

    return name, linecount, wordcount

def list_files(dirname):
    ''' Collect the paths of all files below dirname (os.path.walk is gone in Python 3) '''
    return [os.path.join(root, name)
            for root, _dirs, names in os.walk(dirname)
            for name in names]

def process_files_parallel(paths):
    ''' Process each file in parallel via Pool.map() '''
    with Pool() as pool:
        return pool.map(process_file, paths)

def process_files(paths):
    ''' Process each file serially via map() '''
    # map() is lazy in Python 3; list() forces the work to actually run
    return list(map(process_file, paths))

if __name__ == '__main__':
    paths = list_files('input/')

    start = time.time()
    process_files(paths)
    print("process_files()", time.time() - start)

    start = time.time()
    process_files_parallel(paths)
    print("process_files_parallel()", time.time() - start)

When I run this on my dual-core machine there is a noticeable speedup (but not 2x):

$ python process_files.py
process_files() 1.71218085289
process_files_parallel() 1.28905105591

If the files are small enough to fit in memory, and you have a lot of processing to do that isn't I/O bound, then you should see an even better improvement.
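
To connect this back to the scenario in the question (a counter per term, all written to one output file), here is a rough sketch rather than a definitive implementation. The TERMS tuple, the function names, the CSV layout, and the counts.csv file name are assumptions made for illustration. The workers only return their counts, and the parent process alone writes the output, so no locking around the file is needed.

```python
import csv
import os
from collections import Counter
from multiprocessing import Pool

# Hypothetical search terms; replace with the real ones.
TERMS = ('whale', 'ship', 'sea')


def count_terms(path):
    """Count each term in TERMS as a whitespace-separated, case-insensitive word."""
    counts = Counter()
    # errors='replace' keeps odd bytes from aborting the scan
    with open(path, 'r', errors='replace') as inp:
        for line in inp:
            # Punctuation attached to a word is not stripped in this sketch.
            for word in line.lower().split():
                if word in TERMS:
                    counts[word] += 1
    return path, counts


def write_counts(results, out_path):
    """Write one CSV row per file; only the calling process touches the output."""
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(('file',) + TERMS)
        for path, counts in results:
            writer.writerow([path] + [counts[term] for term in TERMS])


def count_all(paths, out_path):
    """Fan the files out to worker processes and collect the rows in the parent."""
    with Pool() as pool:
        # imap_unordered yields each result as soon as a worker finishes it
        write_counts(pool.imap_unordered(count_terms, paths), out_path)


if __name__ == '__main__':
    if os.path.isdir('input'):
        names = sorted(os.listdir('input'))
        count_all([os.path.join('input', name) for name in names], 'counts.csv')
```

Because imap_unordered returns results in completion order, the rows in the output are not sorted by file name; use pool.map instead if the order matters.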