Iterating over a directory

Question

Iterating over a directory

python linux

4

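I am looking for a way to iterate over a directory that contains hundreds of thousands of files. Using os.listdir is very slow, because the function builds the complete list of entries for the given path before it returns anything.

Are there any faster options?

Note: whoever downvoted this has clearly never faced this situation.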

- jldupont

https://dev59.com/0nVD5IYBdhLWcg3wAWkO - squiguy

1

Possible duplicate of: List files in a folder as a stream to begin process immediately. - Nemo

1

@squiguy: the question you linked to is not what I am after. - jldupont

How quickly does ls -U start returning results? Since it does not need to sort the files, it could feed them to you through a subprocess pipe. - John La Rooy
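A minimal sketch of what the comment above suggests, assuming GNU ls (where -U disables sorting) and Python 3. The function name iter_names_unsorted is hypothetical, and names that contain newlines would be split incorrectly by this approach:

```python
import os
import subprocess

def iter_names_unsorted(directory):
    """Yield entry names from `ls -U`, one at a time, as they arrive on the pipe."""
    proc = subprocess.Popen(
        ["ls", "-U", "--", directory],  # "--" guards against names starting with "-"
        stdout=subprocess.PIPE,
    )
    try:
        for raw_line in proc.stdout:
            # ls prints one name per line when writing to a pipe;
            # dotfiles are hidden by default.
            yield os.fsdecode(raw_line.rstrip(b"\n"))
    finally:
        proc.stdout.close()
        proc.wait()

if __name__ == "__main__":
    for name in iter_names_unsorted("."):
        print(name)
```

Because the function is a generator, the caller can start working on the first names while ls is still reading the rest of the directory.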

1

Possibly a duplicate of the question about partial directory listings. - unutbu

2 Answers

0

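What are you doing with each file in the directory? I don't think there is a real alternative to os.listdir, but depending on what you are doing, you may be able to process the files in parallel. For example, we could use Pool from the multiprocessing library to spawn more Python processes, and have each process iterate over a smaller subset of the files.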

http://docs.python.org/library/multiprocessing.html

This is a bit rough, but I think it gets the point across...

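import os
from multiprocessing import Pool

def work(subset_of_files):
    # Each worker process handles one subset of the paths.
    for path in subset_of_files:
        with open(path, 'rb') as f:
            f.read()  # read file, do work
    return "data"

if __name__ == "__main__":
    files = [name for name in os.listdir(".") if os.path.isfile(name)]
    # Split the listing into three roughly equal subsets, one per worker.
    subsets = [files[i::3] for i in range(3)]
    with Pool(3) as p:
        results = p.map(work, subsets)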

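The general idea is to get the list of files from os.listdir, but instead of going over more than 100,000 files one by one, split the 100,000 files into 20 lists of 5,000 files each and handle each list in its own process. One benefit of this approach is that it takes advantage of the current trend toward multi-core systems.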
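The split described above can be sketched as follows. This is only an illustration: the helper name split_into_chunks and the generated file names are hypothetical, standing in for the result of os.listdir.

```python
def split_into_chunks(items, chunk_size):
    """Return consecutive slices of `items`, each at most `chunk_size` long."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

# Hypothetical stand-in for the result of os.listdir on a large directory.
file_names = ["file_%06d.dat" % n for n in range(100000)]

chunks = split_into_chunks(file_names, 5000)
print(len(chunks), len(chunks[0]))  # prints "20 5000"
```

Each element of chunks could then be passed to a worker, for example through Pool.map.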

- Wulfram

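I think the OP's problem is that the call to os.listdir itself takes a long time because of the number of entries in the directory. In that case the map cannot start until the whole list has been fetched. - jdi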

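Thanks, I misread the question a bit. I think you could still use the approach I outlined above even in that case. Instead of fetching the file list once and splitting it among the workers, you could have each worker fetch an equal subset of the files in the directory (possibly through a direct shell call). I just think that when we are talking about hundreds of thousands of files, divide and conquer is a good approach, and you would do it with processes because of the global interpreter lock. - Wulfram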

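Disk IO is usually not a problem for the GIL, so threads would still work fine. The GIL is not held during blocking system calls. But even with a divide-and-conquer approach... how do you split up the files in the directory ahead of time? Either way the directory listing has to happen, and that is the bottleneck. On the worker side, what you do with the paths is really a second step. - jdi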
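As a rough sketch of the point about threads, the example below feeds paths to a few worker threads through a bounded queue while the directory is still being read. It assumes Python 3.6 or later and uses os.scandir, which is not mentioned in this thread, for the lazy listing. The names file_size, worker, and process_directory are hypothetical, and stat-ing each file is just placeholder work.

```python
import os
import queue
import threading

def file_size(path):
    """Placeholder per-file work: return the file's size in bytes."""
    return os.stat(path).st_size

def worker(paths, results, lock):
    """Take paths off the queue until the None sentinel arrives."""
    while True:
        path = paths.get()
        if path is None:
            break
        try:
            size = file_size(path)
        except OSError:
            continue  # file vanished or is unreadable; skip it
        with lock:
            results.append((path, size))

def process_directory(directory, num_threads=4):
    paths = queue.Queue(maxsize=1000)  # bounded, so listing cannot run far ahead
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(paths, results, lock))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    # Workers start on the first entries while the listing is still in progress.
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                paths.put(entry.path)
    for _ in threads:
        paths.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(len(process_directory(".")))
```

The listing still happens once, in the main thread, but it no longer has to finish before the per-file work begins.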

- jdi · Accepted Answer

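This question was called a duplicate in the comments of:
List files in a folder as a stream to begin process immediately. However, I found the example there only half working. Here is the fixed version that works for me: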

from ctypes import CDLL, c_int, c_uint8, c_uint16, c_uint32, c_char, c_char_p, Structure, POINTER
from ctypes.util import find_library

import os

class c_dir(Structure):
    pass

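class c_dirent(Structure):
    # NOTE: this layout matches the legacy BSD/macOS struct dirent (32-bit inode).
    # glibc on Linux uses d_ino, d_off, d_reclen, d_type, d_name[256] instead,
    # so adjust these fields to match your platform's <dirent.h>.
    _fields_ = [
        ("d_fileno", c_uint32),
        ("d_reclen", c_uint16),
        ("d_type", c_uint8),
        ("d_namlen", c_uint8),
        ("d_name", c_char * 4096),
        # proper way of getting platform MAX filename size?
        # ("d_name", c_char * (os.pathconf('.', 'PC_NAME_MAX')+1) )
    ]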

c_dirent_p = POINTER(c_dirent)
c_dir_p = POINTER(c_dir)

c_lib = CDLL(find_library("c"))
opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p

# readdir is sufficient as long as each directory stream is read from a
# single thread; readdir_r is deprecated in current glibc.
readdir = c_lib.readdir
readdir.argtypes = [c_dir_p]
readdir.restype = c_dirent_p

closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int

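def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    # c_char_p expects bytes on Python 3, so encode the path first.
    dir_p = opendir(os.fsencode(path))
    if not dir_p:
        raise OSError("opendir failed for %r" % (path,))
    try:
        while True:
            p = readdir(dir_p)
            if not p:
                break
            # d_name comes back as bytes; decode it with the filesystem encoding.
            name = os.fsdecode(p.contents.d_name)
            if name not in (".", ".."):
                yield name
    finally:
        closedir(dir_p)


if __name__ == "__main__":
    for name in listdir("."):
        print(name)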
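
For comparison, on Python 3.6 or later a similar lazy iteration is possible without ctypes through the standard library's os.scandir. This is a minimal sketch rather than part of the answer above; iter_file_names is a hypothetical helper name.

```python
import os

def iter_file_names(path):
    """Yield the names of regular files in `path` lazily, without building a full list."""
    with os.scandir(path) as entries:
        for entry in entries:
            # scandir never yields "." or "..", so only the file-type check is needed.
            if entry.is_file(follow_symlinks=False):
                yield entry.name

if __name__ == "__main__":
    for name in iter_file_names("."):
        print(name)
```

Like the ctypes generator, this yields names as the directory is read, so processing can begin before the whole directory has been scanned.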