Python script to concatenate all the files in a directory into one file

Question

python file copy

26

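I have written the following script to concatenate all the files in a directory into one single file.

Could this script be optimized in terms of:

idiomatic Python
time efficiency

Here is the snippet: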

import time, glob

outfilename = 'all_' + str((int(time.time()))) + ".txt"

filenames = glob.glob('*.txt')

with open(outfilename, 'wb') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            infile = readfile.read()
            for line in infile:
                outfile.write(line)
            outfile.write("\n\n")

- user1629366

9

To optimize for time, use "cat *.txt > all.txt" :) - w-m

Possible duplicate: combine multiple text files into one text file using Python - llb

6 Answers

4

Using Python 2.7, I did some "benchmark" testing of

outfile.write(infile.read())

vs

shutil.copyfileobj(readfile, outfile)

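I iterated over 20 .txt files ranging in size from 63 MB to 313 MB, with a combined size of about 2.6 GB. With both methods, normal read mode performed better than binary read mode, and shutil.copyfileobj was generally faster than outfile.write.

When comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant: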

outfile.write, binary mode: 43 seconds, on average.

shutil.copyfileobj, normal mode: 27 seconds, on average.

The output file had a final size of 2620 MB in normal read mode versus 2578 MB in binary read mode.
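
For anyone who wants to repeat this kind of comparison, here is a minimal timing sketch. It is not the code I ran: it assumes Python 3 rather than 2.7, and the helper names (concat_with_read, concat_with_copyfileobj, time_call, make_sample_files) and the tiny sample files are hypothetical, so the numbers it prints say nothing about multi-gigabyte inputs.

```python
import glob
import os
import shutil
import tempfile
import time


def concat_with_read(filenames, outfilename, read_mode="rb", write_mode="wb"):
    """Concatenate by reading each whole file into memory, then writing it."""
    with open(outfilename, write_mode) as outfile:
        for fname in filenames:
            with open(fname, read_mode) as readfile:
                outfile.write(readfile.read())


def concat_with_copyfileobj(filenames, outfilename, read_mode="rb", write_mode="wb"):
    """Concatenate by letting shutil copy each file in chunks."""
    with open(outfilename, write_mode) as outfile:
        for fname in filenames:
            with open(fname, read_mode) as readfile:
                shutil.copyfileobj(readfile, outfile)


def time_call(func, *args):
    """Return the wall-clock seconds taken by a single call to func."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def make_sample_files(directory, count=5, lines_per_file=1000):
    """Create small hypothetical input files and return their sorted paths."""
    for i in range(count):
        with open(os.path.join(directory, "part_%d.txt" % i), "w") as f:
            f.write(("line from file %d\n" % i) * lines_per_file)
    return sorted(glob.glob(os.path.join(directory, "part_*.txt")))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = make_sample_files(tmp)
        for func in (concat_with_read, concat_with_copyfileobj):
            # The .out suffix keeps the result out of the part_*.txt pattern.
            out = os.path.join(tmp, func.__name__ + ".out")
            print(func.__name__, time_call(func, files, out))
```

Pass read_mode="r" and write_mode="w" to time the text-mode variants. A single run is noisy, so repeat each measurement several times on realistically large files before drawing conclusions.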

- Stephen Miller

Interesting. What platform was that? - ellockie

I mostly work on two platforms: Linux Fedora 16 (various nodes) or Windows 7 Enterprise SP1 with an Intel Core(TM)2 Quad CPU Q9550, 2.83 GHz. I think it was the latter. - Stephen Miller

3

You can iterate over the lines of a file object directly, without reading the whole file into memory:

with open(fname, 'r') as readfile:
    for line in readfile:
        outfile.write(line)

- Brendan Long

2

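I was curious about how to improve performance, so I looked at the answers by Martijn Pieters and Stephen Miller.

I tried binary mode and text mode, each with and without shutil, merging 270 files.

Text mode -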

def using_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                outfile.write(readfile.read())

Binary mode -

def using_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                outfile.write(readfile.read())

Running times for binary mode -

Shutil - 20.161773920059204
Normal - 17.327500820159912

Running times for text mode -

Shutil - 20.47757601737976
Normal - 13.718038082122803

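It looks like shutil takes about the same time in both modes, while the plain read/write version is faster in text mode than in binary mode.

OS: Mac OS 10.14 Mojave. MacBook Air 2017.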

- Ravi Kumar Gupta

2

There is no need to use that many variables.

with open(outfilename, 'w') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            outfile.write(readfile.read() + "\n\n")

- MGP

1

The fileinput module provides a natural way to iterate over multiple files:

for line in fileinput.input(glob.glob("*.txt")):
    outfile.write(line)
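
The two lines above assume outfile is already open. A self-contained version might look like the sketch below, assuming Python 3; the function name concat_with_fileinput and the step that skips the output file are my own additions for illustration.

```python
import fileinput
import glob
import os


def concat_with_fileinput(pattern, outfilename):
    """Write every line of every file matching pattern into outfilename."""
    # Build the list first and skip the output file so it is never read into itself.
    filenames = [
        name for name in sorted(glob.glob(pattern))
        if os.path.abspath(name) != os.path.abspath(outfilename)
    ]
    if not filenames:
        # fileinput.input() reads standard input when given an empty list,
        # so create an empty output file and stop here instead.
        with open(outfilename, "w"):
            pass
        return
    with open(outfilename, "w") as outfile, fileinput.input(filenames) as lines:
        for line in lines:
            outfile.write(line)
```

Calling concat_with_fileinput("*.txt", "all.txt") would merge the .txt files in the current directory in sorted order.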

- iruvar

1

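This would be even better if it weren't limited to reading one line at a time. - Marcin

@Marcin, that's right. I used to think this was a cool solution - until I saw Martijn Pieters' shutil.copyfileobj tour de force. - iruvar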

- Martijn Pieters · Accepted Answer

Use shutil.copyfileobj to copy the data:

import glob
import shutil

with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('*.txt'):
        if filename == outfilename:
            # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)

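shutil reads from the readfile object in chunks and writes them directly to the outfile file object. Do not use readline() or iterate over the file line by line, since you do not need the overhead of finding line endings.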
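
To make the chunked copy concrete, here is a rough sketch of the kind of loop involved. It is not shutil's actual source; the function name copy_in_chunks and the CHUNK_SIZE value are hypothetical and chosen only for illustration.

```python
import io

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size, chosen for illustration


def copy_in_chunks(readfile, outfile, chunk_size=CHUNK_SIZE):
    """Copy readfile to outfile one fixed-size chunk at a time."""
    while True:
        chunk = readfile.read(chunk_size)
        if not chunk:  # an empty result means end of file
            break
        outfile.write(chunk)


# In-memory usage example; real code would pass open file objects instead.
source = io.BytesIO(b"x" * 200000)
target = io.BytesIO()
copy_in_chunks(source, target)
assert target.getvalue() == source.getvalue()
```

No line endings are ever searched for, and at most one chunk is held in memory at a time.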

Use the same mode for both reading and writing; this is especially important in Python 3. I have used binary mode for both here.
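
As a small illustration of why the modes need to match in Python 3, the sketch below writes text to a file opened in binary mode, which is what happens when the input is opened with 'r' and the output with 'wb'. The function name write_text_to_binary_file is hypothetical.

```python
import os
import tempfile


def write_text_to_binary_file(path):
    """Try to write str data to a file opened in binary mode (Python 3)."""
    with open(path, "wb") as outfile:
        try:
            # A file opened with 'r' yields str, but a 'wb' file accepts only bytes.
            outfile.write("text read from a file opened with 'r'\n")
        except TypeError as exc:
            return "TypeError: %s" % exc
    return "no error"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(write_text_to_binary_file(os.path.join(tmp, "out.bin")))
```

Opening both files in binary mode, or both in text mode, avoids the mismatch.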