Efficiently extract a single file from a .tar archive

Question

5

I have a 2 GB .tgz file.

I want to extract a single 2 KB .txt file from it.

I have the following code:

import tarfile
from contextlib import closing

with closing(tarfile.open("myfile.tgz")) as tar:
    subdir_and_files = [
        tarinfo for tarinfo in tar.getmembers()
        if tarinfo.name.startswith("myfile/first/second/text.txt")
        ]
    print(subdir_and_files)
    tar.extractall(members=subdir_and_files)

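The problem is that I have to wait at least a minute before I get the extracted file. It seems that extractall goes through every file in the archive but only saves the one I need.

Is there a more efficient way to do this?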

- MIDE11

1

Maybe https://github.com/devsnd/tarindexer can help you. I haven't had time to try it myself. - Paul Rooney

1

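When you call getmembers(), tarfile scans the entire file. Try iterating over the tarfile object instead. However, if the target file is near the end, you may still end up scanning the whole file. Tar files have no random-access index. - dhke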

1 Answer

- Ярослав Рахматуллин · Answer 1

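No.

The tar format is not well suited to quickly extracting a single file. In most cases this is made worse because the tar file sits inside a compressed stream. I would suggest 7z.

Sort of.

If you know that only one file has that name, or if you only want one file, you can abort the extraction after the first hit.

For example:

Full scan of the archive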

$ time tar tf /var/log/apache2/old/2016.tar.xz 
2016/
2016/access.log-20161023
2016/access.log-20160724
2016/ssl_access.log-20160711
2016/error.log-20160815
(...)
2016/error.log-20160918
2016/ssl_request.log-20160814
2016/access.log-20161017
2016/access.log-20160516
time: Real 0m1.5s  User 0m1.4s  System 0m0.2s

Scan the archive from memory

$ time tar tf /var/log/apache2/old/2016.tar.xz  > /dev/null 
time: Real 0m1.3s  User 0m1.2s  System 0m0.2s

Abort after the first file

$ time tar tf /var/log/apache2/old/2016.tar.xz  | head -n1 
2016/
time: Real 0m0.0s  User 0m0.0s  System 0m0.0s

Abort after three files

$ time tar tf /var/log/apache2/old/2016.tar.xz  | head -n3 
2016/
2016/access.log-20161023
2016/access.log-20160724
time: Real 0m0.0s  User 0m0.0s  System 0m0.0s

Abort after some file in the "middle"

$ time tar xf /var/log/apache2/old/2016.tar.xz  2016/access.log-20160724  | head -n1 
time: Real 0m0.9s  User 0m0.9s  System 0m0.1s

Abort after some file at the "bottom"

$ time tar xf /var/log/apache2/old/2016.tar.xz  2016/access.log-20160516  | head -n1 
time: Real 0m1.1s  User 0m1.1s  System 0m0.2s

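What I am showing here is that if you kill the output pipe (stdout) of GNU tar, which happens when head -n1 exits after the first line, the tar process terminates as well.

You can see that reading the whole archive takes more time than aborting after a file close to the "bottom" of the archive. You can also see that aborting after hitting a file at the top takes significantly less time.

I would not do it this way if I could decide the format of the archive.

So...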

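Instead of the list comprehension that Python people love so much, iterate over the archive one member at a time and break out of the loop when you hit the desired result, rather than expanding all the files into a list. Note that tar.getmembers() builds the complete member list before returning, so it scans the whole archive by itself; iterating over the tar object directly yields one member at a time.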
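
A minimal sketch of that approach in Python, assuming the wanted member name is known exactly and occurs only once. The function name extract_first_match and its dest_path parameter are made up for illustration; they are not part of the tarfile module. It copies the member's bytes through extractfile() instead of calling extractall().

```python
import shutil
import tarfile


def extract_first_match(archive_path, member_name, dest_path):
    """Copy the first regular file named member_name to dest_path.

    Returns True if the member was found, False otherwise.
    (Hypothetical helper, not part of the tarfile module.)
    """
    with tarfile.open(archive_path) as tar:
        # Iterating over the TarFile object reads one header at a time,
        # whereas tar.getmembers() scans the whole archive up front.
        for tarinfo in tar:
            if tarinfo.name == member_name and tarinfo.isfile():
                src = tar.extractfile(tarinfo)
                with src, open(dest_path, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                # Stop here: nothing after this member is read.
                return True
    return False
```

With the file from the question this would be called as extract_first_match("myfile.tgz", "myfile/first/second/text.txt", "text.txt"). Everything stored before the target still has to be decompressed, so this only saves time when the file is not at the very end of the archive.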