Is it possible to download only part of a ZIP archive (e.g. a single file)?

18

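Is there a way to download only part of a .rar or .zip file without downloading the whole file?

There is a ZIP file containing files A, B, C and D. I only need A. Can I somehow tune the download so that only A is fetched or, if possible, extract the file on the server itself and get only A?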


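Despite the somewhat silly title, I think this is a rather good question. Yes, it is "possible". However, the amount of work required is not small... for an end user it is "not feasible" (unless someone has already created such a tool). - user166390
It depends heavily on your transfer protocol - you obviously need a protocol that can transfer ranges of a file rather than only complete files. For example, if your transfer protocol is NFS, you may find that standard archive tools already do this transparently. - Toby Speight
6 Answers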

12
The trick is to do what Sergio suggests without doing it manually. This is easy if you mount the ZIP file via an HTTP-backed virtual filesystem and then use the standard unzip command on it. This way the unzip utility's I/O calls are translated to HTTP range GETs, which means only the chunks of the ZIP file that you want get transferred over the network.
Here's an example for Linux using HTTPFS, a very lightweight virtual filesystem (it uses FUSE). There are similar tools for Windows.

Get/build httpfs:
$ wget http://sourceforge.net/projects/httpfs/files/httpfs/1.06.07.02
$ mv 1.06.07.02 httpfs_1.06.07.10.tar.bz2
$ tar -xjf httpfs_1.06.07.10.tar.bz2
$ rm httpfs
$ ./make_httpfs

Mount the remote ZIP file and extract one file from it:
$ mkdir mount_pt
$ sudo ./httpfs http://server.com/zipfile.zip mount_pt
$ sudo ls mount_pt
zipfile.zip
$ sudo unzip -p mount_pt/zipfile.zip the_file_I_want.txt > the_file_I_want.txt
$ sudo umount mount_pt

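Of course, you can use any other tool instead of the command-line one. (I needed sudo because that seems to be how things are set up on my machine; you should not need it.)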
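For readers without FUSE, the same idea (turning a ZIP reader's seek/read calls into range requests) can be sketched in Python. This is only a minimal sketch, not what httpfs does internally: `RangeReader` and `http_fetch` are hypothetical names introduced here, the server is assumed to honor `Range` headers, and the demo uses an in-memory archive in place of a remote file so that it runs without a network. It also shows how to list the archive's contents first.

```python
import io
import urllib.request
import zipfile


class RangeReader(io.RawIOBase):
    """Read-only, seekable file object that pulls bytes on demand via fetch(start, end)."""

    def __init__(self, size, fetch):
        self._size = size        # total length of the remote file in bytes
        self._fetch = fetch      # callable returning the inclusive byte range [start, end]
        self._pos = 0
        self.bytes_fetched = 0   # running total, to show how little is transferred

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def readinto(self, buf):
        n = min(len(buf), self._size - self._pos)
        if n <= 0:
            return 0             # end of file
        data = self._fetch(self._pos, self._pos + n - 1)[:n]
        buf[:len(data)] = data
        self._pos += len(data)
        self.bytes_fetched += len(data)
        return len(data)


def http_fetch(url):
    """Build a fetch(start, end) function backed by HTTP Range requests (not run below)."""
    def fetch(start, end):
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            if resp.status != 206:   # 206 Partial Content means the range was honored
                raise OSError("server did not honor the Range header")
            return resp.read()
    return fetch


# Local demonstration: an in-memory archive stands in for the remote file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("A.txt", "the file I want\n")
    zf.writestr("B.bin", bytes(200_000))
archive = buf.getvalue()

reader = RangeReader(len(archive), lambda s, e: archive[s:e + 1])
with zipfile.ZipFile(reader) as zf:
    names = zf.namelist()        # answers "which files are in there?"
    wanted = zf.read("A.txt")
print(names, reader.bytes_fetched, "of", len(archive), "bytes fetched")
```

For a real URL you would get the total size from a HEAD request's Content-Length and pass `http_fetch(url)` as the second argument. Every small read becomes its own request here, so a practical version would probably add some caching.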

2
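Why use sudo? - Marian
Is there a simpler solution? I tried this, but I get annoying errors at the mount point. Also, how do I list the contents of the ZIP file, so that I know the exact name of the file to target in the first place? - Louis
httpfs changed its file names on SourceForge. Replace the first two commands with: wget https://sourceforge.net/projects/httpfs/files/httpfs/1.06.07.02/httpfs_1.06.07.10.tar.bz2 - Shervin Emami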

9
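To a certain extent, yes.
The ZIP file format specifies a "central directory". Basically, this is a table that records which files are in the archive and their offsets.
So, using HTTP range requests (a Range request header, which the server answers with Content-Range), you can download a part of the file from the end (the central directory is the last thing in a ZIP file) and try to identify the central directory in it. If you succeed, you know the file list and the offsets, so you can go on to fetch those chunks individually and decompress them yourself.
This approach is error-prone and not guaranteed to work. But that is how hacks are :-)
Another possible approach is to build a custom server for this (see pst's answer for details).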
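To make the "identify the central directory" step concrete, here is a rough Python sketch. It assumes you already hold the last bytes of the archive (what a request with `Range: bytes=-512` would return) and know the total file size. `list_entries` is a hypothetical helper, not part of any library. It ignores ZIP64 archives, assumes UTF-8 names, and could be fooled by an archive comment that happens to contain the end-of-central-directory signature.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"   # end of central directory record
CEN_SIG = b"PK\x01\x02"    # central directory file header


def list_entries(tail, tail_start):
    """Parse the central directory out of `tail`, the last bytes of a ZIP file.

    `tail_start` is the absolute offset of tail[0] within the full file.
    Returns a list of (name, compressed_size, local_header_offset) tuples.
    """
    eocd = tail.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("EOCD record not in tail; fetch a larger tail")
    # After the signature: two disk numbers, two entry counts,
    # central directory size, central directory offset, comment length.
    _, _, _, total, cd_size, cd_offset, _ = struct.unpack_from("<HHHHIIH", tail, eocd + 4)
    pos = cd_offset - tail_start
    if pos < 0:
        raise ValueError("central directory starts before tail; fetch a larger tail")
    entries = []
    for _ in range(total):
        if tail[pos:pos + 4] != CEN_SIG:
            raise ValueError("bad central directory header")
        comp_size, = struct.unpack_from("<I", tail, pos + 20)
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", tail, pos + 28)
        header_offset, = struct.unpack_from("<I", tail, pos + 42)
        name = tail[pos + 46:pos + 46 + name_len].decode("utf-8", "replace")
        entries.append((name, comp_size, header_offset))
        pos += 46 + name_len + extra_len + comment_len
    return entries


# Local demonstration with an in-memory archive (stored uncompressed so that
# the archive is clearly larger than the tail we look at).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    for name in ("A.txt", "B.txt", "C.txt", "D.txt"):
        zf.writestr(name, name * 1000)
archive = buf.getvalue()

tail_len = min(512, len(archive))   # hypothetical guess at a sufficient tail size
tail = archive[-tail_len:]
entries = list_entries(tail, len(archive) - tail_len)
for name, comp_size, header_offset in entries:
    print(name, comp_size, header_offset)
```

If the tail turns out to be too short, the record found at the end still tells you where the central directory starts, so a second, exactly sized request is enough.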

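I wonder if there is a library that maps HTTP range requests to some kind of perverse stream IO... :) (Actually, this is possible [fsvo], as described, in many languages that accept stream input. Not something I'd want to get into, though.) - user166390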
2
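This is not a hack but the right way to do the job. In fact, HTTP here is just one way of accessing the ZIP stream, and any ZIP component that can work with streams can be used to extract a single file from a remote stream. - Eugene Mayevski 'Callback
@EugeneMayevski'EldoSCorp Yes, you're probably right, I hadn't looked at it that way :-) - Sergio Tulentsev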

3

I wonder whether partial-zip worked for you. To me it looked like a nice promise, but it did not deliver anything. - Jan Vlcinsky

0

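I think Sergio Tulentsev's idea is brilliant.

However, if you control the server (for example, if you can deploy custom code), then, in the scheme of things, mapping/handling a request, extracting the relevant part of the ZIP archive, and sending the data back in the HTTP stream is a fairly trivial operation :)

A request might look like: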

http://foo.bar/myfile.zip_a.jpeg

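This would mean: extract "a.jpeg" from "myfile.zip" and return it.
(I deliberately chose this silly format so that the browser would likely pick "myfile.zip_a.jpeg" as the name when the download dialog appears.)
Of course, how this is implemented depends on the server/language/framework, and there may already be existing solutions that support something similar (but I don't know of any).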
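As a rough illustration of the server-side part, the two steps (splitting the request name, then pulling one member out of the archive) might look like this in Python. Both function names are hypothetical, and wiring them into a real server, including validating the archive name against the files you actually want to expose, is left out.

```python
import io
import zipfile


def split_request(path):
    """Split '/myfile.zip_a.jpeg' into ('myfile.zip', 'a.jpeg')."""
    name = path.lstrip("/")
    archive, sep, member = name.partition(".zip_")
    if not sep or not member:
        raise ValueError("expected <archive>.zip_<member>")
    return archive + ".zip", member


def extract_member(archive_file, member):
    """Return the decompressed bytes of one member; raises KeyError if it is missing."""
    with zipfile.ZipFile(archive_file) as zf:
        return zf.read(member)


# Local demonstration: an in-memory archive stands in for myfile.zip on disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.jpeg", b"not really a JPEG")
    zf.writestr("b.jpeg", b"another member")

archive_name, member_name = split_request("/myfile.zip_a.jpeg")
body = extract_member(io.BytesIO(buf.getvalue()), member_name)
```

The handler would then send `body` as the response, so only the one member travels over the network.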

0

You can arrange for the file you want to sit at the end of the ZIP file.

Download the last 100k:

$ curl -r -100000 https://www.keepassx.org/releases/2.0.2/KeePassX-2.0.2.zip -o tail.zip
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                             Dload  Upload   Total   Spent    Left  Speed
100   97k  100   97k    0     0  84739      0  0:00:01  0:00:01 --:--:-- 84817

Check which files we got:

$ unzip -t tail.zip
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
error [tail.zip]:  attempt to seek before beginning of zipfile
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
error [tail.zip]:  attempt to seek before beginning of zipfile
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
error [tail.zip]:  attempt to seek before beginning of zipfile
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
error [tail.zip]:  attempt to seek before beginning of zipfile
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
    testing: KeePassX-2.0.2/share/translations/keepassx_uk.qm   OK
    testing: KeePassX-2.0.2/share/translations/keepassx_zh_CN.qm   OK
    testing: KeePassX-2.0.2/share/translations/keepassx_zh_TW.qm   OK
    testing: KeePassX-2.0.2/zlib1.dll   OK
At least one error was detected in tail.zip.

Then extract the last file:

$ unzip tail.zip KeePassX-2.0.2/zlib1.dll
Archive:  tail.zip
error [tail.zip]:  missing 7751495 bytes in zipfile
  (attempting to process anyway)
  inflating: KeePassX-2.0.2/zlib1.dll

0

Based on the good input above, I wrote a PowerShell snippet to show how this can work:

# demo code: download a single DLL file from an online ZIP archive,
# extract the DLL in memory and finally load it into the current process.

cls
Remove-Variable * -ea 0

# definition for the ZIP archive, the file to be extracted and the checksum:
$url = 'https://github.com/sshnet/SSH.NET/releases/download/2020.0.1/SSH.NET-2020.0.1-bin.zip'
$sub = 'net40/Renci.SshNet.dll'
$md5 = '5B1AF51340F333CD8A49376B13AFCF9C'

# prepare HTTP client:
Add-Type -AssemblyName System.Net.Http
$handler = [System.Net.Http.HttpClientHandler]::new()
$client  = [System.Net.Http.HttpClient]::new($handler)

# get the length of the ZIP archive:
$req = [System.Net.HttpWebRequest]::Create($url)
$req.Method = 'HEAD'
$length = $req.GetResponse().ContentLength
$zip = [byte[]]::new($length)

# get the last 10k:
# how to get the correct length of the central ZIP directory here?
$start = $length-10kb
$end   = $length-1
$client.DefaultRequestHeaders.Add('Range', "bytes=$start-$end")
$result = $client.GetAsync($url).Result
$last10kb = $result.content.ReadAsByteArrayAsync().Result
$last10kb.CopyTo($zip, $start)

# get the block containing the DLL file:
# how to get the exact file-offset from the ZIP directory?
$start = $length-3537kb
$end   = $length-3201kb
$client.DefaultRequestHeaders.Clear()
$client.DefaultRequestHeaders.Add('Range', "bytes=$start-$end")
$result = $client.GetAsync($url).Result
$block = $result.content.ReadAsByteArrayAsync().Result
$block.CopyTo($zip, $start)

# extract the DLL file from archive:
Add-Type -AssemblyName System.IO.Compression
$stream = [System.IO.Memorystream]::new()
$stream.Write($zip,0,$zip.Length)
$archive = [System.IO.Compression.ZipArchive]::new($stream)
$entry = $archive.GetEntry($sub)
# a single Read() call may return fewer bytes than requested, so copy the whole stream:
$out = [System.IO.MemoryStream]::new()
$entry.Open().CopyTo($out)
$bytes = $out.ToArray()

# check MD5:
$prov = [Security.Cryptography.MD5CryptoServiceProvider]::new().ComputeHash($bytes)
$hash = [string]::Concat($prov.foreach{$_.ToString("x2")})
if ($hash -ne $md5) {write-host 'dll has wrong checksum.' -f y ;break}

# load the DLL:
[void][System.Reflection.Assembly]::Load($bytes)

# use the single demo-call from the DLL:
$test = [Renci.SshNet.NoneAuthenticationMethod]::new('test')
'done.'
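On the two open questions in the script's comments: the end-of-central-directory record at the very end of the archive stores the size and offset of the central directory, and each central directory entry stores the offset of its file's local header plus the compressed size. The exact byte range for one file can then be computed instead of hardcoded. Below is a rough sketch in Python rather than PowerShell. `fetch_member` and `read_range` are hypothetical names, and in this demo the central directory fields come from Python's `zipfile` module instead of being parsed from a downloaded tail.

```python
import io
import struct
import zipfile
import zlib


def fetch_member(read_range, header_offset, compress_size, method):
    """Fetch and decompress one member given its central directory fields.

    read_range(start, end) must return the bytes of the inclusive range,
    for example via an HTTP request with "Range: bytes=start-end".
    """
    header = read_range(header_offset, header_offset + 29)   # fixed 30-byte local header
    if header[:4] != b"PK\x03\x04":
        raise ValueError("not a local file header")
    # The name and extra field follow the fixed part; their lengths sit at offset 26.
    name_len, extra_len = struct.unpack_from("<HH", header, 26)
    data_start = header_offset + 30 + name_len + extra_len
    if compress_size == 0:
        return b""
    raw = read_range(data_start, data_start + compress_size - 1)
    if method == zipfile.ZIP_STORED:
        return raw
    if method == zipfile.ZIP_DEFLATED:
        return zlib.decompress(raw, -15)    # raw deflate stream, no zlib header
    raise NotImplementedError(f"compression method {method}")


# Local demonstration: an in-memory archive stands in for the remote file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name in ("A.txt", "B.txt", "C.txt", "D.txt"):
        zf.writestr(name, f"contents of {name}\n" * 500)
archive = buf.getvalue()

info = zipfile.ZipFile(io.BytesIO(archive)).getinfo("C.txt")
requested = []   # records which byte ranges were asked for


def read_range(start, end):
    requested.append((start, end))
    return archive[start:end + 1]


payload = fetch_member(read_range, info.header_offset, info.compress_size, info.compress_type)
print(requested)
```

Two small range requests per file are enough once the central directory is known. Comparing `zlib.crc32(payload)` with the CRC stored in the directory entry is a cheap integrity check.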
