Creating a tar archive with entries of unknown size in Java

Question

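I have a web application that needs to serve archives of multiple files to users. I set up a generic ArchiveExporter and built a ZipArchiveExporter. It works beautifully! I can stream data into the server, archive it, and stream it out to the user without using much memory and without needing a file system (I'm on Google App Engine).

Then I remembered the whole zip64 business and the 4 GB limit on zip files. My archives can get very large (high-resolution images), so I would like an option that avoids zip for larger inputs.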

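I looked at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream and thought I had found what I needed! Sadly, after checking the documentation and running into some errors, I quickly found that you must declare the size of each entry before writing it. That is a problem, because the data is streamed to me with no way of knowing its size beforehand.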

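I tried counting the bytes written and returning the count from export(), but TarArchiveOutputStream needs the size in the TarArchiveEntry before any data is written, so that obviously cannot work.

I could use a ByteArrayOutputStream and read each entry completely before writing its contents, so that I know its size, but my entries can get very large, and that would not be kind to the other processes running on the instance.

I could use some form of persistence: upload the entry and query its size. However, that would waste my Google Storage API calls, bandwidth, storage, and runtime.

I know of another SO question that asks almost the same thing, but its author ended up using zip files, and there is no further relevant information there.

What is the ideal solution for creating a tar archive with entries of unknown size?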

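import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

public abstract class ArchiveExporter<T extends OutputStream> implements Exporter { //base class
    protected Exporter exporter; //wrapped exporter used by the subclasses (declaration assumed; not shown in the original post)

    public abstract void export(OutputStream out) throws IOException; //from Exporter interface; declares IOException because the overrides throw it
    protected abstract void archiveItems(T t) throws IOException; //protected, matching the overrides below
}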

public class ZipArchiveExporter extends ArchiveExporter<ZipOutputStream> { //zip class, works as intended
    @Override
    public void export(OutputStream out) throws IOException {
        try(ZipOutputStream zos = new ZipOutputStream(out, Charsets.UTF_8)) {
            zos.setLevel(0);
            archiveItems(zos);
        }
    }
    @Override
    protected void archiveItems(ZipOutputStream zos) throws IOException {
        zos.putNextEntry(new ZipEntry(exporter.getFileName()));
        exporter.export(zos);
        //chained call to export from other exporter like json exporter for instance
        zos.closeEntry();
    }
}

public class TarArchiveExporter extends ArchiveExporter<TarArchiveOutputStream> {
    @Override
    public void export(OutputStream out) throws IOException {
        try(TarArchiveOutputStream taos = new TarArchiveOutputStream(out, "UTF-8")) {
            archiveItems(taos);
        }
    }
    @Override
    protected void archiveItems(TarArchiveOutputStream taos) throws IOException {
        TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
        //entry.setSize(?);
        taos.putArchiveEntry(entry);
        exporter.export(taos);
        taos.closeArchiveEntry();
    }
}

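Edit: here is what I had in mind with ByteArrayOutputStream. It works, but I cannot guarantee there will always be enough memory to hold an entire entry at once, which is why I am trying to stream. There must be a more elegant way to stream a tarball! Perhaps this question is better suited to Code Review?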

protected void byteArrayOutputStreamApproach(TarArchiveOutputStream taos) throws IOException {
    TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
    try(ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
        exporter.export(baos);
        byte[] data = baos.toByteArray();
        //holding ENTIRE entry in memory. What if it's huge? What if it has more than Integer.MAX_VALUE bytes? :[
        int len = data.length;
        entry.setSize(len);
        taos.putArchiveEntry(entry);
        taos.write(data);
        taos.closeArchiveEntry();
    }
}

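Edit: this is what I meant by uploading the entry to some medium (Google Cloud Storage in this case) in order to query its exact size. It seems like overkill for what looks like a simple problem, but it does not suffer from the memory problem of the solution above; it only costs bandwidth and time. I hope someone smarter than me comes along soon and makes me feel stupid :D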

protected void googleCloudStorageTempFileApproach(TarArchiveOutputStream taos) throws IOException {
    TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
    String name = NameHelper.getRandomName(); //get random name for temp storage
    BlobInfo blobInfo = BlobInfo.newBuilder(StorageHelper.OUTPUT_BUCKET, name).build(); //prepare upload of temp file
    WritableByteChannel wbc = ApiContainer.storage.writer(blobInfo); //get WriteChannel for temp file
    try(OutputStream out = Channels.newOutputStream(wbc)) {
        exporter.export(out); //stream items to remote temp file
    } finally {
        wbc.close();
    }

    Blob blob = ApiContainer.storage.get(blobInfo.getBlobId());
    long size = blob.getSize(); //accurately query the size after upload
    entry.setSize(size);
    taos.putArchiveEntry(entry);

    ReadableByteChannel rbc = blob.reader(); //get ReadChannel for temp file
    try(InputStream in = Channels.newInputStream(rbc)) { //closing the stream also closes rbc
        IOUtils.copy(in, taos); //stream back to local tar stream from remote temp file
    } finally {
        blob.delete(); //delete remote temp file, even if the copy failed
    }

    taos.closeArchiveEntry();
}

- MeetTitan

1 Answer

- AMADANON Inc. · Accepted Answer

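I have been looking into a similar problem, and as far as I can tell this is a limitation of the tar file format.

Tar files are written as a stream, with the metadata (file name, permissions, and so on) interleaved with the file data: metadata 1, file data 1, metadata 2, file data 2, and so on. A program extracting the archive reads metadata 1 and then starts extracting file data 1, but it has to know where that data ends. There are several ways to signal this; tar does it by including the length in the metadata.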
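
To make the role of the length concrete, here is a minimal sketch of the size field in a classic ustar header: 11 octal digits stored at byte offset 124 of the 512-byte header block that precedes the file data. The class and method names are hypothetical and only illustrate the layout; a library such as commons-compress does this internally.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: shows where the entry length lives in a ustar header.
final class TarHeaderSizeSketch {
    static final int HEADER_LEN = 512;   // every tar header occupies one 512-byte block
    static final int SIZE_OFFSET = 124;  // offset of the size field in a ustar header
    static final int SIZE_DIGITS = 11;   // 11 octal digits, followed by a terminator byte
    static final long MAX_SIZE = (1L << 33) - 1; // largest value 11 octal digits can hold

    // Writes the entry length as zero-padded octal ASCII digits plus a NUL terminator.
    static void writeSize(byte[] header, long size) {
        if (size < 0 || size > MAX_SIZE) {
            throw new IllegalArgumentException("size does not fit the classic field: " + size);
        }
        String octal = Long.toOctalString(size);
        String padded = "00000000000".substring(octal.length()) + octal;
        byte[] digits = padded.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(digits, 0, header, SIZE_OFFSET, SIZE_DIGITS);
        header[SIZE_OFFSET + SIZE_DIGITS] = 0;
    }

    // Reads the entry length back; an extractor uses it to find the end of the data.
    static long readSize(byte[] header) {
        String octal = new String(header, SIZE_OFFSET, SIZE_DIGITS, StandardCharsets.US_ASCII);
        return Long.parseLong(octal, 8);
    }

    // File data is stored in whole 512-byte blocks, so the next header starts here.
    static long paddedLength(long size) {
        return (size + 511) / 512 * 512;
    }
}
```

Because the header is written before the data, the writer has to commit to the length up front, which is why the size cannot be filled in after streaming an entry.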

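Depending on your requirements and on what the recipient expects, I can see several options (not all of them will apply to your situation):

As you mentioned, load the whole file, work out its length, then send it.
Split the file into blocks of a predefined length (small enough to fit in memory), then archive them as file1-part1, file1-part2, and so on; the last block will be shorter.
Split the file into blocks of a predefined length (which need not fit in memory), then pad the last block up to that length with something appropriate.
Work out the maximum possible size of the file, and pad to that size.
Use a different archive format.
Make your own archive format that does not have this limitation.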
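
The second option (fixed-size parts) could be sketched roughly as follows. This is only an illustration under stated assumptions: ChunkingOutputStream and ChunkSink are hypothetical names, not part of commons-compress, and the sink is where each part would be written as its own tar entry once its length is known.

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical helper: buffers at most chunkSize bytes and hands each full
// (or final, shorter) chunk to a sink under the name "<baseName>-partN".
final class ChunkingOutputStream extends OutputStream {
    // Hypothetical callback; with commons-compress it might do:
    //   TarArchiveEntry e = new TarArchiveEntry(name); e.setSize(len);
    //   taos.putArchiveEntry(e); taos.write(buf, 0, len); taos.closeArchiveEntry();
    interface ChunkSink {
        void accept(String entryName, byte[] buf, int len) throws IOException;
    }

    private final String baseName;
    private final byte[] buf;
    private final ChunkSink sink;
    private int filled = 0;
    private int part = 0;

    ChunkingOutputStream(String baseName, int chunkSize, ChunkSink sink) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        this.baseName = baseName;
        this.buf = new byte[chunkSize];
        this.sink = sink;
    }

    @Override
    public void write(int b) throws IOException {
        buf[filled++] = (byte) b;
        if (filled == buf.length) emitChunk();
    }

    @Override
    public void write(byte[] src, int off, int len) throws IOException {
        while (len > 0) {
            int n = Math.min(len, buf.length - filled);
            System.arraycopy(src, off, buf, filled, n);
            filled += n;
            off += n;
            len -= n;
            if (filled == buf.length) emitChunk();
        }
    }

    // Emits the last, possibly shorter chunk. An empty input produces no entries.
    void finish() throws IOException {
        if (filled > 0) emitChunk();
    }

    private void emitChunk() throws IOException {
        part++;
        sink.accept(baseName + "-part" + part, buf, filled);
        filled = 0;
    }
}
```

An exporter that writes to an OutputStream could be pointed at this wrapper and followed by a call to finish(). Memory use is bounded by the chosen chunk size, at the cost of the recipient having to rejoin the parts.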

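Interestingly, gzip has no predefined limit, and multiple gzips can be concatenated, each with its own "original file name". Unfortunately, standard gunzip extracts all of the resulting data into a single file, using the first file name.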
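
The concatenation behavior can be tried with the JDK alone. The sketch below uses hypothetical helper names and writes two gzip members back to back, then reads them as one stream. Note that java.util.zip.GZIPOutputStream does not expose the optional file-name header field, so only the concatenation is demonstrated here, not per-member names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch: two gzip members written one after the other.
final class GzipConcatSketch {
    // Appends one complete gzip member containing the given text to out.
    static void appendMember(ByteArrayOutputStream out, String text) throws IOException {
        GZIPOutputStream gz = new GZIPOutputStream(out);
        gz.write(text.getBytes(StandardCharsets.UTF_8));
        gz.finish(); // writes this member's trailer without closing out
    }

    // Decompresses the whole byte sequence as a single stream.
    static String readAll(byte[] concatenated) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(concatenated))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                result.write(buf, 0, n);
            }
        }
        return new String(result.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Reading the concatenated bytes yields the contents of both members joined together, with no boundary between them. That is the same limitation described above for gunzip: the data survives, but the split into separate files does not.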