Handling a large file fetch using Node.js

Question

javascript node.js node-modules node.js-stream

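I have a Node.js application that needs to fetch a 6GB zip file from Census.gov and then process its contents. However, when fetching the file with the Node.js https API, the download stops at different file sizes. Sometimes it fails at 2GB, sometimes at 1.8GB, and so on. I am never able to download the file completely through the application, but it downloads completely in a browser. Is there any way to download the full file? I cannot start processing the zip until it is fully downloaded, so my processing code waits for the download to complete before executing.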

const file = fs.createWriteStream(fileName);
http.get(url).on("response", function (res) {
      let downloaded = 0;
      res
        .on("data", function (chunk) {
          file.write(chunk);
          downloaded += chunk.length;
          process.stdout.write(`Downloaded ${(downloaded / 1000000).toFixed(2)} MB of ${fileName}\r`);
        })
        .on("end", async function () {
          file.end();
          console.log(`${fileName} downloaded successfully.`);
        });
    });

- Sam

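Are any errors printed in the console? What kind of processing do you want to do on the file? - niceman

Does this answer your question? What is the best way to download a big file in NodeJS? - Christopher

1 Answer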

- jfriend00 · Accepted Answer

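You have no flow control on file.write(chunk). You need to pay attention to the return value of file.write(chunk): when it returns false, you have to wait for the drain event before writing more. Otherwise, you can overflow the buffer on the write stream, particularly when writing large amounts of data to a slow medium such as a disk.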
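To see the return value in action without touching the network or the disk, here is a minimal sketch. The in-memory sink, its 16-byte highWaterMark, the 10 ms delay, and the names makeSlowSink and writeWithBackpressure are all assumptions made up for the illustration, not part of the original code.

```javascript
const { Writable } = require('stream');

// Hypothetical slow sink: buffers at most 16 bytes and takes 10 ms per chunk.
function makeSlowSink() {
    return new Writable({
        highWaterMark: 16,
        write(chunk, encoding, callback) {
            setTimeout(callback, 10);
        }
    });
}

// Writes `count` 8-byte chunks, waiting for 'drain' whenever write() returns false.
// Resolves with the number of times the writer had to wait.
async function writeWithBackpressure(sink, count) {
    let waits = 0;
    for (let i = 0; i < count; i++) {
        const readyForMore = sink.write(Buffer.alloc(8));
        if (!readyForMore) {
            waits++;
            await new Promise(resolve => sink.once('drain', resolve));
        }
    }
    // end() invokes its callback once the stream has finished
    await new Promise(resolve => sink.end(resolve));
    return waits;
}

writeWithBackpressure(makeSlowSink(), 10)
    .then(waits => console.log(`waited for drain ${waits} time(s)`));
```

Because the sink is slower than the writer, write() reports false as soon as the buffered bytes reach the highWaterMark, and the loop stops producing data until drain fires.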

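When you write large amounts of data faster than the disk can keep up and have no flow control, you will probably blow up your memory usage, because the stream has to accumulate more and more data in its buffer.

Since your data is coming from a readable stream, when file.write(chunk) returns false you also have to pause the incoming read stream so that it doesn't keep firing data events at you while you wait for the drain event on the write stream. When you get the drain event, you can resume the read stream.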

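If you don't need the progress info, you can let pipeline() do all the work (including the flow control) so you don't have to write that code yourself. Even when using pipeline(), you can still gather progress info by watching the write stream activity.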
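As a rough sketch of the pipeline() route, the example below uses the promise-based pipeline() from stream/promises (available in Node.js 15 and later). Instead of watching the write stream, it counts bytes in a small pass-through stage, which is just one possible way to get progress. The helper names saveStream and progressCounter, the temp-file target, and the fakeResponse stand-in for the HTTP response are assumptions for the demo.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { Readable, Transform } = require('stream');
const { pipeline } = require('stream/promises');

// Pass-through stage that reports the running byte count (hypothetical helper).
function progressCounter(onProgress) {
    let total = 0;
    return new Transform({
        transform(chunk, encoding, callback) {
            total += chunk.length;
            onProgress(total);
            callback(null, chunk);
        }
    });
}

// Streams `source` into `fileName`. pipeline() handles the flow control,
// forwards errors, and destroys every stream in the chain on failure.
async function saveStream(source, fileName, onProgress) {
    await pipeline(source, progressCounter(onProgress), fs.createWriteStream(fileName));
}

// Stand-in for the HTTP response: three 1 KiB chunks (assumption for the demo).
const fakeResponse = Readable.from([Buffer.alloc(1024), Buffer.alloc(1024), Buffer.alloc(1024)]);
const target = path.join(os.tmpdir(), 'pipeline-demo.bin');

const demo = saveStream(fakeResponse, target, total => {
    process.stdout.write(`Downloaded ${total} bytes\r`);
}).then(() => fs.statSync(target).size);
```

In the real download, the res object passed to the response handler would take the place of fakeResponse.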

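Here's one way to implement the flow control yourself, though if you can, I'd recommend using the pipeline() function in the stream module and letting it do all of this for you: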

const file = fs.createWriteStream(fileName);
file.on("error", err => console.log(err));
http.get(url).on("response", function(res) {
    let downloaded = 0;
    res.on("data", function(chunk) {
        let readyForMore = file.write(chunk);
        if (!readyForMore) {
            // pause readstream until drain event comes
            res.pause();
            file.once('drain', () => {
                res.resume();
            });
        }
        downloaded += chunk.length;
        process.stdout.write(`Downloaded ${(downloaded / 1000000).toFixed(2)} MB of ${fileName}\r`);
    }).on("end", function() {
        file.end();
        console.log(`${fileName} downloaded successfully.`);
    }).on("error", err => console.log(err));
});

It appears there was also a timeout issue on the http request. When I added this:

// set client timeout to 24 hours
res.setTimeout(24 * 60 * 60 * 1000);

I was then able to successfully download the whole 7GB ZIP file.
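For context on how such timeouts behave: in Node.js an idle timeout only reports that the socket has been inactive, and the code has to tear the stream down itself. The sketch below shows one way a stalled download could be detected rather than left hanging. The local stand-in server, the 200 ms idle limit, and the name fetchWithIdleTimeout are assumptions for the demo, and it uses plain http so it can run against localhost.

```javascript
const http = require('http');

// Fetches `url` and resolves with { timedOut, received }. If no data arrives
// for `idleMs` milliseconds, the response is destroyed instead of hanging.
function fetchWithIdleTimeout(url, idleMs) {
    return new Promise((resolve, reject) => {
        const req = http.get(url, res => {
            let received = 0;
            // The timeout only signals inactivity; we must destroy the stream ourselves.
            res.setTimeout(idleMs, () => {
                res.destroy();
                resolve({ timedOut: true, received });
            });
            res.on('data', chunk => { received += chunk.length; });
            res.on('end', () => resolve({ timedOut: false, received }));
            res.on('error', () => {}); // teardown errors after a timeout are expected
        });
        req.on('error', reject);
    });
}

// Local stand-in server (assumption for the demo): promises 100 bytes,
// sends 5, then stalls forever.
const server = http.createServer((req, res) => {
    res.writeHead(200, { 'content-length': 100 });
    res.write('hello');
});

const stalled = new Promise(resolve => server.listen(0, '127.0.0.1', resolve))
    .then(() => fetchWithIdleTimeout(`http://127.0.0.1:${server.address().port}/`, 200))
    .finally(() => server.close());
```

Comparing received against the content-length header at that point would show how much of the file is still missing.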

Here's turnkey code that worked for me:

const fs = require('fs');
const https = require('https');
const url =
    "https://www2.census.gov/programs-surveys/acs/summary_file/2020/data/5_year_entire_sf/All_Geographies_Not_Tracts_Block_Groups.zip";
const fileName = "census-data2.zip";

const file = fs.createWriteStream(fileName);
file.on("error", err => {
    console.log(err);
});
const options = {
    headers: {
        "accept-encoding": "gzip, deflate, br",
    }
};
https.get(url, options).on("response", function(res) {
    const startTime = Date.now();

    function elapsed() {
        const delta = Date.now() - startTime;
        // convert to minutes
        const mins = (delta / (1000 * 60));
        return mins;
    }

    let downloaded = 0;
    console.log(res.headers);
    const contentLength = +res.headers["content-length"];
    console.log(`Expecting download length of ${(contentLength / (1024 * 1024)).toFixed(2)} MB`);
    // set timeout to 24 hours
    res.setTimeout(24 * 60 * 60 * 1000);
    res.on("data", function(chunk) {
        let readyForMore = file.write(chunk);
        if (!readyForMore) {
            // pause readstream until drain event comes
            res.pause();
            file.once('drain', () => {
                res.resume();
            });
        }
        downloaded += chunk.length;
        const downloadPortion = downloaded / contentLength;
        const percent = downloadPortion * 100;
        const elapsedMins = elapsed();
        const totalEstimateMins = (1 / downloadPortion) * elapsedMins;
        const remainingMins = totalEstimateMins - elapsedMins;

        process.stdout.write(
            `  ${elapsedMins.toFixed(2)} mins, ${percent.toFixed(1)}% complete, ${Math.ceil(remainingMins)} mins remaining, downloaded ${(downloaded / (1024 * 1024)).toFixed(2)} MB of ${fileName}                                 \r`
        );
    }).on("end", function() {
        file.end();
        console.log(`${fileName} downloaded successfully.`);
    }).on("error", err => {
        console.log(err);
    }).on("timeout", () => {
        console.log("got timeout event");
    });
});