R, GET and GZ compression

Question

6

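I am building a client for a RESTful API. Some of the links let me download attachments (files) from the server, and in the best case these are .txt files. I only mention the RESTful part because it means I have to send some headers, and possibly a body, with each request, so the standard R "filename" = URL logic won't work.

Sometimes people bundle many txt files into a zip. These are a nuisance because I don't know what they contain until I have downloaded many files.

For the moment, I am unpacking them, gzipping the files (adding the .gz extension), and re-uploading them. They can then be indexed and downloaded.

I am using Hadley's lovely httr package, but I can't see an elegant way to decompress the gz files.

When using read.csv or similar, any file with a .gz ending is decompressed automatically (convenient!). What is the equivalent when using httr or curl?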

content(GET("http://glimmer.rstudio.com/alexbbrown/gz/sample.txt.gz"))
[1] 1f 8b 08 08 4e 9e 9b 51 00 03 73 ...

That looks fine: a compressed byte stream with the correct header (1f 8b). Now I need the text content, so I tried memDecompress, which is supposed to do exactly this:

memDecompress(content(GET("http://glimmer.rstudio.com/alexbbrown/gz/sample.txt.gz")),type="gzip")
Error in memDecompress(content(GET("http://glimmer.rstudio.com/alexbbrown/gz/sample.txt.gz")),  : 
  internal error -3 in memDecompress(2)

What is the proper solution here?

Also, is there a way to get R to fetch the index of a remote .zip file without downloading the whole thing?

- Alex Brown

2 Answers

4

You can add a parser to handle the MIME type. See ?content and this line: "You can add new parsers by adding appropriate functions to httr:::parsers."

ls(httr:::parsers)

#[1] "application/json"                  "application/x-www-form-urlencoded" "image/jpeg"
#[4] "image/png"                         "text/html"                         "text/plain"
#[7] "text/xml"

We can add a function that handles gz content. For now I don't have a better answer than the one you gave, so you can plug your own function in:

assign("application/octet-stream", function(x, ...) {scan(gzcon(rawConnection(x)),"",,,"\n")},envir = httr:::parsers)

content(GET("http://glimmer.rstudio.com/alexbbrown/gz/sample.txt.gz"), as = "parsed")

Read 1 item
[1] "These are not the droids you are looking for"
>

Update: I worked out an alternative approach myself:

assign("application/octet-stream", function(x, ...) {f <- tempfile(); writeBin(x,f);untar(f);readLines(f, warn = FALSE)},envir = httr:::parsers)

content(GET("http://glimmer.rstudio.com/alexbbrown/gz/sample.txt.gz"), as = "parsed")
#[1] "These are not the droids you are looking for"

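As for listing the files in an archive, you could adapt the function slightly. If we try to fetch the httr source files, they come with the MIME type "application/x-gzip".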

assign("application/x-gzip", function(x, ...) {
    f <- tempfile(); 
    writeBin(x,f); 
    if(!is.null(list(...)$list)){
        if(list(...)$list){
            return(untar(f, list = TRUE))
        }else{
            untar(f, ...);
            readLines(f)
        }
    }else{
        untar(f, ...);
        readLines(f)
    }
}, envir = httr:::parsers)

content(GET("http://cran.r-project.org/src/contrib/httr_0.2.tar.gz"), as = "parsed", list = TRUE)

# > head(content(GET("http://cran.r-project.org/src/contrib/httr_0.2.tar.gz"), as = "parsed", list = TRUE))
#[1] "httr/"                 "httr/MD5"              "httr/tests/"          
#[4] "httr/tests/test-all.R" "httr/README.md"        "httr/R/"

- user1609452

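Brilliant! As for getting the index, I would really like to read it without downloading the file, which I think would require partial GETs and the like. I doubt R includes a zip decoder that uses seeking and can read the index directly. I may have to write one myself. - Alex Brown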
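To make the partial-GET idea a little more concrete, here is a minimal sketch of the first step such a decoder would need: finding the End Of Central Directory (EOCD) record in the trailing bytes of a zip, which says where the index lives and how many entries it has. It assumes the tail bytes have already been fetched somehow (for example with a suffix Range header such as "bytes=-65536", provided the server honours Range requests). parse_eocd is a hypothetical helper, not part of R or httr, and the sketch ignores zip64 archives and the case where the archive comment happens to contain the signature bytes.

```r
# Hypothetical helper: locate the EOCD record in the last bytes of a zip
# file and report where the central directory (the index) is stored.
parse_eocd <- function(tail_bytes) {
  sig <- as.raw(c(0x50, 0x4b, 0x05, 0x06))  # "PK\005\006"
  n <- length(tail_bytes)
  if (n < 22) stop("need at least 22 bytes")

  # Little-endian decoders for 2-byte and 4-byte unsigned fields.
  u16 <- function(b) sum(as.integer(b) * c(1, 256))
  u32 <- function(b) sum(as.numeric(as.integer(b)) * 256^(0:3))

  # Scan backwards, because the record sits at the very end of the file
  # (possibly followed by a variable-length comment).
  for (i in seq(n - 21, 1)) {
    if (identical(tail_bytes[i:(i + 3)], sig)) {
      rec <- tail_bytes[i:(i + 21)]
      return(list(
        entries   = u16(rec[11:12]),  # total number of entries
        cd_size   = u32(rec[13:16]),  # size of the central directory
        cd_offset = u32(rec[17:20])   # offset of the central directory
      ))
    }
  }
  stop("no EOCD signature found")
}

# An empty zip archive is just a 22-byte EOCD record.
empty_zip <- c(as.raw(c(0x50, 0x4b, 0x05, 0x06)), raw(18))
parse_eocd(empty_zip)
```

With cd_offset and cd_size in hand, a second ranged request for just those bytes would in principle return the index itself, which would then still need to be decoded entry by entry.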


- Alex Brown · Accepted Answer

The following works, but seems a bit convoluted:

> scan(gzcon(rawConnection(content(GET("http://glimmer.rstudio.com/alexbbrown/gz/sample.txt.gz")))),"",,,"\n")
Read 1 item
[1] "These are not the droids you are looking for"
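The same gzcon(rawConnection(...)) idea can be wrapped in a small function so the call site stays readable. The sketch below is only an illustration: read_gz_lines is a hypothetical name, and the gzip payload is built locally with gzfile so that the example runs without network access. In real use the raw vector would come from content(GET(...)) as above.

```r
# Hypothetical helper: decompress gzip bytes held in a raw vector and
# return the contents as a character vector, one element per line.
read_gz_lines <- function(raw_bytes) {
  con <- gzcon(rawConnection(raw_bytes))
  on.exit(close(con))  # also closes the wrapped raw connection
  readLines(con, warn = FALSE)
}

# Build a small gzip payload locally so the sketch runs offline.
tmp <- tempfile(fileext = ".gz")
gz <- gzfile(tmp, "w")
writeLines("These are not the droids you are looking for", gz)
close(gz)
payload <- readBin(tmp, what = "raw", n = file.info(tmp)$size)

read_gz_lines(payload)
```

Using readLines rather than scan avoids the positional arguments and the "Read 1 item" message, at the cost of always splitting on line breaks.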