Using R to download a zipped data file, extract it, and import the data

Question

156

@EZGraphs writes on Twitter: "Lots of online csvs are zipped. Is there a way to download, unzip the archive, and load the data to a data.frame using R? #Rstats"

I was also trying to do this today, but ended up just downloading the zip file manually.

I tried something like:

fileName <- "http://www.newcl.org/data/zipfiles/a1.zip"
con1 <- unz(fileName, filename="a1.dat", open = "r")

But I feel as though I'm a long way off. Any thoughts?

- Jeromy Anglim

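Did it work? If so, why do you still feel you're a long way off? - FrustratedWithFormsDesigner

@Frustrated... No. The code I posted does not run. See the answers below. - Jeromy Anglim

I added a solution below using library(archive). For me it was the fastest option, and it can also read a specific csv file from an archive without extracting the whole archive first. - Tom Wenseleers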

10 Answers

34

Just for the record, I tried to translate Dirk's answer into code :-P

temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
con <- unz(temp, "a1.dat")
data <- matrix(scan(con),ncol=4,byrow=TRUE)
unlink(temp)

- George Dontas

7

Don't use scan(); you can use read.table() and similar functions directly on the connection. See my edited answer. - Dirk Eddelbuettel

23

I used the CRAN package "downloader", available at http://cran.r-project.org/web/packages/downloader/index.html. Much easier.

library(downloader)
download(url, dest = "dataset.zip", mode = "wb")  # `url` holds the address of the zip file
unzip("dataset.zip", exdir = "./")

- unixcreeper

2

I just use utils::unzip; I don't need the downloader package. - mtelesha

As of 2019, I had to use exdir = '.' - userJT

13

On Mac (and I assume Linux as well)...

If the zip archive contains a single file, you can use the command-line tool funzip together with fread from the data.table package:

library(data.table)
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | funzip")

If the archive contains multiple files, you can use tar instead to extract a specific file to stdout:

dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | tar -xf- --to-stdout *a1.dat")

- dnlbrky

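When I tried your solution for multiple files, I got an error: File is empty: - bshelt141

Neither worked for me. The only thing that worked was: read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols()). It was also the only way to read a .zip file directly from shinyapps.io. - IVIM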

11

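Here is an example that works for files that cannot be read with read.table. This example reads an .xls file.

library(readxl)

url <- "https://www1.toronto.ca/City_Of_Toronto/Information_Technology/Open_Data/Data_Sets/Assets/Files/fire_stns.zip"

temp <- tempfile()   # path for the downloaded archive
temp2 <- tempfile()  # directory for the extracted files

download.file(url, temp, mode = "wb")  # binary mode keeps the archive intact on Windows
unzip(zipfile = temp, exdir = temp2)
data <- read_xls(file.path(temp2, "fire station x_y.xls"))

unlink(c(temp, temp2), recursive = TRUE)  # temp2 is a directory, so recursive = TRUE is needed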

- ColinTea

8

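With library(archive) you can read a specific csv file inside an archive directly, without unzipping it first, e.g. read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols()). I find this more convenient and faster.

The package supports all major archive formats (tar, ZIP, 7-zip, RAR, CAB, gzip, bzip2, compress, lzma, xz, and uuencoded files) and is quite a bit faster than base R's untar or unz.

To extract everything, you can use archive_extract("http://www.newcl.org/data/zipfiles/a1.zip", dir=XXX).

This works on all platforms and performed better for me, so it is my preferred option.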

- Tom Wenseleers

1

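This is a great answer! It let me read a remote .zip file from a shinyapp, which none of the other answers could do. One tip: you do need to use readr::read_csv(...) and readr::cols() here. I tried data.table::fread(...) but it did not work. - IVIM

1

This was the simplest approach for me too! Several of the other methods did not work on the zip file I was trying to open. - undefined

@tnt Glad it worked for you! If the OP likes it, please ask him to mark this answer as accepted... The accepted answers on this site are often out of date... - undefined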

6

Using data.table, I found the following works. Unfortunately the original link no longer works, so I used a link to a different data set.

library(data.table)
temp <- tempfile()
download.file("https://www.bls.gov/tus/special.requests/atusact_0315.zip", temp)
timeUse <- fread(unzip(temp, files = "atusact_0315.dat"))
unlink(temp)  # delete the downloaded archive; rm(temp) would only drop the variable

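I know this can be done in a single line, since you can pass a shell command to fread, but I am not sure how to download a .zip file, extract it, and pass a single file from it to fread.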
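One possible way to hand a single archive member to fread through a shell command is sketched below. It is an illustration under assumptions rather than a tested one-liner: it assumes the Info-ZIP `unzip` tool is on the PATH (its `-p` flag writes a member to standard output), the helper name `unzip_stdout_cmd` is hypothetical, and the archive is assumed to be downloaded to disk first, since as far as I know `unzip` needs a file on disk rather than a stream.

```r
# Build a shell command that streams one archive member to standard output.
# `unzip_stdout_cmd` is a hypothetical helper; it assumes Info-ZIP `unzip` is on the PATH.
unzip_stdout_cmd <- function(zipfile, member) {
  # type = "sh" forces POSIX-style single quoting regardless of platform.
  paste("unzip -p", shQuote(zipfile, type = "sh"), shQuote(member, type = "sh"))
}

cmd <- unzip_stdout_cmd("atusact_0315.zip", "atusact_0315.dat")

# With data.table installed and the archive already downloaded to the working
# directory, the command could be passed to fread() through its `cmd` argument:
# timeUse <- data.table::fread(cmd = cmd)
```

This avoids writing the extracted file to disk, at the cost of depending on an external tool.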

- Mallick Hossain

4

Try the following code. It works for me:

unzip(zipfile="<directory and filename>",
      exdir="<directory where the content will be extracted>")

Example:

unzip(zipfile="./data/Data.zip",exdir="./data")

- Marcelo Tibau

1

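The rio package is well suited to this task: it uses the file extension to determine the file type, so it can handle many kinds of files. I also used unzip() to list the file names inside the zip file, so there is no need to specify them manually.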

library(rio)

# create a temporary directory
td <- tempdir()

# create a temporary file
tf <- tempfile(tmpdir=td, fileext=".zip")

# download file from internet into temporary location
download.file("http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip", tf)

# list zip archive
file_names <- unzip(tf, list=TRUE)

# extract files from zip file
unzip(tf, exdir=td, overwrite=TRUE)

# use when zip file has only one file
data <- import(file.path(td, file_names$Name[1]))

# use when zip file has multiple files
data_multiple <- lapply(file_names$Name, function(x) import(file.path(td, x)))

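# delete the downloaded archive and the extracted files
# (unlink(td) without recursive = TRUE would not remove a directory, and td is the session's temp directory)
unlink(c(tf, file.path(td, file_names$Name)))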

- camnesia

0

I found the following works for me. These steps come from BTD's YouTube video on managing zip files in R:

zip.url <- "url_address.zip"

dir <- getwd()

zip.file <- "file_name.zip"

zip.combine <- as.character(paste(dir, zip.file, sep = "/"))

download.file(zip.url, destfile = zip.combine)

unzip(zip.file)

- Gian Zlupko

- Dirk Eddelbuettel · Accepted Answer

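Zip archives are actually more of a "filesystem" with content metadata and so on. See help(unzip) for details. So to do what you sketch out above, you need to:

Create a temporary file name (e.g. tempfile())
Use download.file() to fetch the file into the temporary file
Use unz() to extract the target file from the temporary file
Remove the temporary file via unlink()

In code (thanks for the basic example, but this is simpler), that looks like: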

temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
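The same four steps can be wrapped in a small function so that the temporary file is removed even when the download or the parse fails. This is a sketch rather than part of the original answer: the name `read_zipped_table` is hypothetical, and `mode = "wb"` is added on the assumption that the code may also run on Windows, where a text-mode download can corrupt a binary archive.

```r
# Download a ZIP archive and read one member with read.table().
# `read_zipped_table` is a hypothetical helper name, not a base R function.
read_zipped_table <- function(url, member, ...) {
  temp <- tempfile(fileext = ".zip")
  # Remove the temporary archive even if the download or the parse fails.
  on.exit(unlink(temp), add = TRUE)
  # mode = "wb" keeps the binary archive intact on Windows.
  download.file(url, temp, mode = "wb", quiet = TRUE)
  # unz() opens a connection to a single file inside the archive.
  read.table(unz(temp, member), ...)
}

# Usage with the URL and member name from the question (requires network
# access, and the link may no longer be live):
# data <- read_zipped_table("http://www.newcl.org/data/zipfiles/a1.zip", "a1.dat")
```

Extra arguments such as `header = TRUE` or `sep = ","` pass through `...` to read.table().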

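Compressed (.z), gzipped (.gz), or bzip2ed (.bz2) files are just a single file each, and you can read them directly from a connection. So get the data provider to use those formats instead :)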
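To illustrate reading a compressed single file straight from a connection, here is a minimal, self-contained sketch that writes a small gzip-compressed CSV and reads it back with gzfile(). The demo data frame and the temporary file are made up for the example.

```r
# Write a small data frame to a gzip-compressed CSV (hypothetical demo data).
gz_path <- tempfile(fileext = ".csv.gz")
demo <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))
con <- gzfile(gz_path, open = "wt")
write.csv(demo, con, row.names = FALSE)
close(con)

# Read it back through a gzfile() connection; no separate decompression step.
restored <- read.csv(gzfile(gz_path))

# Clean up the temporary file.
unlink(gz_path)
```

Unlike the zip case, there is no archive member to name, because the compressed file holds exactly one stream.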