Recursively list files on an FTP server

12

Is there an FTP version of list.files(path, recursive=TRUE)?

I want to get the URLs of all ZIP archives in the subdirectories of this FTP server.

url <- "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/"

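So I would like a list of all files in directories such as
ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/wind/recent/ and
ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/air_temperature/historical/ and so on.

With RCurl I managed to download the directory listing of this one directory, but I cannot get a combined list of all the zip archives in all subdirectories. Any suggestions, other than looping through the directories and fetching the listings one by one?

RCurl code so far: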

library(RCurl)

## list the entries of a single FTP directory (not recursive)
dwd_dirlist <- function(url, full = TRUE){
  dir <- unlist(
    strsplit(
      getURL(url,
             ftp.use.epsv = FALSE,
             dirlistonly = TRUE),
      "\n")
    )
  if(full) dir <- paste0(url, dir)
  return(dir)
}

1
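If you have access to ncftp, you can shell out to ncftpls with its "recursive" option. There are other ways to do this with shell tools. Otherwise, I believe you will end up writing your own recursive list builder. - hrbrmstr
Are you using this to fetch and read several files? Then the new rdwd package might help: https://github.com/brry/rdwd#rdwd. It includes a file index of the observational climate data, a function that lists FTP directories recursively (indexDWD), and an interactive map of the weather stations. - Berry Boessenkool
1 Answer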

11
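If you have the lftp utility installed on your system, you can use its find command to recursively list the files under a given directory. The lftp documentation describes find near the top.
Unfortunately, as you can see from the documentation, and unlike the regular Unix find utility, lftp's find command supports hardly any options, only --max-depth and --list (for a long listing), so you cannot use predicates such as -name and -regex that find normally provides. On the other hand, lftp supports a very unusual but powerful feature: it lets you pipe output to local tools, so you can pipe the find output into a local grep from within the lftp command line. Of course, you could also grep in the shell pipeline or filter in Rland. Here is an example using the lftp pipe (as you can see, the drawback of this approach is that the multiple layers of escaping get fairly complicated). Note that the <<< here-string is not POSIX sh syntax, so the command relies on the shell invoked by system() accepting it: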
url <- 'ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/';
zips <- system(paste0('lftp ',url,' <<<\'find| grep "\\\\.zip$"; exit;\';'),intern=T);
zips;
##    [1] "./air_temperature/historical/stundenwerte_TU_00003_19500401_20110331_hist.zip"
##    [2] "./air_temperature/historical/stundenwerte_TU_00044_20070401_20141231_hist.zip"
##    [3] "./air_temperature/historical/stundenwerte_TU_00052_19760101_19880101_hist.zip"
##    [4] "./air_temperature/historical/stundenwerte_TU_00071_20091201_20141231_hist.zip"
##
## ... snip ...
##
## [6616] "./wind/recent/stundenwerte_FF_15207_akt.zip"
## [6617] "./wind/recent/stundenwerte_FF_15214_akt.zip"
## [6618] "./wind/recent/stundenwerte_FF_15444_akt.zip"
## [6619] "./wind/recent/stundenwerte_FF_15520_akt.zip"
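
Since the question asks for URLs rather than relative paths, the find output can be turned into full URLs in R. The sketch below also does the .zip filtering in R instead of inside lftp, which avoids the nested escaping. toUrls is a hypothetical helper name, and the sample vector is hand-made in the same shape as the output above.

```r
## Hypothetical helper: keep only .zip entries from a find-style listing
## and turn the "./relative/path" entries into full URLs.
toUrls <- function(paths, base) {
    paths <- paths[grepl('\\.zip$', paths)]    ## filter in R instead of grep
    if (!length(paths)) return(character(0))   ## paste0() would otherwise return base
    paths <- sub('^\\./', '', paths)           ## drop the leading "./"
    base <- sub('/*$', '/', base)              ## ensure exactly one trailing slash
    paste0(base, paths)
}

base <- 'ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/'
## hand-made sample input (not fetched from the server)
found <- c(
    './air_temperature/historical/DESCRIPTION_obsgermany_climate_hourly_tu_historical_en.pdf',
    './wind/recent/stundenwerte_FF_15207_akt.zip'
)
toUrls(found, base)
```

With real data, the result of system('lftp ... find ...', intern=TRUE) without the grep stage would take the place of found.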

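Also, just for fun, in case you want another approach: I have written a function that parses the output of an ls -l command with a regex and returns all fields in a data.frame. A simple modification lets it work over ftp using lftp: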

longListing <- function(url='',recursive=F,all=F) {
    ## returns a data.frame of long-listing fields
    ## requires lftp for ftp support

    ## validate arguments
    url <- as.character(url);
    if (length(url) != 1L) stop('url argument must have length 1.');
    recursive <- as.logical(recursive);
    if (length(recursive) != 1L) stop('recursive argument must have length 1.');
    all <- as.logical(all);
    if (length(all) != 1L) stop('all argument must have length 1.');

    ## escape and single-quote url, or leave empty for pwd if empty
    ## (gsub with fixed=TRUE so that every embedded single quote becomes '\'')
    urlEsc <- if (url == '') '' else paste0('\'',gsub("'","'\\''",url,fixed=TRUE),'\'');

    ## construct ls command with options; identical between local ls and lftp ls
    ## technically lftp ls doesn't require -l to get a long listing, but it accepts it
    lsCmd <- paste0('ls -l',if (recursive) ' -R',if (all) ' -A');

    ## run system command to get long-listing output lines
    if (substr(url,0L,6L) == 'ftp://') { ## ftp
        output <- system(paste0('lftp ',urlEsc,' <<<\'',lsCmd,'; exit;\';'),intern=T);
    } else { ## local
        output <- system(paste0(lsCmd,' ',urlEsc,';'),intern=T);
    }; ## end if

    ## define regexes for parsing the output
    ## note: accept question marks for items whose metadata cannot be read
    sp0RE <- '\\s*';
    sp1RE <- '\\s+';
    typeRE <- '([?dlcbps-])';
    rRE <- '([?r-])';
    wRE <- '([?w-])';
    xRE <- '([?xsStT-])';
    aclRE <- '([?+@]*)';
    permRE <- paste0(typeRE,rRE,wRE,xRE,rRE,wRE,xRE,rRE,wRE,xRE,aclRE);
    linksRE <- '(\\?|[0-9]+)';
    ocRE <- '[a-zA-Z_0-9.$+-]';
    ocsRE <- '[a-zA-Z_0-9 .$+-]'; ## badly-behaving names can have spaces; non-greedy will prevent excessive gobbling
    ownerRE <- paste0('(\\?|',ocRE,'|',ocRE,ocsRE,'*?',ocRE,')');
    groupRE <- ownerRE; ## same compatibility rules as owner
    sizeRE <- '(?:\\?|(?:([0-9]+),\\s*)?([0-9]+))'; ## major, minor for special files, plain size for rest
    monthRE <- '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)';
    dayRE <- '([0-9]+)';
    timeRE <- '([0-9]{2}:[0-9]{2}|[0-9]+)'; ## could be year
    dtRE <- paste0('(?:\\?|',monthRE,sp1RE,dayRE,sp1RE,timeRE,')');
    nameRE <- '(.*?)'; ## make non-greedy to allow target to be captured, if present
    targetRE <- '(?:\\s+->\\s+(.*))?'; ## target is optional; shown on some platforms, e.g. Cygwin
    recordRE <- paste0(
        '^'
        ,permRE,sp1RE
        ,linksRE,sp1RE
        ,ownerRE,sp1RE
        ,groupRE,sp1RE
        ,sizeRE,sp1RE
        ,dtRE,sp1RE
        ,nameRE,targetRE ## target is optional; targetRE defines its own whitespace separation
        ,sp0RE,'$' ## ignore trailing whitespace
    );

    ## get indexes of listing records
    recordIndexes <- grep(recordRE,output);

    ## get indexes of blanks and directory headers for maximally robust matching
    blankIndexes <- grep('^\\s*$',output);
    headerIndexes <- grep(':$',output); ## questionable specificity

    ## pare headers down to those with preceding blank
    headerIndexes <- headerIndexes[(headerIndexes-1)%in%c(0L,blankIndexes)]; ## include zero for possible first-line header

    ## match recordIndexes into headerIndexes to look up parent path; direct children will be zero
    recordHeaderIndexes <- findInterval(recordIndexes,headerIndexes);

    ## derive parent paths with trailing slash, or empty string for direct children
    parentPaths <- c('',sub(':','/',output[headerIndexes]))[recordHeaderIndexes+1L];
    parentPaths <- sub('^\\./','',parentPaths); ## for aesthetics

    ## match record lines and extract capture groups
    reg <- regmatches(output[recordIndexes],regexec(recordRE,output[recordIndexes]));

    ## build data.frame with reg fields
    ret <- data.frame(type=sapply(reg,`[`,2L),stringsAsFactors=F); ## start with type to set the row count
    i <- 3L;
    ## note: size is actually minor for character- and block-special files
    for (cn in c('ur','uw','ux','gr','gw','gx','or','ow','ox','acl','links','owner','group','major','size','month','day','time','path','target')) {
        ret[[cn]] <- sapply(reg,`[`,i);
        i <- i+1L;
    }; ## end for

    ## prepend parent paths to listing paths
    ret$path <- paste0(parentPaths,ret$path);

    ret;

}; ## end longListing()
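
The core of the function is regexec() combined with regmatches(), which return the capture groups of each matching line. A much-reduced sketch of that idea on a single hard-coded listing line follows; the simplified pattern and the sample line are illustrative only and are not the full recordRE above.

```r
## One sample line in `ls -l` format (made up for illustration).
line <- '-rw-r--r-- 1 user None 2847045 Mar 27 2015 data/archive_00003.zip'

## Simplified pattern: type, permissions, links, owner, group, size,
## month, day, time-or-year, name.
simpleRE <- paste0(
    '^([dlcbps-])([rwxsStT-]{9})\\s+([0-9]+)\\s+(\\S+)\\s+(\\S+)\\s+([0-9]+)',
    '\\s+([A-Z][a-z]{2})\\s+([0-9]+)\\s+([0-9:]+)\\s+(.*)$'
)

## regexec() locates the match and its groups; regmatches() extracts them.
## Element 1 is the whole match, the remaining elements are the capture groups.
m <- regmatches(line, regexec(simpleRE, line))[[1L]]
fields <- setNames(
    m[-1L],
    c('type', 'perm', 'links', 'owner', 'group', 'size', 'month', 'day', 'time', 'path')
)
fields[c('type', 'size', 'path')]
```

The full function applies the same two calls to every record line at once and then copies each capture group into its own data.frame column.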

Here is a demo on a directory of special files that I created on my system:
longListing();
##    type ur uw ux gr gw gx or ow ox acl links owner group major size month day  time                      path            target
## 1     d  r  w  x  r  -  -  r  -  -   +     1  user  None          0   Feb  27 08:21                       dir
## 2     d  r  w  x  r  w  x  r  w  x   +     1  user  None          0   Feb  27 08:21        dir-other-writable
## 3     d  r  w  x  r  -  -  r  -  T   +     1  user  None          0   Feb  27 08:21                dir-sticky
## 4     d  r  w  x  r  w  x  r  w  t   +     1  user  None          0   Feb  27 08:21 dir-sticky-other-writable
## 5     -  r  w  -  r  -  -  r  -  -         2  user  None          0   Feb  27 08:21                      file
## 6     -  r  w  -  r  -  -  r  -  -         1  user  None          0   Feb  27 08:21          file-archive.tar
## 7     -  r  w  -  r  -  -  r  -  -         1  user  None          0   Feb  27 08:21            file-audio.mp3
## 8     b  r  w  -  r  w  -  r  w  -         1  user  None     0    1   Feb  27 08:21        file-block-special
## 9     c  r  w  -  r  w  -  r  w  -         1  user  None     0    1   Feb  27 08:21    file-character-special
## 10    -  r  w  x  r  w  x  r  w  x         1  user  None         12   Feb  27 08:21                  file-exe
## 11    p  r  w  -  r  w  -  r  w  -         1  user  None          0   Feb  27 08:21                 file-fifo
## 12    -  r  w  -  r  -  -  r  -  -         1  user  None          0   Feb  27 08:21            file-image.bmp
## 13    -  r  w  -  r  w  S  r  -  -         1  user  None          0   Feb  27 08:21               file-setgid
## 14    -  r  w  x  r  w  s  r  -  x         1  user  None          0   Feb  27 08:21           file-setgid-exe
## 15    -  r  w  S  r  w  -  r  -  -         1  user  None          0   Feb  27 08:21               file-setuid
## 16    -  r  w  s  r  w  x  r  -  x         1  user  None          0   Feb  27 08:21           file-setuid-exe
## 17    s  r  w  -  r  w  -  r  -  -         1  user  None          0   Feb  27 08:21               file-socket
## 18    l  r  w  x  r  w  x  r  w  x         1  user  None          4   Feb  27 08:21               ln-existing              file
## 19    -  r  w  -  r  -  -  r  -  -         2  user  None          0   Feb  27 08:21                   ln-hard
## 20    l  r  w  x  r  w  x  r  w  x         1  user  None         17   Feb  27 08:21           ln-non-existing file-non-existing

Demo on your ftp site:
url <- 'ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/';
ll <- longListing(url,T,T);
ll;
##      type ur uw ux gr gw gx or ow ox acl links owner   group major    size month day  time                                                                                                  path target
## 1       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Jun   5  2014                                                                                       air_temperature
## 2       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Sep  25  2014                                                                                            cloudiness
## 3       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Nov  13  2014                                                                                         precipitation
## 4       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Nov  13  2014                                                                                              pressure
## 5       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Jun   5  2014                                                                                      soil_temperature
## 6       d  r  w  x  r  w  x  -  -  x         2 32230 ftp-dwd         12288   Dec  15 11:52                                                                                                 solar
## 7       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Jun   5  2014                                                                                                   sun
## 8       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Apr  17  2015                                                                                                  wind
## 9       d  r  w  x  r  w  x  -  -  x         2 32230 ftp-dwd        114688   Oct  15 12:35                                                                            air_temperature/historical
## 10      d  r  w  x  r  w  x  -  -  x         2 32230 ftp-dwd        151552   Dec   4 10:28                                                                                air_temperature/recent
## 11      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         68727   Jan  26 09:55                air_temperature/historical/BESCHREIBUNG_obsgermany_climate_hourly_tu_historical_de.pdf
## 12      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         68600   Jan  26 09:55                 air_temperature/historical/DESCRIPTION_obsgermany_climate_hourly_tu_historical_en.pdf
## 13      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd        123634   Mar  27  2015                                 air_temperature/historical/TU_Stundenwerte_Beschreibung_Stationen.txt
## 14      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd       2847045   Mar  27  2015                           air_temperature/historical/stundenwerte_TU_00003_19500401_20110331_hist.zip
## 15      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd        359517   Mar  27  2015                           air_temperature/historical/stundenwerte_TU_00044_20070401_20141231_hist.zip
##
## ... snip ...
##
## 6683    -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         65633   Feb  27 10:26                                                             wind/recent/stundenwerte_FF_15207_akt.zip
## 6684    -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         66910   Feb  27 10:21                                                             wind/recent/stundenwerte_FF_15214_akt.zip
## 6685    -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         64525   Feb  27 10:19                                                             wind/recent/stundenwerte_FF_15444_akt.zip
## 6686    -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         23717   Feb  27 10:21                                                             wind/recent/stundenwerte_FF_15520_akt.zip

You can easily extract the zip file names:

zips <- ll$path[ll$type=='-' & grepl('\\.zip$',ll$path)];
length(zips);
## [1] 6619
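
Because every field ends up in a data.frame, further filtering is ordinary data.frame work. As a sketch, assuming a tiny hand-made stand-in for ll with the same type and path columns (the rows are sample values, not a real listing), the zip files can be counted per directory and the ones below a single subdirectory turned into full URLs:

```r
## Tiny hand-made stand-in for the `ll` data.frame returned by longListing().
ll <- data.frame(
    type = c('d', '-', '-', '-'),
    path = c('wind',
             'wind/recent/stundenwerte_FF_15207_akt.zip',
             'wind/recent/stundenwerte_FF_15214_akt.zip',
             'air_temperature/historical/DESCRIPTION_obsgermany_climate_hourly_tu_historical_en.pdf'),
    stringsAsFactors = FALSE
)

## regular files whose name ends in .zip
isZip <- ll$type == '-' & grepl('\\.zip$', ll$path)

## number of zip files per directory
perDir <- table(dirname(ll$path[isZip]))
perDir

## full URLs of the zip files below one subdirectory
url <- 'ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/'
windRecent <- paste0(url, ll$path[isZip & startsWith(ll$path, 'wind/recent/')])
windRecent
```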

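@bgoldst, could you tell me how you installed the lftp utility? My FTP server is on a Windows machine, and my laptop, which runs the R script, is also a Windows machine. I am trying to connect to the FTP server recursively and fetch the files. Thanks. - Samoth
I tried zips <- system(paste0('lftp ',url,' <<<\'find| grep "\\\\.zip$"; exit;\';'),intern=T); on my Windows laptop, but got this error: Error in system(paste0("lftp ", url, " <<<'find| grep \"\\\\.xlsx$\"; exit\';"), : 'lftp' not found - Samoth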
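
Where lftp is not available, as in the comment above, the tree can also be walked from R itself, one directory at a time, which is the loop the question hoped to avoid. The following is a minimal sketch and not part of the original answer: listRecursive and ftpLister are hypothetical names, and ftpLister assumes a Unix-style long listing with no spaces in file names. The walker takes the lister as an argument so that it can be tried on an in-memory tree without a network connection.

```r
## Sketch: depth-first recursive listing built on a pluggable directory lister.
## `lister(url)` must return a data.frame with columns `name` and `isDir`;
## `url` must end with a slash.
listRecursive <- function(url, lister) {
    entries <- lister(url)
    fileNames <- entries$name[!entries$isDir]
    ## guard: paste0(url, character(0)) would return url itself
    files <- if (length(fileNames)) paste0(url, fileNames) else character(0)
    for (d in entries$name[entries$isDir]) {
        files <- c(files, listRecursive(paste0(url, d, '/'), lister))
    }
    files
}

## Assumed lister for Unix-style FTP listings (requires the RCurl package).
## Assumption: names contain no spaces, so the name is the last field of a line.
## Symbolic links are treated as plain files here.
ftpLister <- function(url) {
    txt <- RCurl::getURL(url, ftp.use.epsv = FALSE)
    lines <- strsplit(txt, '\r?\n')[[1L]]
    lines <- lines[nzchar(lines)]
    name <- sub('^.*\\s', '', lines)
    keep <- !name %in% c('.', '..')
    data.frame(name = name[keep],
               isDir = startsWith(lines[keep], 'd'),
               stringsAsFactors = FALSE)
}

## Offline demonstration with a fake in-memory tree (made-up host and names).
tree <- list(
    'ftp://example.invalid/' = data.frame(
        name = c('a', 'readme.txt'), isDir = c(TRUE, FALSE), stringsAsFactors = FALSE),
    'ftp://example.invalid/a/' = data.frame(
        name = 'x.zip', isDir = FALSE, stringsAsFactors = FALSE)
)
fakeLister <- function(u) tree[[u]]
res <- listRecursive('ftp://example.invalid/', fakeLister)
res
```

Against a real server the call would be listRecursive(url, ftpLister), followed by the same grepl('\\.zip$', ...) filter as above. This opens one request per directory, so it is likely to be slower than a single lftp find.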
