Streaming recursive descent of a directory in Haskell

Question

lazy-evaluation directory-structure haskell

11

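I am trying to do a recursive descent of a directory structure using Haskell. I would like to retrieve the child directories and files only as needed (lazily).

I wrote the following code, but when I run it, the trace shows that all directories are visited before the first file: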

module Main where

import Control.Monad ( forM, forM_, liftM )
import Debug.Trace ( trace )
import System.Directory ( doesDirectoryExist, getDirectoryContents )
import System.Environment ( getArgs )
import System.FilePath ( (</>) )

-- From Real World Haskell, p. 214
getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents topPath = do
  names <- getDirectoryContents topPath
  let
    properNames =
      filter (`notElem` [".", ".."]) $
      trace ("Processing " ++ topPath) names
  paths <- forM properNames $ \name -> do
    let path = topPath </> name
    isDirectory <- doesDirectoryExist path
    if isDirectory
      then getRecursiveContents path
      else return [path]
  return (concat paths)

main :: IO ()
main = do
  [path] <- getArgs
  files <- getRecursiveContents path
  forM_ files $ \file -> putStrLn $ "Found file " ++ file

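How can I interleave the file processing with the descent? Is the problem that the files <- getRecursiveContents path action runs to completion before the following forM_ in main starts?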

- Ralph

2

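A later part of the "Searching the file system" chapter of Real World Haskell, named "Another way of looking at traversal", also offers a more flexible approach to walking the file system, using folds and iterators. - Shaun the Sheep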

1

I (obviously) took the function getRecursiveContents from RWH. I had not seen the later section. I'll take a look. Thanks. - Ralph

You may want to look at http://hackage.haskell.org/package/FilePather - singpolyma

4 Answers

7

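Using lazy IO / unsafe... is not a good way to go. Lazy IO causes many problems, including unclosed resources and impure actions being executed inside pure code. (See also The problem with lazy I/O on the Haskell Wiki.)

A safe way is to use an iteratee/enumerator library. (Replacing problematic lazy IO was the motivation for developing these concepts.) Your getRecursiveContents would become a source of data (also known as an enumerator), and the data would then be consumed by some iteratee. (See also Enumerator and iteratee on the Haskell wiki.)

There is a tutorial on the enumerator library that happens to use traversing and filtering a directory tree as its example, implementing a simple find utility. It implements the function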

enumDir :: FilePath -> Enumerator FilePath IO b

which is basically just what you need. I believe you will find it interesting.
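To make the source/consumer split concrete without pulling in any library, here is a minimal sketch in plain IO. It is not the enumerator library's API: Step, foldFiles and firstFiles are hypothetical names made up for this illustration. The traversal hands each file to a consumer function as soon as it is found, and the consumer can stop the walk early, so directories that are never reached are never read.

```haskell
import System.Directory (doesDirectoryExist, listDirectory)
import System.FilePath ((</>))

-- | What the consumer wants after seeing one file (hypothetical type).
data Step a = Continue a | Stop a

-- | Depth-first walk driven by an explicit work list. Each file is passed
-- to the consumer immediately; a directory is only listed when the walk
-- actually reaches it. Like the original code, this follows symlinks and
-- does not guard against symlink cycles.
foldFiles :: (a -> FilePath -> IO (Step a)) -> a -> FilePath -> IO a
foldFiles step acc0 root = go acc0 [root]
  where
    go acc [] = return acc
    go acc (p : rest) = do
      isDir <- doesDirectoryExist p
      if isDir
        then do
          names <- listDirectory p
          -- Children go to the front of the work list: depth-first order.
          go acc (map (p </>) names ++ rest)
        else do
          result <- step acc p
          case result of
            Continue acc' -> go acc' rest
            Stop acc'     -> return acc'

-- | Example consumer: collect at most n files, then stop the walk.
firstFiles :: Int -> FilePath -> IO [FilePath]
firstFiles n root
  | n <= 0    = return []
  | otherwise = reverse <$> foldFiles step [] root
  where
    step acc p
      | length acc + 1 >= n = return (Stop (p : acc))
      | otherwise           = return (Continue (p : acc))
```

For example, firstFiles 3 "." stops after the third file instead of walking the whole tree. The libraries mentioned here generalize this idea so that sources and consumers can be written separately and composed.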

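There is also a nice article explaining iteratees in The Monad Reader, Issue 16: Iteratee: Teaching an Old Fold New Tricks, by John W. Lato, the author of the iteratee library.

Today many people prefer newer libraries such as pipes. You may be interested in a comparison: What are the pros and cons of Enumerators vs. Conduits vs. Pipes?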

- Petr

I have added all the references you gave to my Instapaper account and will read them after work. Thanks. - Ralph

2

Thanks to the comment by Niklas B., here is the solution I have:

module Main where

import Control.Monad ( forM, forM_, liftM )
import Debug.Trace ( trace )
import System.Directory ( doesDirectoryExist, getDirectoryContents )
import System.Environment ( getArgs )
import System.FilePath ( (</>) )
import System.IO.Unsafe ( unsafeInterleaveIO )

-- From Real World Haskell, p. 214
getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents topPath = do
  names <- unsafeInterleaveIO $ getDirectoryContents topPath
  let
    properNames =
      filter (`notElem` [".", ".."]) $
      trace ("Processing " ++ topPath) names
  paths <- forM properNames $ \name -> do
    let path = topPath </> name
    isDirectory <- doesDirectoryExist path
    if isDirectory
      then unsafeInterleaveIO $ getRecursiveContents path
      else return [path]
  return (concat paths)

main :: IO ()
main = do
  [path] <- getArgs
  files <- unsafeInterleaveIO $ getRecursiveContents path
  forM_ files $ \file -> putStrLn $ "Found file " ++ file

Is there a better way?

- Ralph

0

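I was recently looking at a very similar problem: I was trying to do a somewhat complicated search using the IO monad, stopping as soon as I found the file I was interested in. While the solutions using libraries like Enumerator, Conduit, etc. seem to have been the best you could do at the time those answers were posted, I just learned that IO became an instance of Alternative in GHC's base library about a year ago, which opens up some new possibilities. Here is the code I wrote to try it out: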

import Control.Applicative (empty)
import Data.Foldable (asum)
import Data.List (isSuffixOf)
import System.Directory (doesDirectoryExist, listDirectory)
import System.FilePath ((</>))

searchFiles :: (FilePath -> IO a) -> FilePath -> IO a
searchFiles f fp = do
    isDir <- doesDirectoryExist fp
    if isDir
        then do
            entries <- listDirectory fp
            asum $ map (searchFiles f . (fp </>)) entries
        else f fp

matchFile :: String -> FilePath -> IO ()
matchFile name fp
    | name `isSuffixOf` fp = putStrLn $ "Found " ++ fp
    | otherwise = empty

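The searchFiles function does a depth-first search of a directory tree, stopping when it finds what you are looking for, as determined by the function passed as its first argument. The matchFile function is only there to show how to construct a suitable first argument for searchFiles; in real life you would probably do something more complicated.

The interesting thing here is that you can now use empty to make an IO computation "give up" without returning a result, and you can chain computations together with asum (which is just foldr (<|>) empty) to keep trying computations until one of them succeeds.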
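A small self-contained sketch of that behaviour, independent of the file system: attempt and demo are hypothetical names for this illustration only. Each attempt records that it ran and then either gives up with empty or succeeds, so the log shows which attempts asum actually executed.

```haskell
import Control.Applicative (empty)
import Data.Foldable (asum)
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)

-- | Record that an attempt ran, then either give up (Nothing) or succeed.
attempt :: IORef [String] -> String -> Maybe Int -> IO Int
attempt logRef label outcome = do
  modifyIORef logRef (label :)
  maybe empty pure outcome

-- | Returns the first successful result together with the labels of the
-- attempts that actually ran, in order.
demo :: IO (Int, [String])
demo = do
  logRef <- newIORef []
  result <- asum
    [ attempt logRef "a" Nothing   -- gives up, so asum tries the next one
    , attempt logRef "b" (Just 1)  -- succeeds, so asum stops here
    , attempt logRef "c" (Just 2)  -- never runs
    ]
  ran <- readIORef logRef
  return (result, reverse ran)
```

Here demo yields (1, ["a", "b"]). One caveat worth keeping in mind: for IO, (<|>) falls through on any IOException, not only on empty. In searchFiles that means a genuine I/O error in one branch (for example a directory that cannot be read) is silently treated as "not found here" and the search moves on to the next entry.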

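I find it a little unnerving that the type signature of an IO action no longer reflects the fact that it may deliberately not produce a result, but it certainly simplifies the code. I was previously trying to use types like IO (Maybe a), but doing so made it very hard to compose actions.

In my opinion there is no longer much reason to use a type like IO (Maybe a), but if you need to interface with code that uses such a type, it is easy to convert between the two. To convert IO a to IO (Maybe a), just use Control.Applicative.optional; to go the other way, you can use something like this: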

maybeEmpty :: IO (Maybe a) -> IO a
maybeEmpty m = m >>= maybe empty pure
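As a quick check of both directions, here is a small sketch; roundTrip is a hypothetical name used only for this illustration, and maybeEmpty is repeated so the block stands on its own.

```haskell
import Control.Applicative (empty, optional)

-- Same definition as above, repeated so this block is self-contained.
maybeEmpty :: IO (Maybe a) -> IO a
maybeEmpty m = m >>= maybe empty pure

-- | Exercise both conversions once.
roundTrip :: IO (Maybe Int, Maybe Int, Int)
roundTrip = do
  gaveUp    <- optional (empty :: IO Int)  -- an action that gives up becomes Nothing
  succeeded <- optional (pure 3)           -- a successful action becomes Just 3
  unwrapped <- maybeEmpty (pure (Just 4))  -- Just 4 is unwrapped to 4
  return (gaveUp, succeeded, unwrapped)
```

Running roundTrip gives (Nothing, Just 3, 4). Note that optional turns any IOException into Nothing, so it cannot tell a deliberate empty apart from a real I/O failure.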

- John Williams

- Gabriella Gonzalez · Accepted Answer

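This is exactly the kind of problem that iteratees/coroutines were designed to solve.

You can do this easily with pipes. The only change I made to your getRecursiveContents was to make it a Producer of FilePaths and to respond with each file name instead of returning it. This lets downstream handle the file name immediately instead of waiting for getRecursiveContents to complete.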

module Main where

import Control.Monad ( forM_, liftM )
import Control.Proxy
import System.Directory ( doesDirectoryExist, getDirectoryContents )
import System.Environment ( getArgs )
import System.FilePath ( (</>) )

getRecursiveContents :: (Proxy p) => FilePath -> () -> Producer p FilePath IO ()
getRecursiveContents topPath () = runIdentityP $ do
  names <- lift $ getDirectoryContents topPath
  let properNames = filter (`notElem` [".", ".."]) names
  forM_ properNames $ \name -> do
    let path = topPath </> name
    isDirectory <- lift $ doesDirectoryExist path
    if isDirectory
      then getRecursiveContents path ()
      else respond path

main :: IO ()
main = do
    [path] <- getArgs
    runProxy $
            getRecursiveContents path
        >-> useD (\file -> putStrLn $ "Found file " ++ file)

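This prints out each file immediately as it traverses the tree, and it does not require lazy IO. It is also very easy to change what you do with the file names, since all you have to do is swap out the useD stage for your actual file handling logic. To learn more about pipes, I highly recommend reading Control.Proxy.Tutorial.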
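The code above targets the Control.Proxy API of older pipes releases. As a rough sketch of how the same producer might look with pipes 4.x, where the Pipes module replaced Control.Proxy and yield plays the role of respond: this assumes the pipes package is installed, uses listDirectory in place of filtering "." and "..", and printFiles is a hypothetical helper name standing in for main.

```haskell
import Control.Monad (forM_)
import Pipes (Producer, for, lift, runEffect, yield)
import System.Directory (doesDirectoryExist, listDirectory)
import System.FilePath ((</>))

-- | Yield every file under topPath as soon as it is discovered.
getRecursiveContents :: FilePath -> Producer FilePath IO ()
getRecursiveContents topPath = do
  -- listDirectory already omits "." and "..".
  names <- lift $ listDirectory topPath
  forM_ names $ \name -> do
    let path = topPath </> name
    isDirectory <- lift $ doesDirectoryExist path
    if isDirectory
      then getRecursiveContents path
      else yield path

-- | Consume the producer, handling each file as it arrives.
printFiles :: FilePath -> IO ()
printFiles root =
  runEffect $ for (getRecursiveContents root) $ \file ->
    lift $ putStrLn ("Found file " ++ file)
```

As with the useD stage, the body passed to for is the only part that needs to change to do real work on each file.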