"Real World Haskell" 中的 splitWith 函数正确的实现方式是什么？

Question

"Real World Haskell" 中的 splitWith 函数正确的实现方式是什么？

4

我一直在学习《实践中的 Haskell》，并尝试完成其中的练习。我成功地实现了第4.5节练习2中splitWith的工作版本。但我感觉这不是很符合 Haskell 的风格。使用累加器实现一个新函数似乎很绕弯子。有没有更惯用的方法，比如使用 fold？我看了 foldl 的文档，但是还是不太明白怎么用。

splitWith :: (a -> Bool) -> [a] -> [[a]]
splitWith _ [] = []
splitWith f a  = splitWithAcc f a []
  where 
    splitWithAcc :: (a -> Bool) -> [a] -> [[a]] -> [[a]]
    splitWithAcc f xs acc
      | null xs     = acc
      | f $ head xs = splitWithAcc f (dropWhile f xs) (acc ++ [takeWhile f xs])
      | otherwise   = splitWithAcc f (tail xs) acc

澄清

以下是练习题的文本：

编写一个名为splitWith的函数，类似于words函数，它接受一个谓词和一个任意类型的列表，然后在每个谓词返回False的元素上拆分其输入列表：

- Michael Barton

想要的输出是什么？当我运行这段代码 splitWith (== ' ') "This is a test" 时，我得到了 [" "," "," "]，但如果我将它从 == 改为 /=，我会得到 ["This","is","a","test"]。你想在条件为假的地方分割，还是在条件为真的地方分割？ - bheklilr

我添加了练习的文本。正如你所说，它似乎有点不寻常。它希望在谓词为假时进行拆分。 - Michael Barton

我已经更新了我的答案以反映这个变化。 - bheklilr

5个回答

2

想象一下 foldr 从右边开始构建其结果：

splitWith f xs = case foldr g [[]] xs of {([]:r)-> r; r->r}
  where
    g x r@ ~(s:t) | f x = (x:s):t     -- keep `x` if `f x`
                  | null s = r        -- current word already empty
                  | otherwise = []:r  -- split

惰性模式允许将无限列表作为输入。测试：

Prelude> splitWith (/= ' ') "  This is a    test  "
["This","is","a","test"]
Prelude> splitWith (/= ' ') ""
[]
Prelude> take 8 $ splitWith (/= ' ') (cycle "12   12 ")
["12","12","12","12","12","12","12","12"]

- Will Ness

2

在做这个练习时，我想到了以下解决方案。我所知道的所有Haskell都来自于这本书，因此我的解决方案不应包含迄今未在书中提到的任何结构：

splitWith pred (x:xs)
    | pred x    = let (first, rest) = span pred (x:xs)
                  in  first : (splitWith pred rest)
    | otherwise = splitWith pred xs
splitWith pred []    = []

- Joakim Karlsson

1

splitWith' :: (a -> Bool) -> [a] -> [[a]]
splitWith' p xs = foldr with [[]] xs
  where
    with a acc@(as:rest) | p a       = (a:as):rest
                         | otherwise = []:acc

- Satvik

splitWith' (/= ' ') "This is a test "（两边各有两个空格）的结果应该是 ["This","is","a","","test","",""]；splitWith' (/= ' ') "" 的结果应该是 [""]。需要调整逻辑，才能得到分别为 ["This","is","a","test"] 和 [] 的结果。 - Will Ness

@WillNess 我删除了那个问题，仅仅是因为它是一个重复的问题。我现在已经把它恢复了。此外，我知道这个答案没有处理边角情况，但我认为它给了 OP 足够的提示来修改它使其正常工作。难道赚取你的午餐不比只是被服务更好吗？ - Satvik

1

import Data.List (groupBy)
splitWith :: (a -> Bool) -> [a] -> [[a]]
splitWith p = filter (all p) . groupBy ((==) `on` p)

实际上，any 可以替代 all，因为它更便宜，而且 groupBy 保证了满足 p 条件的 [a] 元素会被聚集在一起（所以如果 any 成立，那么 all 也成立）；总之，p . head 也可以替代 all p。

- Enlico

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- bheklilr · Accepted Answer

递归是你的好帮手，但我会稍微改变一下方法。首先，在分割时，我会将我的条件设置为 True 而不是 False。其次，我会使用来自 Data.List 的一个方便函数叫做 break。

> :t break
break :: (a -> Bool) -> [a] -> ([a], [a])
> break (== ' ') "This is a test"
("This", " is a test")

我会使用它来定义我的函数：

splitWith' :: (a -> Bool) -> [a] -> [[a]]
splitWith' cond [] = []
splitWith' cond xs = first : splitWith' cond (safeTail rest)
    where
        (first, rest) = break cond xs
        -- Need this function to handle an empty list
        safeTail [] = []
        safeTail (_:ys) = ys

或者，如果您想尽可能地让它变得混乱

splitWith'' :: (a -> Bool) -> [a] -> [[a]]
splitWith'' _ [] = []
splitWith'' cond xs = uncurry (:) $ fmap (splitWith'' cond . safeTail) $ break cond xs
    where
        safeTail [] = []
        safeTail (_:ys) = ys

这段代码之所以可行，是因为对于2元组，fmap会将函数应用于元组的第二个元素。然后它会解开:的柯里化，并将其应用于第一个元素和剩余部分。更新如果你想在谓词为false时拆分它，你可以使用span代替break，或者只需定义它为：

splitWithWeird cond xs = splitWith' (not . cond) xs

尽管第二个选项显然会产生稍微小一些的开销（除非编译器可以优化掉它）。

更新2

如果您想处理重复字符，如果适合您的需求，有一个简单快速的解决方法：

> filter (not . null) $ splitWithWeird (/= ' ') "This  is   a    test"
["This","is","a","test"]

使用这样简单的修复方法，我们可能会想要将其构建到算法本身中：

splitWithWeird :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird cond [] = []
splitWithWeird cond xs = filter (not . null) $ first : splitWithWeird cond (safeTail rest)
    where
        (first, rest) = span cond xs
        safeTail [] = []
        safeTail (_:ys) = ys

但这是不明智的。由于这是一个递归函数，您在每个级别上都添加了对filter(not.null)的调用，因此在函数中的每个拆分位置上都要进行扫描整个列表，因此需要执行额外的检查。最好将其定义为单独的函数，以便只调用一次filter(not.null)：

splitWithWeird' :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird' cond xs = filter (not . null) $ splitWithWeird cond xs

或者，如果您想将其内置到算法中：

splitWithWeird :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird cond xs = filter (not . null) $ splitWithHelper cond xs
    where
        safeTail [] = []
        safeTail (_:ys) = ys
        splitWithHelper cond [] = []
        splitWithHelper cond xs =
            let (first, rest) = span cond xs
            in first : splitWithHelper cond (safeTail rest)

这实际上只是在内部定义两个函数所做的相同的事情。请注意，我必须在这里使用额外的 let ... in ... 语句（我不喜欢嵌套 where 语句），因为 (first, rest) = span cond xs 属于 splitWithHelper，而不属于 splitWithWeird。如果你将其留在 where 子句中，算法将无法工作。

更新3

为了不仅提供一个不理想的解决方案，我已经编写了一个基于子序列而不仅仅是条件或元素拆分的算法。它确实使用了来自 Control.Arrow 的 first 函数，但仅仅是为了使代码更加紧凑。

import Control.Arrow (first)

isPrefixOf :: Eq a => [a] -> [a] -> Bool
isPrefixOf [] _ = True
isPrefixOf _ [] = False
isPrefixOf (x:xs) (y:ys) = x == y && isPrefixOf xs ys

splitSubseq :: Eq a => [a] -> [a] -> [[a]]
splitSubseq sub [] = []
splitSubseq sub xs = initial : splitSubseq sub rest
    where
        lsub = length sub
        splitter [] = ([], [])
        splitter yss@(y:ys)
            | isPrefixOf sub yss = ([], drop lsub yss)
            | otherwise = first (y :) $ splitter ys
        (initial, rest) = splitter xs

我并不是说这是一种高效的解决方案，但它应该相当容易理解。首先，我定义了一个名为isPrefixOf的函数，如果第二个列表以第一个列表开头，则返回True。

我想保持递归的相同模式（first : recursive rest），因此我编写了splitter来替代span或break，这就是isPrefixOf的作用所在。如果子序列是列表的前缀，则返回([],restAfterSubsequence)，否则它存储列表的第一个字符，然后从下一个元素开始重复此操作。我在这里使用first只是为了能够简洁地递归编写此函数。它只是将(y:)应用于splitter的返回值的第一个元素。从splitter返回的元组的第二个元素只是尚未使用的输入的其余部分。

如果您感兴趣，这是该算法的性能统计数据（使用--make -O2编译，i5四核）：

main = print $ sum $ take (10 ^ 7) $ map length $ splitSubseq " " $ cycle "Testing "

70000000
   6,840,052,808 bytes allocated in the heap
       2,032,868 bytes copied during GC
          42,900 bytes maximum residency (2 sample(s))
          22,636 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     13114 colls,     0 par    0.06s    0.07s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0002s    0.0004s

  TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    3.68s  (  3.74s elapsed)
  GC      time    0.06s  (  0.07s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    3.74s  (  3.81s elapsed)

然后去嵌入求和和长度：

main = print $ sum $ take (10 ^ 7) $ map length $ repeat "Testing"

70000000
     240,052,572 bytes allocated in the heap
          12,812 bytes copied during GC
          42,900 bytes maximum residency (2 sample(s))
          22,636 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       458 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.09s  (  0.09s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.11s  (  0.09s elapsed)

因此，我们可以看到这只需要大约0.1秒的时间，留下大约3.64秒的时间来将一个由“Testing”重复10百万次构成的字符串分割开，同时仅使用少量内存。唯一的缺点是，当使用-threaded编译并在多个核心上运行时，这种算法实际上会显着减慢速度。