如何使用 Parsec 获取字符串中特定模式的子串

4
我是Haskell的新手,现在正在学习使用parsec。我遇到了一个问题,那就是我想获取字符串中满足特定模式的所有子字符串。例如从以下字符串中,
"I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',"-"]。
结果应该是[F12, F12, F1(a), F2a, F5-A, F34-5],其中F和数字之间的空格应该被删除。
使用parsec,我已经成功地获取了一个子字符串,如F12,F2a。 代码如下:
hao :: Parser Char
hao = oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"
tuhao :: Parser String
tuhao = do { c <- char 'F'
           ; many space
           ; c1 <- digit
           ; cs <- many hao
           ; return (c:c1:cs)
           }
parse tuhao "" str -- can parse the str and get one sub-string.

然而,我卡在了如何解析上述示例字符串并获取特定模式的所有子字符串上。我的想法是,如果找到F,则开始解析,否则跳过解析;或者如果解析失败,则跳过解析。但我不知道如何实现这个计划。我有另一个想法,使用State记录未解析的剩余字符串,并使用递归,但仍无法执行。

所以我非常感谢任何提示!^_^


你的输入字符串是什么?我想看一下,但是虽然你给出了期望的输出(["F12", "F12", "F1(a)", "F2a", "F5-A", "F34-5"]),但没有给出输入。 - Adam Smith
@AdamSmith 显然问题陈述本身就是输入:“我想从F1(a)中选择F12F 12…” - Zeta
@AdamSmith 正如Zeta所说,问题描述是输入。 - Z-Y.L
[m| m<- subsequences "F1(a)F2aF5-AF34-5", elem m ["F12","F 12"]] 我得到了一个 ["F12"],这很可能是正确的。 subsequence 按字符序列化,因此无意义的字符可以被删除。我只去掉了空格和逗号。 - fp_mora
2个回答

1

F12F 12F1(a)F2aF5-AF34-5

这是一个不完整的描述,所以我会做一些猜测。

  1. I would start by defining a type that can contain the logical parts of these expressions. E.g.

    newtype F = F (Int, Maybe String) deriving Show
    

    That is, "F" followed by a number and an optional part that is either letters, parenthesised letters, or a dash followed by letters/digits. Since the number after "F" can have multiple digits, I assume that the optional letters/digits may be multiple, too.

    Since the examples are limited, I assume that the following aren't valid: F1a(b), F1(a)b, F1a-5, F1(a)-A, F1a(a)-5, F1a1, F1-(a), etc. and that the following are valid: F1A, F1abc, F1(abc), F1-abc, F1-a1b2. This is probably not true. [1]

  2. I would then proceed to write parsers for each of these sub-parts and compose them:

    module Main where
    
    import Text.Parsec
    import Data.Maybe (catMaybes)
    
    symbol :: String -> Parser String
    symbol s = string s <* spaces
    
    parens :: Parser a -> Parser a
    parens = between (string "(") (string ")")
    
    digits :: Parser Int
    digits = read <$> many1 digit
    
    parseF :: Parser F
    parseF = curry F <$> firstPart <*> secondPart
      where
        firstPart :: Parser Int
        firstPart = symbol "F" >> digits
    
        secondPart :: Parser (Maybe String)
        secondPart = optionMaybe $ choice
          [ many1 letter
          , parens (many1 letter)
          , string "-" >> many1 alphaNum
          ]
    
  3. (As Jon Purdy writes in a comment,) using this parser on a string to get multiple matches,

    extract :: Parser a -> Parser [a]
    extract p = do (:) <$> try p <*> extract p
            <|> do anyChar >> extract p
            <|> do eof >> return []
    
    readFs :: String -> Either ParseError [F]
    readFs s = parse (extract parseF) "" s
    
    main :: IO ()
    main = print (readFs "F12, F 12, F1(a), F2a, F5-A, F34-5")
    

    This prints:

    Right [F (12,Nothing),F (12,Nothing),F (1,Just "a"),F (2,Just "a"),F (5,Just "A"),F (34,Just "5")]
    

要点:

  • 您可以使用令牌分析(symbol)来解析可选的空格。

  • 您可以使用optionoptionMaybeoptional来解析可选部分。

  • 您可以使用a <|> b <|> cchoice [a, b, c]在组合子之间进行交替。

  • 在选择之间进行交替时,请确保它们没有重叠的FIRST集。否则,您需要try;这很麻烦,但有时是不可避免的。(在这种情况下,三个选择的FIRST集分别为letterstring "("string "-",即不重叠。)

[1]: 为了限制条件,我遵循了上述假设,但我认为我也可以假设F1a-BF1(a)-5F1(a)-5A是有效的,如果是这样的话,我可能会改变模型:

newtype F = F (Int, Maybe String, Maybe String)

2
这并没有解决如何从输入中提取所有这些内容的主要问题。您可以使用类似于 catMaybes <$> many (try (Just <$> parseF) <|> (Nothing <$ anyChar))(未经测试)的方法来实现。(注:该代码为Haskell语言代码) - Jon Purdy
@JonPurdy:谢谢,这确实使解析器更有用了。 - sshine
2
@SimonShine和Jon Purdy,非常感谢你们!你们的建议对我非常有帮助,特别是关于manytry的用法,这是处理不需要的字符并使用<$解决实际问题的关键。 - Z-Y.L

0

我们可以使用findAll组合器从replace-megaparsec中获取字符串中特定模式的子字符串。

请注意,这个tuhao解析器实际上并不返回任何内容。 findAll组合器只是检查解析器是否成功找到与模式匹配的子字符串。

import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Maybe
import Data.Either

let tuhao :: Parsec Void String ()
    tuhao = do
        void $ single 'F'
        void $ space
        void $ digitChar
        void $ many $ oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"

input = "I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',\"-\"]."

rights $ fromJust $ parseMaybe (findAll tuhao) input

["F12","F 12","F1(a)","F2a","F5-A","F34-5"]

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接