如何使用 Parsec 获取字符串中特定模式的子串

Question

如何使用 Parsec 获取字符串中特定模式的子串

4

我是Haskell的新手，现在正在学习使用parsec。我遇到了一个问题，那就是我想获取字符串中满足特定模式的所有子字符串。例如从以下字符串中，

"I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',"-"]。

结果应该是[F12, F12, F1(a), F2a, F5-A, F34-5]，其中F和数字之间的空格应该被删除。

使用parsec，我已经成功地获取了一个子字符串，如F12，F2a。代码如下：

hao :: Parser Char
hao = oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"
tuhao :: Parser String
tuhao = do { c <- char 'F'
           ; many space
           ; c1 <- digit
           ; cs <- many hao
           ; return (c:c1:cs)
           }
parse tuhao "" str -- can parse the str and get one sub-string.

然而，我卡在了如何解析上述示例字符串并获取特定模式的所有子字符串上。我的想法是，如果找到F，则开始解析，否则跳过解析；或者如果解析失败，则跳过解析。但我不知道如何实现这个计划。我有另一个想法，使用State记录未解析的剩余字符串，并使用递归，但仍无法执行。

所以我非常感谢任何提示！^_^

- Z-Y.L

你的输入字符串是什么？我想看一下，但是虽然你给出了期望的输出(["F12", "F12", "F1(a)", "F2a", "F5-A", "F34-5"])，但没有给出输入。 - Adam Smith

@AdamSmith 显然问题陈述本身就是输入：“我想从F1（a）中选择F12或F 12…” - Zeta

@AdamSmith 正如Zeta所说，问题描述是输入。 - Z-Y.L

用 [m| m<- subsequences "F1(a)F2aF5-AF34-5", elem m ["F12","F 12"]] 我得到了一个 ["F12"]，这很可能是正确的。 subsequence 按字符序列化，因此无意义的字符可以被删除。我只去掉了空格和逗号。 - fp_mora

2个回答

0

我们可以使用findAll组合器从replace-megaparsec中获取字符串中特定模式的子字符串。

请注意，这个tuhao解析器实际上并不返回任何内容。 findAll组合器只是检查解析器是否成功找到与模式匹配的子字符串。

import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Maybe
import Data.Either

let tuhao :: Parsec Void String ()
    tuhao = do
        void $ single 'F'
        void $ space
        void $ digitChar
        void $ many $ oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"

input = "I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',\"-\"]."

rights $ fromJust $ parseMaybe (findAll tuhao) input

["F12","F 12","F1(a)","F2a","F5-A","F34-5"]

- James Brock

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- sshine · Accepted Answer

F12，F 12，F1(a)，F2a，F5-A，F34-5

这是一个不完整的描述，所以我会做一些猜测。

I would start by defining a type that can contain the logical parts of these expressions. E.g.
```
newtype F = F (Int, Maybe String) deriving Show
```
That is, "F" followed by a number and an optional part that is either letters, parenthesised letters, or a dash followed by letters/digits. Since the number after "F" can have multiple digits, I assume that the optional letters/digits may be multiple, too.

Since the examples are limited, I assume that the following aren't valid: F1a(b), F1(a)b, F1a-5, F1(a)-A, F1a(a)-5, F1a1, F1-(a), etc. and that the following are valid: F1A, F1abc, F1(abc), F1-abc, F1-a1b2. This is probably not true. [1]

I would then proceed to write parsers for each of these sub-parts and compose them:

module Main where

import Text.Parsec
import Data.Maybe (catMaybes)

symbol :: String -> Parser String
symbol s = string s <* spaces

parens :: Parser a -> Parser a
parens = between (string "(") (string ")")

digits :: Parser Int
digits = read <$> many1 digit

parseF :: Parser F
parseF = curry F <$> firstPart <*> secondPart
  where
    firstPart :: Parser Int
    firstPart = symbol "F" >> digits

    secondPart :: Parser (Maybe String)
    secondPart = optionMaybe $ choice
      [ many1 letter
      , parens (many1 letter)
      , string "-" >> many1 alphaNum
      ]

(As Jon Purdy writes in a comment,) using this parser on a string to get multiple matches,

extract :: Parser a -> Parser [a]
extract p = do (:) <$> try p <*> extract p
        <|> do anyChar >> extract p
        <|> do eof >> return []

readFs :: String -> Either ParseError [F]
readFs s = parse (extract parseF) "" s

main :: IO ()
main = print (readFs "F12, F 12, F1(a), F2a, F5-A, F34-5")

This prints:

Right [F (12,Nothing),F (12,Nothing),F (1,Just "a"),F (2,Just "a"),F (5,Just "A"),F (34,Just "5")]

要点:

您可以使用令牌分析(symbol)来解析可选的空格。
您可以使用option、optionMaybe或optional来解析可选部分。
您可以使用a <|> b <|> c或choice [a, b, c]在组合子之间进行交替。
在选择之间进行交替时，请确保它们没有重叠的FIRST集。否则，您需要try；这很麻烦，但有时是不可避免的。(在这种情况下，三个选择的FIRST集分别为letter、string "("和string "-"，即不重叠。)

[1]: 为了限制条件，我遵循了上述假设，但我认为我也可以假设F1a-B，F1(a)-5和F1(a)-5A是有效的，如果是这样的话，我可能会改变模型：

newtype F = F (Int, Maybe String, Maybe String)