How do I create a parser (lex/yacc)?

Question

3

I have the following file, which needs to be parsed:

--TestFile
Start ASDF123
Name "John"
Address "#6,US" 
end ASDF123

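Lines beginning with -- are treated as comment lines. The file begins with 'Start' and ends with 'end'. The string after Start is the UserID, and the name and address are enclosed in double quotes.

I need to parse the file and write the parsed data to an XML file.

So the resulting file will look like this: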

<ASDF123>
  <Name Value="John" />
  <Address Value="#6,US" />
</ASDF123>

Currently I am parsing the file above with pattern matching (regular expressions). Here is my sample code:

    /// <summary>
    /// To Store the row data from the file
    /// </summary>
    List<String> MyList = new List<String>();

    String strName = "";
    String strAddress = "";
    String strInfo = "";

Method: ReadFile

    /// <summary>
    /// To read the file into a List
    /// </summary>
    private void ReadFile()
    {
        StreamReader Reader = new StreamReader(Application.StartupPath + "\\TestFile.txt");
        while (!Reader.EndOfStream)
        {
            MyList.Add(Reader.ReadLine());
        }
        Reader.Close();
    }

Method: FormateRowData

    /// <summary>
    /// To remove comments 
    /// </summary>
    private void FormateRowData()
    {
        MyList = MyList.Where(X => X != "").Where(X => X.StartsWith("--")==false ).ToList();
    }

Method: ParseData

    /// <summary>
    /// To Parse the data from the List
    /// </summary>
    private void ParseData()
    {
        Match l_mMatch;
        Regex RegData = new Regex("start[ \t\r\n]*(?<Data>[a-z0-9]*)", RegexOptions.IgnoreCase);
        Regex RegName = new Regex("name [ \t\r\n]*\"(?<Name>[a-z]*)\"", RegexOptions.IgnoreCase);
        Regex RegAddress = new Regex("address [ \t\r\n]*\"(?<Address>[a-z0-9 #,]*)\"", RegexOptions.IgnoreCase);
        for (int Index = 0; Index < MyList.Count; Index++)
        {
            l_mMatch = RegData.Match(MyList[Index]);
            if (l_mMatch.Success)
                strInfo = l_mMatch.Groups["Data"].Value;
            l_mMatch = RegName.Match(MyList[Index]);
            if (l_mMatch.Success)
                strName = l_mMatch.Groups["Name"].Value;
            l_mMatch = RegAddress.Match(MyList[Index]);
            if (l_mMatch.Success)
                strAddress = l_mMatch.Groups["Address"].Value;
        }
    }

Method: WriteFile

    /// <summary>
    /// To write parsed information into file.
    /// </summary>
    private void WriteFile()
    {
        XDocument XD = new XDocument(
                           new XElement(strInfo,
                                         new XElement("Name",
                                             new XAttribute("Value", strName)),
                                         new XElement("Address",
                                             new XAttribute("Value", strAddress))));
        XD.Save(Application.StartupPath + "\\File.xml");
    }

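I have heard of ParserGenerator.

Please help me write a parser using lex and yacc. My existing parser (pattern matching) is not flexible, and I don't think it is the right approach.

How do I use a ParserGenerator? (I have read Code Project Sample One and Code Project Sample Two, but I am still not familiar with it.) Please suggest some parser generators that output C# parsers.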

- Thorin Oakenshield

2 Answers

1

First, you need to define the grammar for your parser (the Yacc part).

It should look something like this:

file : record file
     | /* empty: without a base case the recursion never terminates */
     ;

record : start identifier recordContent end identifier { /* action: check that the two identifiers match */ }
       ;

recordContent : name value   /* can be more detailed if you require an order among the fields */
              ;

Lexical analysis will be done by lex. I guess your regular expressions will be useful for defining the tokens.
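To make the lexing step concrete, here is a minimal hand-written sketch in C# of what the lexer has to do for the sample file: skip comments and whitespace, and turn everything else into tokens. It is not lex or GPLEX output. The names `TokenKind`, `Token`, and `Lexer` are made up for illustration, and the patterns are only loosely based on the regular expressions in the question. One simplification to be aware of: this sketch treats `--` anywhere outside a quoted string as the start of a comment, not only at the beginning of a line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical token kinds for the sample format; the names are illustrative only.
public enum TokenKind { Start, End, Name, Address, Identifier, QuotedString }

public sealed class Token
{
    public TokenKind Kind { get; }
    public string Text { get; }
    public Token(TokenKind kind, string text) { Kind = kind; Text = text; }
}

public static class Lexer
{
    // One alternative per lexical category, tried left to right at the current offset (\G).
    private static readonly Regex TokenPattern = new Regex(
        @"\G(?:(?<Comment>--[^\r\n]*)|(?<Space>\s+)|(?<Quoted>""[^""\r\n]*"")|(?<Word>[A-Za-z0-9]+))");

    public static List<Token> Tokenize(string input)
    {
        var tokens = new List<Token>();
        int pos = 0;
        while (pos < input.Length)
        {
            Match m = TokenPattern.Match(input, pos);
            if (!m.Success)
                throw new FormatException($"Unexpected character '{input[pos]}' at offset {pos}.");
            pos += m.Length;

            // Comments and whitespace are recognized but produce no token.
            if (m.Groups["Comment"].Success || m.Groups["Space"].Success)
                continue;

            if (m.Groups["Quoted"].Success)
                // Drop the surrounding double quotes.
                tokens.Add(new Token(TokenKind.QuotedString, m.Value.Substring(1, m.Length - 2)));
            else
                tokens.Add(new Token(Classify(m.Value), m.Value));
        }
        return tokens;
    }

    // Keywords are matched case-insensitively, as in the question's regexes;
    // any other word is an identifier.
    private static TokenKind Classify(string word)
    {
        switch (word.ToLowerInvariant())
        {
            case "start": return TokenKind.Start;
            case "end": return TokenKind.End;
            case "name": return TokenKind.Name;
            case "address": return TokenKind.Address;
            default: return TokenKind.Identifier;
        }
    }
}
```

Calling `Lexer.Tokenize` on the sample file should yield eight tokens, from `Start` through the closing identifier. A generated lexer does the same job; you would only write the patterns and let the tool produce the matching loop.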

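My answer is only a first draft. I suggest you look on the internet for a more complete lex/yacc or flex/bison tutorial, and come back here if you have a more focused question.

I also don't know whether there is a C# implementation that would let you stay in managed code. You might have to use unmanaged C/C++ imports.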

- M'vy

- Aasmund Eldhuset · Accepted Answer

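Gardens Point LEX and the Gardens Point Parser Generator are strongly influenced by LEX and YACC, and output C# code.

Your grammar is simple enough that I think your current approach is fine, but it is good that you want to learn the "real" way of doing it. :-) So here is my suggestion for a grammar. These are only the production rules, so this is far from a complete example: the actual GPPG file needs C# code in place of the ... to build the syntax tree, and you need token declarations and so on (read the GPPG examples in the documentation). You also need a GPLEX file that describes the tokens.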

/* Your input file is a list of "top level elements" */
TopLevel : 
    TopLevel TopLevelElement { ... }
    | /* (empty) */

/* A top level element is either a comment or a block. 
   The COMMENT token must be described in the GPLEX file as 
   any line that starts with -- . */
TopLevelElement:
    Block { ... }
    | COMMENT { ... }

/* A block starts with the token START (which, in the GPLEX file, 
   is defined as the string "Start"), continues with some identifier 
   (the block name), then has a list of elements, and finally the token
   END followed by an identifier. If you want to validate that the
   END identifier is the same as the START identifier, you can do that
   in the C# code that analyses the syntax tree built by GPPG.
   The token Identifier is also defined with a regular expression in GPLEX. */
Block:
    START Identifier BlockElementList END Identifier { ... }

BlockElementList:
    BlockElementList BlockElement { ... }
    | /* empty */

BlockElement:
    NAME QuotedString { ... }
    | ADDRESS QuotedString { ... }
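If it helps to see what these productions do before reaching for a generator, here is a hand-written recursive-descent sketch in C# with one method per rule. To be clear, this is not what GPPG generates (GPPG produces a table-driven LALR parser); it is a different technique that happens to follow the same grammar. The class `BlockParser` and its members are hypothetical names, and comments are stripped before tokenizing instead of being handled as a COMMENT token. It also does the check that the END identifier matches the START identifier, and builds the XML elements directly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hand-written recursive-descent sketch (NOT GPPG output); all names are hypothetical.
public sealed class BlockParser
{
    private readonly List<string> _tokens;
    private int _pos;

    private BlockParser(List<string> tokens) { _tokens = tokens; }

    // TopLevel: zero or more Blocks. Comment lines are dropped before tokenizing.
    public static List<XElement> Parse(string input)
    {
        string noComments = Regex.Replace(input, @"^[ \t]*--.*$", "", RegexOptions.Multiline);
        // A token is either a double-quoted string or a run of non-space, non-quote characters.
        List<string> tokens = Regex.Matches(noComments, @"""[^""\r\n]*""|[^\s""]+")
                                   .Cast<Match>().Select(m => m.Value).ToList();
        var parser = new BlockParser(tokens);
        var blocks = new List<XElement>();
        while (parser._pos < tokens.Count)
            blocks.Add(parser.ParseBlock());
        return blocks;
    }

    // Block: START Identifier BlockElementList END Identifier
    private XElement ParseBlock()
    {
        ExpectKeyword("Start");
        string id = Next("an identifier");
        var block = new XElement(id); // throws if id is not a valid XML element name
        while (!PeekIs("end"))        // BlockElementList: repeat until END
            block.Add(ParseBlockElement());
        ExpectKeyword("end");
        string endId = Next("an identifier");
        if (!string.Equals(id, endId, StringComparison.Ordinal))
            throw new FormatException($"Block '{id}' is closed by 'end {endId}'.");
        return block;
    }

    // BlockElement: NAME QuotedString | ADDRESS QuotedString
    private XElement ParseBlockElement()
    {
        string keyword = Next("'Name' or 'Address'");
        string elementName;
        if (keyword.Equals("Name", StringComparison.OrdinalIgnoreCase)) elementName = "Name";
        else if (keyword.Equals("Address", StringComparison.OrdinalIgnoreCase)) elementName = "Address";
        else throw new FormatException($"Expected 'Name' or 'Address' but found '{keyword}'.");

        string quoted = Next("a quoted string");
        if (quoted.Length < 2 || quoted[0] != '"')
            throw new FormatException($"Expected a quoted string but found '{quoted}'.");
        string value = quoted.Substring(1, quoted.Length - 2); // drop the quotes
        return new XElement(elementName, new XAttribute("Value", value));
    }

    private bool PeekIs(string keyword) =>
        _pos < _tokens.Count && _tokens[_pos].Equals(keyword, StringComparison.OrdinalIgnoreCase);

    private string Next(string expected)
    {
        if (_pos >= _tokens.Count)
            throw new FormatException($"Unexpected end of input; expected {expected}.");
        return _tokens[_pos++];
    }

    private void ExpectKeyword(string keyword)
    {
        string token = Next($"'{keyword}'");
        if (!token.Equals(keyword, StringComparison.OrdinalIgnoreCase))
            throw new FormatException($"Expected '{keyword}' but found '{token}'.");
    }
}
```

`BlockParser.Parse(File.ReadAllText(path))` returns one `XElement` per block, which you can wrap in an `XDocument` and save as in the question's `WriteFile`. For a grammar this small the hand-written form is manageable; once the format grows, keeping the rules in a GPPG file and generating the parser should be easier to maintain.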