How do I lex this input?

Question

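I currently have a simple, working programming language implemented in Java using ANTLR. What I want to do is embed it in plain text, in a similar fashion to PHP.

For example: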

Lorem ipsum dolor sit amet
<% print('consectetur adipiscing elit'); %>
Phasellus volutpat dignissim sapien.

I anticipate that the resulting token stream would look something like this:

CDATA OPEN PRINT OPAREN APOS STRING APOS CPAREN SEMI CLOSE CDATA

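How can I achieve this, or is there a better way? There may be no restriction on what is outside the <% block. I assume something like <% print('%>'); %>, as in Michael Mrozek's answer, would be possible, but outside of a situation like that, <% always indicates the start of a code block.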

Sample implementation

I developed a solution based on the ideas in Michael Mrozek's answer, simulating Flex's start conditions with ANTLR's gated semantic predicates:

lexer grammar Lexer;

@members {
    boolean codeMode = false;
}

OPEN    : {!codeMode}?=> '<%' { codeMode = true; } ;
CLOSE   : {codeMode}?=> '%>' { codeMode = false;} ;
LPAREN  : {codeMode}?=> '(';
//etc.

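// '~' can only negate single characters, so "any char that does not start '<%'" is spelled out
CHAR    : {!codeMode}?=> ( ~'<' | {input.LA(2) != '%'}?=> '<' ) ;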


parser grammar Parser;

options {
    tokenVocab = Lexer;
    output = AST;
}

tokens {
    VERBATIM;
}

program :
    (code | verbatim)+
    ;   

code :
    OPEN statement+ CLOSE -> statement+
    ;

verbatim :
    CHAR -> ^(VERBATIM CHAR)
    ;

- etheros

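Do you only want to create a grammar that tokenizes the part between <% and %>, ignoring everything else? - Bart Kiers

I don't want to ignore it entirely; I want to echo it verbatim. - etheros

Is there any restriction on what is outside <% and %>, or can it take any form? Can comments occur outside them (possibly containing <% or %>)? - Bart Kiers

Anything can be outside the <% block. I assume something like <% print('%>'); %>, as in Michael Mrozek's answer, is possible, but outside of that case, <% always indicates the start of a code block. - etheros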

2 Answers

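The actual concept looks fine, although you are unlikely to have a PRINT token; the lexer would probably emit something like IDENTIFIER, and the parser would be responsible for figuring out that it is a function call (e.g. by looking for IDENTIFIER OPAREN ... CPAREN) and doing the appropriate thing.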
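As a rough illustration of that division of labour, here is a minimal plain-Java sketch (not ANTLR). The names `CallSpotter` and `isCallAt` are hypothetical, and tokens are represented only by their type names; a real parser would of course build a tree rather than return a boolean.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: spots IDENTIFIER OPAREN ... CPAREN in a list of token type names. */
final class CallSpotter {
    /** Returns true if the tokens starting at pos look like a call with balanced parentheses. */
    static boolean isCallAt(List<String> tokens, int pos) {
        if (pos + 1 >= tokens.size()) return false;
        if (!tokens.get(pos).equals("IDENTIFIER")) return false;
        if (!tokens.get(pos + 1).equals("OPAREN")) return false;
        int depth = 0;
        for (int i = pos + 1; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("OPAREN")) {
                depth++;
            } else if (t.equals("CPAREN") && --depth == 0) {
                return true;   // found the paren that closes the call
            }
        }
        return false;          // ran out of tokens before the call was closed
    }

    public static void main(String[] args) {
        // Token types the lexer might emit for: print('...');
        List<String> tokens = Arrays.asList("IDENTIFIER", "OPAREN", "STRING", "CPAREN", "SEMI");
        System.out.println(isCallAt(tokens, 0));
    }
}
```

The lexer stays ignorant of which identifiers are functions; only this later stage decides, which is why a dedicated PRINT token is unnecessary.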

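As for how to do it, I don't know anything about ANTLR, but it probably has something like flex's start conditions. If so, you can have the INITIAL start condition do nothing but look for <%, which switches to the CODE state where all the actual tokens are defined; then '%>' switches back. In flex it would be: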

%x CODE

%%

<INITIAL>{
    "<%"      {BEGIN(CODE);}
    .|\n      {}
}

 /* CODE is declared exclusive (%x), so these rules are active only
    between <% and %>. With an inclusive %s declaration, rules written
    without a start condition would also be active in INITIAL.
  */
<CODE>{
    "%>"      {BEGIN(INITIAL);}
    "("       {return OPAREN;}
    "'"       {return APOS;}
    ...
}

You need to be careful about matching %> in a context where it is not a closing marker, such as inside a string; it's up to you whether to allow <% print('%>'); %>, but most likely you do.
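To make the mode switch and the string caveat concrete, here is a minimal hand-written sketch in plain Java rather than flex or ANTLR. `ModeLexer` is a hypothetical name, the token set is limited to the question's example, and a string literal is folded into a single STRING token instead of APOS STRING APOS; all of these are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical two-mode lexer; the boolean plays the role of a flex start condition. */
final class ModeLexer {
    static List<String> lex(String src) {
        List<String> out = new ArrayList<>();
        boolean codeMode = false;   // false ~ INITIAL, true ~ CODE
        int i = 0;
        while (i < src.length()) {
            if (!codeMode) {
                // Outside a block: everything up to the next "<%" is verbatim text.
                int open = src.indexOf("<%", i);
                int end = open < 0 ? src.length() : open;
                if (end > i) out.add("CDATA");
                if (open >= 0) { out.add("OPEN"); codeMode = true; end += 2; }
                i = end;
            } else if (src.startsWith("%>", i)) {
                out.add("CLOSE"); codeMode = false; i += 2;
            } else {
                char c = src.charAt(i);
                if (Character.isWhitespace(c)) {
                    i++;
                } else if (c == '\'') {
                    // Consume the whole literal so a "%>" inside it is never seen as CLOSE.
                    int j = i + 1;
                    while (j < src.length() && src.charAt(j) != '\'') {
                        if (src.charAt(j) == '\\') j++;   // skip the escaped character
                        j++;
                    }
                    out.add("STRING");
                    i = Math.min(j + 1, src.length());
                } else if (Character.isLetter(c)) {
                    int j = i;
                    while (j < src.length() && Character.isLetterOrDigit(src.charAt(j))) j++;
                    out.add("IDENTIFIER");
                    i = j;
                } else if (c == '(') { out.add("OPAREN"); i++; }
                else if (c == ')') { out.add("CPAREN"); i++; }
                else if (c == ';') { out.add("SEMI"); i++; }
                else throw new IllegalArgumentException("unexpected '" + c + "' at " + i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [CDATA, OPEN, IDENTIFIER, OPAREN, STRING, CPAREN, SEMI, CLOSE, CDATA]
        System.out.println(lex("Lorem <% print('%>'); %> ipsum"));
    }
}
```

Because the string branch consumes the literal in one step, the check for %> never runs on characters inside it, which is the behaviour the paragraph above argues you most likely want.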

- Michael Mrozek

- Bart Kiers · Accepted Answer

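"but outside of a situation like that, <% always indicates the start of a code block."

In that case, first scan the file for embedded code, and once you have those blocks, parse them with a dedicated parser (without the noise before and after the <% and %> tags).

ANTLR has an option that lets the lexer match just a (small) part of an input file and ignore the rest. Note that you cannot create a "combined grammar" (a single file containing both parser and lexer) in that case. Here is how you can create such a "partial lexer":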

// file EmbeddedCodeLexer.g
lexer grammar EmbeddedCodeLexer;

options{filter=true;} // <- enables the partial lexing!

EmbeddedCode
  :  '<%'                            // match an open tag
     (  String                       // ( match a string literal
     |  ~('%' | '\'')                //   OR match any char except `%` and `'`
     |  {input.LT(2) != '>'}?=> '%'  //   OR only match a `%` if `>` is not ahead of it
     )*                              // ) <- zero or more times
     '%>'                            // match a close tag
  ;

fragment
String
  :  '\'' ('\\' . | ~('\'' | '\\'))* '\''
  ;

If you now generate a lexer from it:

java -cp antlr-3.2.jar org.antlr.Tool EmbeddedCodeLexer.g

and create a small test harness:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String source = "Lorem ipsum dolor sit amet       \n"+
                "<%                                       \n"+
                "a = 2 > 1 && 10 % 3;                     \n"+
                "print('consectetur %> adipiscing elit'); \n"+
                "%>                                       \n"+
                "Phasellus volutpat dignissim sapien.     \n"+
                "foo <% more code! %> bar                 \n";
        ANTLRStringStream in = new ANTLRStringStream(source);
        EmbeddedCodeLexer lexer = new EmbeddedCodeLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        for(Object o : tokens.getTokens()) {
            System.out.println("=======================================\n"+
                    "EmbeddedCode = "+((Token)o).getText());
        }
    }
}

compile everything:

javac -cp antlr-3.2.jar *.java

and finally run the Main class by doing:

// *nix/MacOS
java -cp .:antlr-3.2.jar Main

// Windows
java -cp .;antlr-3.2.jar Main

it will produce the following output:

=======================================
EmbeddedCode = <%                                       
a = 2 > 1 && 10 % 3;                     
print('consectetur %> adipiscing elit'); 
%>
=======================================
EmbeddedCode = <% more code! %>
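The filter-mode lexer yields only the embedded blocks, while the question also asks for the surrounding text to be echoed verbatim. One way to get both is sketched below in plain Java, without ANTLR. It applies the same scanning idea as the EmbeddedCode rule (a %> inside a single-quoted string does not close the block). `TemplateSplitter` and `Segment` are hypothetical names, and treating an unterminated block as an error is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical splitter: cuts input into verbatim text and the bodies of <% ... %> blocks. */
final class TemplateSplitter {
    static final class Segment {
        final boolean code;   // true for a block body, false for text to echo as-is
        final String text;
        Segment(boolean code, String text) { this.code = code; this.text = text; }
    }

    static List<Segment> split(String src) {
        List<Segment> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            int open = src.indexOf("<%", i);
            if (open < 0) {                          // no more blocks: the rest is verbatim
                out.add(new Segment(false, src.substring(i)));
                break;
            }
            if (open > i) out.add(new Segment(false, src.substring(i, open)));
            int j = open + 2;
            boolean inString = false;
            while (j < src.length()) {
                char c = src.charAt(j);
                if (inString) {
                    if (c == '\\') j++;              // skip the escaped character
                    else if (c == '\'') inString = false;
                } else if (c == '\'') {
                    inString = true;
                } else if (src.startsWith("%>", j)) {
                    break;                           // a close tag outside any string
                }
                j++;
            }
            if (j >= src.length()) {
                throw new IllegalArgumentException("unterminated <% at offset " + open);
            }
            out.add(new Segment(true, src.substring(open + 2, j)));
            i = j + 2;
        }
        return out;
    }

    public static void main(String[] args) {
        for (Segment s : split("Lorem <% print('%>'); %> ipsum")) {
            System.out.println((s.code ? "CODE: " : "TEXT: ") + s.text);
        }
    }
}
```

Each code segment could then be handed to the dedicated parser, and each text segment written straight to the output, preserving the original order.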