Unicode characters in Flex?

3

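I have a simple question about two Unicode characters that I would like to use in my programming language. For assignment, I want to use the old APL symbols ← and →.

My flex file (snazzle.l) looks like this: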

/** phi@gress.ly 2017                     **/
/** parser for omni programming language. **/
%{
#include <iostream>
using namespace std;
#define YY_DECL extern "C" int yylex()
int linenum = 0;
%}

%%
[\n]           {++linenum;}

[ \t]           ;
[0-9]+\.[0-9]+([eE][+-]?[0-9]+)?  { cout << linenum << ". Found a floating-point number: " << yytext << endl; }
\"[^\"]*\"      { cout << linenum << ". Found string: " << yytext << endl; }
[0-9]+          { cout << linenum << ". Found an integer: " << yytext << endl; }
[a-zA-Z0-9]+    { cout << linenum << ". Found an identifier: "   << yytext << endl; }
([\←])|([\→])|(:=)|(=:)  { cout << linenum << ". Found assignment operator: " << yytext <<endl; }
[\;]            { cout << linenum << ". Found statement delimiter: " << yytext <<endl; }
[\[\]\(\)\{\}]  { cout << linenum << ". Found parantheses: " << yytext << endl; }

%%
int main() {
    // lex through the input:
    yylex();
}

When I run "snazzle" on the following input:
x →  y;

I get two errors: (a) the assignment character is printed wrongly, and (b) the assignment is reported three times.

0. Found an identifier: x
0. Found assignment operator: 
0. Found assignment operator: 
0. Found assignment operator: 
0. Found an identifier: y
0. Found statement delimiter: ;

How can I add ← and → as possible flex characters?
1 Answer

5

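Flex generates eight-bit clean scanners; that is, it can handle input made up of arbitrary octets. It knows nothing about UTF-8 or Unicode code points, but that does not stop it from recognizing a Unicode input character as a sequence of octets (rather than as a single character). Which sequence it recognizes depends on the Unicode encoding you use, but assuming your file is UTF-8, → is the three bytes e2 86 92 and ← is e2 86 90.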
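To check those byte values yourself, a minimal C++ sketch like the following can help. It assumes the arrows are encoded as UTF-8; `hex_bytes`, `kRightArrow` and `kLeftArrow` are hypothetical names introduced only for this illustration and are not part of flex.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: return the bytes of `s` as space-separated
// lowercase hex, for example "e2 86 92".
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        out += buf;
    }
    return out;
}

// The two arrows written as explicit UTF-8 byte escapes, so the result
// does not depend on the encoding of this source file.
const std::string kRightArrow = "\xE2\x86\x92";  // U+2192, rightwards arrow
const std::string kLeftArrow  = "\xE2\x86\x90";  // U+2190, leftwards arrow
```

Calling `hex_bytes(kRightArrow)` gives the three-octet sequence that the scanner actually sees in place of the single character →.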

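However, you don't actually need to know that; you can simply put the UTF-8 sequence into your flex pattern. You don't even need to quote it, although quoting is a good idea because it is less confusing if you end up using regular expression operators. Here I really do mean "quote it", as in "←". \← will not work as you expect, because \ applies only to the next octet (as I said, flex knows nothing about Unicode encodings), and that is just the first of the three octets in the symbol. In other words, "←"? really means "an optional left arrow", whereas \←? means "the two octets \xE2 \x86, optionally followed by \x90". I hope that's clear.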

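Flex character classes are not useful for Unicode sequences (or for any other multi-character sequence), because a character class is a set of octets. So if you write [←], flex interprets it as "any one of the octets \xE2, \x86 or \x90". That is why your rule matched each of the arrow's three bytes separately and reported three assignment operators. [Note 1]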
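The difference between a byte-wise character class and a literal sequence can be imitated outside flex. The sketch below is only a simulation under the assumption of UTF-8 input; it is not how flex is implemented, and `count_class_matches` and `count_sequence_matches` are hypothetical helpers.

```cpp
#include <cstddef>
#include <string>

// Imitates a character class such as [→]: the class is just a set of
// octets, so every input byte that belongs to the set is its own match.
std::size_t count_class_matches(const std::string& input,
                                const std::string& octet_set) {
    std::size_t n = 0;
    for (char c : input) {
        if (octet_set.find(c) != std::string::npos) ++n;
    }
    return n;
}

// Imitates a literal pattern such as "→": the whole byte sequence must
// appear in order, and each occurrence counts as one match.
std::size_t count_sequence_matches(const std::string& input,
                                   const std::string& seq) {
    if (seq.empty()) return 0;  // an empty pattern would never advance
    std::size_t n = 0;
    std::size_t pos = input.find(seq);
    while (pos != std::string::npos) {
        ++n;
        pos = input.find(seq, pos + seq.size());
    }
    return n;
}
```

For the input `x →  y;` with → encoded as `\xE2\x86\x92`, the class-style count is 3 and the sequence-style count is 1, which corresponds to the three "Found assignment operator" lines in the question.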

Notes

  1. It is rarely necessary to backslash-escape characters inside flex character classes; the only character which must be backslash-escaped is the backslash itself. It is not an error to escape characters which don't need escaping, so flex won't complain about it, but it makes the character classes hard for humans to read (at least, for this human to read). So [\←] means exactly the same as [←] and you could write [\[\]\(\)\{\}] as [][)(}{]. (] does not close a character class if it is the first character in the class, so it is conventional to write parentheses "face-to-face").

  2. It is also not necessary to parenthesize character sequences inside alternatives, so you could write ([\←])|([\→])|(:=)|(=:) as ←|→|:=|=:. Or, if you prefer, "←"|"→"|":="|"=:". Of course, you wouldn't usually do that, since the scanner normally informs the parser about each individual operator. If your intention is to make ← a synonym of :=, then you would probably end up with:

    ←|:=    { return LEFT_ARROW; }
    →|=:    { return RIGHT_ARROW; }
    
  3. Rather than inserting printf actions in your scanner specification, you would be better off asking flex to put your scanner in debug mode. That is as simple as adding -d to the flex command line when you are building your scanner. See the flex manual section on debugging for more details.


1
Wow, this is really helpful. Thank you very much, rici. I rarely see such good answers on a forum. - iGeeks
1
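Agreed, but although Flex treats UTF-8 octet sequences as 8-bit code points, Flex was never designed to handle Unicode properly. Only very simple UTF-8 patterns are tolerated and processed. There are better free scanner generators for C or C++ that handle the Unicode encodings UTF-8, UTF-16 and UTF-32 correctly (and flag invalid UTF-8, which can be a risk!). Quex or RE/flex are good choices. - Dr. Alex RE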
