在Flex中进行非贪婪正则表达式匹配

Question

在Flex中进行非贪婪正则表达式匹配

7

我刚开始接触Flex，似乎无法弄清如何匹配以下表达式：

"Dog".*"Cat"
------------------
Input :
Dog Ca Cat Cc Cat
------------------
Output:
Dog Ca Cat Cc Cat

但我希望进行非贪婪匹配，并输出以下结果:

Output:
Dog Ca Cat

这如何在Flex上实现？编辑尝试了以下方法：

%%
Dog.*Cat/.*Cat  printf("Matched : ||%s||", yytext);
dog.*cat        printf("Matched : ||%s||", yytext);
dOg[^c]*cAt     printf("Matched : ||%s||", yytext);
DOG.*?CAT       printf("Matched : ||%s||", yytext);
%%

输入：

Dog Ca Cat Cc Cat
dog Ca cat Cc cat
dOg Ca cAt Cc cAt
DOG CA CAT CC CAT

输出:

Matched : ||Dog Ca Cat Cc Cat||
Matched : ||dog Ca cat Cc cat||
Matched : ||dOg Ca cAt|| Cc cAt
Matched : ||DOG CA CAT CC CAT||

同样收到警告:

lex4.l:2: warning, dangerous trailing context

Flex 版本：

flex 2.5.35 Apple(flex-31)

- Kyuubi

3个回答

2

一个好问题。这里提供了一个纯正的正则表达式解决方案，不使用非贪婪的 .*? 语法： Dog([^C]|C+(aC+)*([^Ca]|a[^Ct]))*C+(aC+)*at

- Paul

0

这里是一个针对此问题的极简C++ flex lexer。非贪婪匹配的关键在于flex手册和其他地方提到的起始条件。

起始条件只是lexer的另一个状态。当需要非贪婪匹配时，有一些模式需要在其第一次出现时终止匹配。

通常情况下，无论状态如何，如果您正在寻找目标字符串或模式，则只需确保没有其他更一般的模式可以匹配包含目标模式的更长的输入。

当目标模式有条件且需要在先前匹配后启用时，启动条件很有帮助。您可以打开启动条件以启用匹配目标模式，并通过将状态重置为0或INITIAL-或者切换到另一个状态进行更多的条件匹配来关闭它。

使用BEGIN切换状态-还有用于通过yy_push_state和yy_pop_state进行状态堆栈操作。

在flex手册中有许多起始条件的示例。

以下是展示使用flex起始条件进行非贪婪匹配的flex规则- lexer将匹配行上dog的第一次出现，直到cat的第一次出现- 匹配不区分大小写。

完整的文件已发布在末尾 - 对于不熟悉flex的人，请注意许多行以空格开头 - 这不是偶然的，而是flex所必需的。

%%
 /* flex rules section */

 string match;

dog {
// found a dog, change state to HAVE_DOG to start looking for a cat
 BEGIN(HAVE_DOG);
// save the found dog
 match = yytext;
}

 /* save and keep going till cat is found */
<HAVE_DOG>. match += yytext;

<HAVE_DOG>cat {
// save the found cat
 match += yytext;
// output the matched dog and cat
 cout << match << "\n";
// ignore rest of line
 BEGIN(SKIP_LINE);
}

 /* no cat on this line, reset state */
<HAVE_DOG>\n BEGIN(0);

 /* rules to ignore rest of the line then reset state */
<SKIP_LINE>{
  .*
  \n BEGIN(0);
}

 /* nothing to do yet */
.|\n

这是一些测试输入

$ cat dogcat.in.txt

Dog Ca Cat Cc Cat
dog Ca cat Cc cat
dOg Ca cAt Cc cAt
DOG CA CAT CC CAT
cat dog dog cat cat
dog kitten cat dog cat
dig cat dog can dog cut
dig dug dog cow cat cat
doc dogcat catdog
dog dog dog
cat cat cat

使用以下工具构建

flex -o dogcat.flex.cpp dogcat.flex.l && g++ -o dogcat dogcat.flex.cpp

运行

$ ./dogcat < dogcat.in.txt

Dog Ca Cat
dog Ca cat
dOg Ca cAt
DOG CA CAT
dog dog cat
dog kitten cat
dog cow cat
dogcat

完整的 Flex 文件

 /* dogcat.flex.l */

 /*
 Build with:
 flex -o dogcat.flex.cpp dogcat.flex.l && g++ -o dogcat dogcat.flex.cpp
 */

 /*
 A minimal C++ flex lexer that shows nongreedy matching with flex
 start conditions
 matches the first occurrence of dog on a line till the first
 occurrence of cat 
 matching is case insensitive
 */

 /* C++ lexer using yyFlexLexer in FlexLexer.h */
%option c++

 /* case-insensitive patterns */
%option case-insensitive

 /* generate main function for executable */
%option main

 /* all input must be matched, no echo by default */
%option nodefault

 /* debug output with lexer.set_debug(1) */
%option debug

 /* start condition means dog was matched */
%x HAVE_DOG
 /* start condition means to ignore remaining line */
%x SKIP_LINE

%{

#include <string>
#include <iostream>

// C++ flex lexer class
// needed because header itself has no guard
#ifndef yyFlexLexerOnce
#  include <FlexLexer.h>
#endif

using namespace std;

namespace {
// the C++ lexer class from flex
  yyFlexLexer lexer;

// main generated by flex still calls free yylex function even for C++ lexer
  int yylex() {
    return lexer.yylex();
  }
}

%}

%%
 /* flex rules section */

 string match;

dog {
// found a dog, change state to HAVE_DOG to start looking for a cat
 BEGIN(HAVE_DOG);
// save the found dog
 match = yytext;
}

 /* save and keep going till cat is found */
<HAVE_DOG>. match += yytext;

<HAVE_DOG>cat {
// save the found cat
 match += yytext;
// output the matched dog and cat
 cout << match << "\n";
// ignore rest of line
 BEGIN(SKIP_LINE);
}

 /* no cat on this line, reset state */
<HAVE_DOG>\n BEGIN(0);

 /* rules to ignore rest of the line then reset state */
<SKIP_LINE>{
  .*
  \n BEGIN(0);
}

 /* nothing to do yet */
.|\n

- Zartaj Majeed

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Brian Tompsett - 汤莱恩 · Accepted Answer

这是使用lex/flex工具时常见的问题，这困扰着初学者（有时也会困扰非初学者）。解决这个问题有两种方法，需要使用工具的两种不同高级功能。像 dog ... cat 这样的短语与匹配各种编程语言中的注释（如C中的注释形式 /* ... */ 或甚至是 'comment' ... 'tnemmoc'）问题相同。它们与你的示例具有完全相同的特征。请考虑以下C代码：

/* This is a comment */ "This is a String */"

贪婪的词法匹配会匹配错误的注释终止符（这是一个学生词法分析器的好测试！）。

在几个大学编译器课程中有建议的解决方案。其中一个很好地解释了它，位于这里（曼彻斯特）。该网页引用了几本涵盖这些问题的好书：

J.Levine、T.Mason和D.Brown：Lex and Yacc（第二版）
M.E.Lesk和E.Schmidt：Lex - A Lexical Analyzer Generator

描述的两种技术是使用Start Conditions显式指定状态机，或者使用manual input直接读取字符。

对于你的cat ... dog问题，可以按以下方式编程：

Start Conditions

在此解决方案中，我们需要几个状态。关键字dog使其进入DOG状态，该状态将一直持续到遇到字母c。然后进入LETTERC状态，必须跟随字母a，否则DOG状态将继续；字母a会导致进入CAT状态，后面必须跟随一个字母t，这将导致整个短语被匹配并返回到INITIAL状态。yymore导致整个dog ... cat文本被保留供使用。

%x DOG LETTERC CAT
d [dD]
o [oO]
g [gG]
c [cC]
a [aA]
t [tT]
ws [ \t\r\n]+
%%

<INITIAL>{d}{o}{g} {
        BEGIN(DOG);
        printf("DOG\n");
        yymore();
        }
<DOG>[^cC]*{c} {
        printf("C: %s\n",yytext);
        yymore();
        BEGIN(LETTERC);
        }
<LETTERC>{a} {
       printf("A: %s\n",yytext);
       yymore();
       BEGIN(CAT);
      }
<LETTERC>[^aA] {
        BEGIN(DOG);
        yymore();
        }
<CAT>{t} {
        printf("CAT: %s\n",yytext);
        BEGIN(INITIAL);
        }
<CAT>[^tT] {
        BEGIN(DOG);
        yymore();
        }
<INITIAL>{ws}  /* skip */ ;

手动输入

手动输入方法仅匹配起始短语dog，然后输入C代码，它会吞噬输入字符，直到遇到所需的cat序列。(我没有考虑大小写字母)。这种解决方案的问题在于很难在yytext中保留输入文本值以供稍后在解析器中使用。它会丢弃它，如果构造是注释，则可以接受，但否则不太有用。

d [dD]
o [oO]
g [gG]
ws [ \t\r\n]+
%%
{d}{o}{g}   {
   register int c;

                     for ( ; ; )
                         {
                         /* Not dealt with upper case .. left as an exercise */
                         while ( (c = input()) != 'c' &&
                                 c != EOF )
                             ;    /* eat up text of dog */

                         if ( c == 'c' )
                             {
                              if ( ( c = input()) == 'a' )
                                     if ( (c = input()) == 't' )
                                 break;    /* found the end */
                             }
                        if ( c == EOF )
                             {
                             REJECT;
                             break;
                             }
                         }
            /* because we have used input() yytext always contains "dog" */
            printf("cat: %s\n", yytext);
       }
{ws}  /* skip */ ;

这两种解决方案都经过测试。