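The code below contains a regular expression designed to extract a C# string literal, but the matching performance is terrible for input strings longer than a few characters.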
using System.Diagnostics;
using System.Text.RegularExpressions;

class Program
{
    private static void StringMatch(string s)
    {
        // regex: quote, zero-or-more-(zero-or-more-non-backslash-quote, optional-backslash-anychar), quote
        Match m = Regex.Match(s, "\"(([^\\\\\"]*)(\\\\.)?)*\"");
        if (m.Success)
            Trace.WriteLine(m.Value);
        else
            Trace.WriteLine("no match");
    }

    public static void Main()
    {
        // this first string is unterminated (so the match fails), but it returns instantly
        StringMatch("\"OK");
        // this string is terminated (the match succeeds)
        StringMatch("\"This is a longer terminated string - it matches and returns instantly\"");
        // this string is unterminated (so the match will fail), but it never returns
        StringMatch("\"This is another unterminated string and takes FOREVER to match");
    }
}
I can refactor the regex into a different form, but can anyone explain why the performance is so bad?
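For comparison, here is a minimal sketch of one possible refactoring. It is not necessarily the form the asker had in mind. It assumes the slowdown comes from the nested quantifiers: the inner `[^\\"]*` sits inside an outer `(...)*`, so a run of ordinary characters can be split between iterations in many different ways, and a failing match has to try them all. The sketch consumes each character in exactly one way instead. The class name `RefactoredProgram`, the one-second timeout, and the choice to return the result as a string are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

class RefactoredProgram
{
    // Assumed alternative pattern: quote, zero-or-more-(one non-backslash,
    // non-quote character OR one backslash escape pair), quote.
    // The two alternatives never match the same character, so there is only
    // one way to consume any given run of input.
    // The timeout is a safety net chosen for illustration.
    private static readonly Regex StringLiteral = new Regex(
        "\"(?:[^\\\\\"]|\\\\.)*\"",
        RegexOptions.None,
        TimeSpan.FromSeconds(1));

    public static string StringMatch(string s)
    {
        try
        {
            Match m = StringLiteral.Match(s);
            return m.Success ? m.Value : "no match";
        }
        catch (RegexMatchTimeoutException)
        {
            // Only reached if matching exceeds the timeout above.
            return "timed out";
        }
    }

    public static void Main()
    {
        // unterminated, short
        Console.WriteLine(StringMatch("\"OK"));
        // terminated
        Console.WriteLine(StringMatch("\"This is a longer terminated string - it matches and returns instantly\""));
        // unterminated, long: the same input that hangs with the original pattern
        Console.WriteLine(StringMatch("\"This is another unterminated string and takes FOREVER to match"));
    }
}
```

With this pattern, all three inputs from the question should return promptly, and the two unterminated strings report "no match". Passing a match timeout to the `Regex` constructor is a separate safeguard that can also be applied to the original pattern, so that a pathological input throws `RegexMatchTimeoutException` instead of hanging.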
[^\"] won't stop at \"; it will stop at \ or ". So it will stop at the \ in \n. Is that correct? - xanatos