Regex: how to determine whether a character occurs an odd or even number of times before a given character?

15

I would like to replace | with OR only in unquoted terms, e.g.:

"this | that" | "the | other" -> "this | that" OR "the | other"

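Yes, I could split the string on spaces or quotes, get an array, walk through it, and rebuild the string, but that seems inelegant. So maybe there's a regex way to do this by counting the "s preceding a |: an odd count means the | is quoted, an even count means it is unquoted. (Note: if there is at least one ", processing doesn't start until there is an even number of ".)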


6
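Regular expressions don't do that. The regex libraries that can do it use algorithms that aren't based on regular expressions and can't guarantee the same efficiency. Sinan's answer is based on the observation that the pipes you want to change always sit directly between two quotes (" | "), and the pipes you don't want to change never do. If that holds, it's a good solution. Otherwise, give up on regex. - Daniel C. Sobral
The text of the question has an answer. The title of the question doesn't. - Daniel C. Sobral
9 Answers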

13

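Regexes can't count, but they can be used to determine whether there's an odd or even number of something. The trick in this case is to examine the quotation marks after the pipe, not before it.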

str = str.replace(/\|(?=(?:(?:[^"]*"){2})*[^"]*$)/g, "OR");

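Breaking it down, (?:[^"]*"){2} matches the next pair of quotes, if there is one, along with the non-quote characters in between. After that has been done as many times as possible (which might be zero), [^"]*$ consumes any remaining non-quote characters up to the end of the string. The lookahead therefore succeeds only when an even number of quotes follows the pipe, which means the pipe itself is outside any quoted term.

Of course, this assumes the text is well-formed. It doesn't address the problem of escaped quotes either, but it can if you need it to.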
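
To see the lookahead at work, here is a small sketch that wraps the regex above in a function and runs it on the longer example the asker gives later in the thread. The function name is made up for illustration, and the same assumptions apply: balanced quotes, no escaped quotes.

```javascript
// Hypothetical helper name; the regex is the one from the answer above.
function replaceUnquotedPipes(str) {
  // A pipe is outside quotes when an even number of " characters follows it.
  return str.replace(/\|(?=(?:(?:[^"]*"){2})*[^"]*$)/g, "OR");
}

const sample = '"this | that" | "the | other" | yet | another';
console.log(replaceUnquotedPipes(sample));
// prints: "this | that" OR "the | other" OR yet OR another
```

Note that the lookahead rescans the rest of the string for every pipe, so on very long inputs a single-pass scan may be the better choice.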


5
Regexes don't count. That's what parsers are for.

1
Yes, this problem is a good fit for a state machine. - Sean Cavanagh
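
A state machine for this can be very small. The following is only a sketch of the idea, with an illustrative function name, assuming plain double quotes and no escape sequences: one boolean tracks whether the scan is currently inside a quoted term.

```javascript
// Minimal two-state scanner: "inside quotes" or "outside quotes".
// Assumes no escaped quotes; the function name is illustrative.
function replacePipesOutsideQuotes(input) {
  let inQuotes = false;
  let out = "";
  for (const ch of input) {
    if (ch === '"') {
      inQuotes = !inQuotes; // every quote flips the state
      out += ch;
    } else if (ch === "|" && !inQuotes) {
      out += "OR"; // only pipes seen outside quotes are rewritten
    } else {
      out += ch;
    }
  }
  return out;
}

console.log(replacePipesOutsideQuotes('"this | that" | "the | other" | yet | another'));
// prints: "this | that" OR "the | other" OR yet OR another
```

Unlike the regex approaches, this reads each character exactly once, and an unterminated quote simply leaves the rest of the string untouched.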

4

You might find the Perl FAQ entry on this issue relevant.

#!/usr/bin/perl

use strict;
use warnings;

my $x = qq{"this | that" | "the | other"};
print join('" OR "', split /" \| "/, $x), "\n";
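
Since the asker later says the target is JavaScript, the same split-and-join idea might look like the sketch below. The function name is hypothetical. Like the Perl version, it only handles pipes that sit exactly between two quoted terms (the literal sequence " | "), so it leaves unquoted terms alone.

```javascript
// Split on the literal separator `" | "` and join with `" OR "`.
// Hypothetical helper; it does nothing for pipes between unquoted terms.
function joinQuotedTermsWithOr(str) {
  return str.split('" | "').join('" OR "');
}

console.log(joinQuotedTermsWithOr('"this | that" | "the | other"'));
// prints: "this | that" OR "the | other"
```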

1

You don't need to count, because your quotes don't nest. This will do:

#!/usr/bin/perl

my $str = '" this \" | that" | "the | other" | "still | something | else"';
print "$str\n";

while($str =~ /^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/) {
        $str =~ s/^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/$1OR/;
}

print "$str\n";

Now, let's explain the expression.

^  -- means you'll always match everything from the beginning of the string, otherwise
      the match might start inside a quote, and break everything

(...)\|   -- this means you'll match a certain pattern, followed by a |, which appears
             escaped here; so when you replace it with $1OR, you keep everything, but
             replace the |.

(?:...)*  -- This is a non-capturing group, which can be repeated multiple times; we
             use a group here so we can repeat a set of alternative patterns.

[^"|\\]*  -- This is the first pattern. Anything that isn't a pipe, an escape character
             or a quote.

\\.       -- This is the second pattern. Basically, an escape character and anything
             that follows it.

"(?:...)*" -- This is the third pattern. Open quote, followed by another
              non-capturing group repeated multiple times, followed by a closing
              quote.

[^\\"]    -- This is the first pattern in the second non-capturing group. It's anything
             except an escape character or a quote.

\\.       -- This is the second pattern in the second non-capturing group. It's an
             escape character and whatever follows it.
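
For readers working in JavaScript, the same building blocks (quoted string with escapes, escaped character, bare pipe) can also be applied in a single pass instead of a loop. This is a sketch of a related technique, not a port of the Perl above: the regex matches whole tokens, and a replacement callback rewrites only the bare pipes. The names TOKEN and replaceBarePipes are made up for illustration.

```javascript
// One alternation, tried left to right at each position:
//   "(?:[^"\\]|\\.)*"   a whole quoted string, escapes included
//   \\.                 an escaped character outside quotes
//   \|                  a bare pipe, the only token that gets rewritten
const TOKEN = /"(?:[^"\\]|\\.)*"|\\.|\|/g;

function replaceBarePipes(str) {
  // Quoted strings and escapes are matched so they can be skipped unchanged.
  return str.replace(TOKEN, (token) => (token === "|" ? "OR" : token));
}

const input = '" this \\" | that" | "the | other" | "still | something | else"';
console.log(input);
console.log(replaceBarePipes(input));
```

Because a quoted string is consumed as one token, the pipes inside it are never visited on their own, so no counting or anchoring at the start of the string is needed.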

The result:

" this \" | that" | "the | other" | "still | something | else"
" this \" | that" OR "the | other" OR "still | something | else"

1

Another approach (similar to Alan M's working solution):

str = str.replace(/(".+?"|\w+)\s*\|\s*/g, '$1 OR ');

The part in the first group (spaced out for readability):

".+?"  |  \w+

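...basically means something quoted, or a word. The rest means that it is followed by a "|" wrapped in optional whitespace. The replacement is that first part ("$1" means the first group) followed by " OR ".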
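
A quick way to check a pattern like this is to run it on both sample inputs from the thread. The sketch below uses a made-up wrapper name. One thing it appears to show: when the last term is a quoted phrase containing a pipe, the quoted alternative cannot match (no pipe follows the closing quote), so the engine falls back to \w+ on a word inside the quotes.

```javascript
// Hypothetical wrapper around the regex from the answer above.
function replaceAfterTerm(str) {
  return str.replace(/(".+?"|\w+)\s*\|\s*/g, "$1 OR ");
}

// Works when unquoted terms follow the last quoted one:
console.log(replaceAfterTerm('"this | that" | "the | other" | yet | another'));
// prints: "this | that" OR "the | other" OR yet OR another

// Caveat: a trailing quoted term with an inner pipe gets rewritten too:
console.log(replaceAfterTerm('"this | that" | "the | other"'));
// prints: "this | that" OR "the OR other"
```

If inputs can end in a quoted term, the lookahead-based regex from the top answer avoids this case.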


0

Perhaps you're looking for something like this:

(?<=^(?:[^"]*"[^"]*")*[^"]*)\|

0

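Thanks everyone. Apologies for neglecting to mention that this is in JavaScript, that terms don't have to be quoted, and that there can be any number of quoted/unquoted terms, e.g.: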

"this | that" | "the | other" | yet | another  -> "this | that" OR "the | other" OR yet OR another 

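Daniel, that seems to be in the ballpark, i.e. basically a match-and-rewrite loop. Thanks for the detailed explanation. In JS, it looks like a split, a forEach loop over the array of terms that pushes each term (after changing a | term to OR) back into an array, and then a join.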
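
One way that split/loop/join idea could be sketched in JavaScript is to split on the quote character itself. With balanced quotes and no escapes (both assumptions), the even-indexed pieces lie outside quotes and the odd-indexed pieces lie inside, which is the odd/even test from the question. The function name is illustrative.

```javascript
// Split on ", rewrite pipes only in the pieces that fall outside quotes,
// then put the quotes back. Assumes balanced, unescaped double quotes.
function replaceBySplitting(str) {
  return str
    .split('"')
    .map((piece, i) => (i % 2 === 0 ? piece.replace(/\|/g, "OR") : piece))
    .join('"');
}

console.log(replaceBySplitting('"this | that" | "the | other" | yet | another'));
// prints: "this | that" OR "the | other" OR yet OR another
```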

0

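@Alan M, works nicely; escaping isn't necessary because SQLite's FTS capabilities are so sparse.

@epost, accepted solution for brevity and elegance, thanks. It merely needed to be put in a more general form for Unicode etc.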

(".+?"|[^\"\s]+)\s*\|\s*

0
My solution in C# was to count the quotes first and then use a regex to get the matches.
        // Count the number of quotes.
        var quotesOnly = Regex.Replace(searchText, @"[^""]", string.Empty);
        var quoteCount = quotesOnly.Length;
        if (quoteCount > 0)
        {
            // If the quote count is an odd number there's a missing quote.
            // Assume a quote is missing from the end - executive decision.
            if (quoteCount%2 == 1)
            {
                searchText += @"""";
            }

            // Get the matching groups of strings. Exclude the quotes themselves.
            // e.g. The following line:
            // "this and that" or then and "this or other"
            // will result in the following strings (after trimming and
            // dropping empty matches):
            // 1. this and that
            // 2. or then and
            // 3. this or other
            var matches = Regex.Matches(searchText, @"([^\""]*)", RegexOptions.Singleline);
            var list = new List<string>();
            foreach (var match in matches.Cast<Match>())
            {
                var value = match.Groups[0].Value.Trim();
                if (!string.IsNullOrEmpty(value))
                {
                    list.Add(value);
                }
            }

            // TODO: Do something with the list of strings.
       }
