How do I negate a bracketed character class in Perl regular expressions and grep?

Question


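I am trying to solve a very simple problem: from an array of strings, find the ones that contain only certain letters. However, I have run into behavior in regular expressions and/or grep that I don't understand.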

#!/usr/bin/perl

use warnings;
use strict;

my @test_data = qw(ant bee cat dodo elephant frog giraffe horse);

# Words wanted include these letters only. Hardcoded for demonstration purposes
my @wanted_letters = qw/a c d i n o t/;

# Subtract those letters from the alphabet to find the letters to eliminate.
# Interpolate array into a negated bracketed character class, positive grep
# against a list of the lowercase alphabet: fine, gets befghjklmpqrsuvwxyz.
my @unwanted_letters = grep(/[^@wanted_letters]/, ('a' .. 'z'));

# The desired result can be simulated by hardcoding the unwanted letters into a
# bracketed character class then doing a negative grep: matches ant, cat, and dodo.
my @works = grep(!/[befghjklmpqrsuvwxyz]/, @test_data);

# Doing something similar but moving the negation into the bracketed character
# class fails and matches everything.
my @fails1 = grep(/[^befghjklmpqrsuvwxyz]/, @test_data);

# Doing the same thing that produced the array of unwanted letters also fails.
my @fails2 = grep(/[^@unwanted_letters]/, @test_data);

print join ' ', @works; print "\n";
print join ' ', @fails1; print "\n";
print join ' ', @fails2; print "\n";

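Questions:

@works gives the correct result, so why doesn't @fails1? The grep function documentation suggests the former, while the negation section of perlrecharclass suggests the latter, although its example uses =~. Does this have something to do with using grep?
Why doesn't @fails2 work? Does it have something to do with array vs list context?
Apart from that, is there a pure regex way to do this that avoids the subtraction step?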

- Scott Martin

2 Answers


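You are matching something in the string that is outside the character set. But the string can still contain characters from the set elsewhere. For example, if the test word is elephant, the negated character class matches the character a.

If you want to test the whole string, you need to quantify the class and anchor it at both ends.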

grep(/^[^befghjklmpqrsuvwxyz]*$/, @test_data);

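In plain English, it is the difference between "the word contains no characters in the set" and "the word contains a character that is not in the set".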
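As a minimal sketch of that distinction, the snippet below runs the three variants against the single word elephant. The variable names ($has_outsider, $has_no_member, $all_outside) are made up for illustration and do not come from the question.

```perl
use strict;
use warnings;

my $word = 'elephant';

# "Contains a character not in the set": true, because 'a' (among others)
# is outside the listed letters.
my $has_outsider = $word =~ /[^befghjklmpqrsuvwxyz]/ ? 1 : 0;

# "Contains no characters in the set": false, because 'e' is in the set.
my $has_no_member = $word !~ /[befghjklmpqrsuvwxyz]/ ? 1 : 0;

# Anchored and quantified form: every character must be outside the set.
my $all_outside = $word =~ /^[^befghjklmpqrsuvwxyz]*$/ ? 1 : 0;

print "has_outsider=$has_outsider has_no_member=$has_no_member all_outside=$all_outside\n";
# prints: has_outsider=1 has_no_member=0 all_outside=0
```

The unanchored negated class is the only one of the three that accepts elephant, which is why @fails1 ends up containing every word.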

- Barmar


- dawg · Accepted Answer

Both failing cases are fixed by adding the anchors ^ and $ and the quantifier +.

Both of these work:

my @fails1 = grep(/^[^befghjklmpqrsuvwxyz]+$/, @test_data);
my @fails2 = grep(/^[^@unwanted_letters]+$/, @test_data);

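Remember that /[^befghjklmpqrsuvwxyz]/ or /[^@unwanted_letters]/ matches only a single character. Adding + matches as many consecutive characters as possible. Adding ^ and $ requires the match to span the entire string, from start to end.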

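Using /[@wanted_letters]/ returns a match if the string contains at least one wanted character, even if it also contains unwanted ones; logically this is equivalent to "any". Compare that with /^[@wanted_letters]+$/, where every letter must be in the set @wanted_letters, which is equivalent to "all".

Demo 1: only a single character is matched, so the grep fails.
Demo 2: a quantifier matches one or more characters, but there are no anchors, so the grep fails.
Demo 3: anchors and a quantifier give the expected result.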
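The three cases can be reproduced with the question's own data. This is only a sketch: the array names @any, @unanchored and @all are invented here to label the results.

```perl
use strict;
use warnings;

my @test_data      = qw(ant bee cat dodo elephant frog giraffe horse);
my @wanted_letters = qw/a c d i n o t/;

# Single character, no anchors: "any" wanted letter is enough.
my @any = grep { /[@wanted_letters]/ } @test_data;

# Quantifier but no anchors: still satisfied by one run of wanted letters.
my @unanchored = grep { /[@wanted_letters]+/ } @test_data;

# Anchors plus quantifier: "all" characters must be wanted letters.
my @all = grep { /^[@wanted_letters]+$/ } @test_data;

print "any:        @any\n";         # ant cat dodo elephant frog giraffe horse
print "unanchored: @unanchored\n";  # ant cat dodo elephant frog giraffe horse
print "all:        @all\n";         # ant cat dodo
```

Only bee is rejected by the first two patterns, because it is the only word with no wanted letter at all.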

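Once you understand that a character class matches only one character, that anchors tie the match to the whole string, and that the quantifier extends the match to everything between the anchors, you can grep directly with the wanted letters.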

my @wanted = grep(/^[@wanted_letters]+$/, @test_data);
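One detail worth noting: an array interpolated into a regex is joined with the list separator $" (a space by default), so [@wanted_letters] also contains a space. That is harmless for this data, but a possible variant, offered only as a sketch, builds the class body explicitly. The names $class and $only_wanted are hypothetical and do not appear in the original answer.

```perl
use strict;
use warnings;

my @test_data      = qw(ant bee cat dodo elephant frog giraffe horse);
my @wanted_letters = qw/a c d i n o t/;

# Build the class body explicitly; quotemeta guards against entries such as
# ']' or '^' that would otherwise be special inside a bracketed class.
my $class = join '', map { quotemeta($_) } @wanted_letters;

# Compile once; the anchors plus '+' require every character to be in the class.
my $only_wanted = qr/^[$class]+$/;

my @wanted = grep { $_ =~ $only_wanted } @test_data;

print "@wanted\n";    # ant cat dodo
```

Compiling the pattern with qr// once keeps the grep block short and avoids rebuilding the class for each word.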