How can I avoid the match failing? - preg_match_all('/"[\p{L}\p{Nd}а-яА-ЯёЁ -_\.\+]+"/ui', $outStr, $matches);
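Do you mean non-greedy matching, that is, finding the shortest match instead of the longest one? The quantifiers *, + and ? are greedy by default and match as many characters as they can. Adding a question mark after them makes them non-greedy.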
preg_match_all('/"[\p{L}\p{Nd}а-яА-ЯёЁ -_\.\+]+?"/ui', $outStr, $matches);
Greedy match:
"foo" and "bar"
^^^^^^^^^^^^^^^
Non-greedy match:
"foo" and "bar"
^^^^^
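The difference shown above can be reproduced with a short script. This is only a sketch: it uses a simplified pattern with `.` in place of the question's character class, and the variable names `$greedy` and `$lazy` are made up for the illustration.

```php
<?php
// Subject taken from the illustration above.
$subject = '"foo" and "bar"';

// Greedy: .+ runs to the end of the string, then backtracks to the LAST quote.
preg_match_all('/".+"/u', $subject, $greedy);

// Non-greedy: .+? stops at the FIRST closing quote it can reach.
preg_match_all('/".+?"/u', $subject, $lazy);

print_r($greedy[0]); // one match:   "foo" and "bar"
print_r($lazy[0]);   // two matches: "foo"  and  "bar"
```

The greedy form yields a single match spanning both quoted words, while the non-greedy form yields each quoted word on its own.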
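See: http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
U (PCRE_UNGREEDY)
This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by ?. It is not compatible with Perl. It can also be set by a (?U) modifier setting within the pattern or by a question mark behind a quantifier (e.g. .*?).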
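A small sketch of the inversion described above, again with a simplified `.+` pattern rather than the question's character class. Note that the uppercase `U` (ungreedy) is a different modifier from the lowercase `u` (UTF-8) used in the question. The variable names are illustrative only.

```php
<?php
$subject = '"foo" and "bar"';

// With the U modifier, a plain + behaves lazily...
preg_match('/".+"/U', $subject, $a);

// ...and +? becomes greedy again.
preg_match('/".+?"/U', $subject, $b);

// (?U) inside the pattern has the same effect as the trailing U.
preg_match('/(?U)".+"/', $subject, $c);

echo $a[0], "\n"; // "foo"
echo $b[0], "\n"; // "foo" and "bar"
echo $c[0], "\n"; // "foo"
```

Because `U` flips the meaning of every quantifier in the pattern, writing `+?` explicitly is usually easier to read and is also portable to Perl.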
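The (?U) flag means different things elsewhere. In PHP it turns on the PCRE_UNGREEDY regex compile flag, but in JDK7 it turns on the UNICODE_CHARACTER_CLASS regex compile flag, which makes character classes conform to the Unicode regex specification. That is something PHP already does by default (I believe!), because Perl does. Hmm, reading the pcrepattern manual makes me doubt that a little. It looks like it is only [\pL\pN_], which does not fully meet the RL1.2 requirement quoted above. Still, better than ASCII. - tchrist

The (?U) modifier is unique to PCRE (and derivatives such as PHP and R); it is not found in languages such as JavaScript, Python, or Perl. An earlier comment notes that it behaves completely differently in Java. - Adam Katz

/"[\p{L}\p{Nd}а-яА-ЯёЁ -_\.\+]+"/ui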
/"[\pL\p{Nd}а-яА-ЯёЁ -_.+]+"/ui
With \x{⋯} escapes:
/"[\pL\p{Nd}\x{430}-\x{44F}\x{410}-\x{42F}\x{451}\x{401} -_.+]+"/ui
With named characters:
/"[\pL\p{Nd}\N{CYRILLIC SMALL LETTER A}-\N{CYRILLIC SMALL LETTER YA}\N{CYRILLIC CAPITAL LETTER A}-\N{CYRILLIC CAPITAL LETTER YA}\N{CYRILLIC SMALL LETTER IO}\N{CYRILLIC CAPITAL LETTER IO} -_.+]+"/ui
The first was produced with uniquote -x, the second with uniquote -v.
U+0410 ‹А› \N{CYRILLIC CAPITAL LETTER A}
U+0430 ‹а› \N{CYRILLIC SMALL LETTER A}
U+0401 ‹Ё› \N{CYRILLIC CAPITAL LETTER IO}
U+0451 ‹ё› \N{CYRILLIC SMALL LETTER IO}
for:
U+0041 ‹A› \N{LATIN CAPITAL LETTER A}
U+0061 ‹a› \N{LATIN SMALL LETTER A}
U+00CB ‹Ë› \N{LATIN CAPITAL LETTER E WITH DIAERESIS}
U+00EB ‹ë› \N{LATIN SMALL LETTER E WITH DIAERESIS}
/"[\pL\p{Nd} -_.+]+"/ui
/"[\pL\p{Nd} -_.+]+"/u
Replace + with its corresponding minimal version, +?:
/"[\pL\p{Nd} -_.+]+?"/u
I am also concerned about the range [ -_], that is, \p{SPACE}-\p{LOW LINE}.
I find that a very odd range. It means the space character or any of the following: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
% unichars -g '\p{ASCII}' '[\pS\pP]' 'ord() < ord(" ") || ord() > ord("_")'
` U+0060 GC=Sk GRAVE ACCENT
{ U+007B GC=Ps LEFT CURLY BRACKET
| U+007C GC=Sm VERTICAL LINE
} U+007D GC=Pe RIGHT CURLY BRACKET
~ U+007E GC=Sm TILDE
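The effect of the `[ -_]` range can be checked by brute force over printable ASCII. The following is a rough sketch; `$inRange` and `$outside` are names invented for the illustration.

```php
<?php
// Sort printable ASCII into: accepted by [ -_], or symbol/punctuation outside it.
$inRange = '';
$outside = '';
for ($cp = 0x20; $cp <= 0x7E; $cp++) {
    $ch = chr($cp);
    if (preg_match('/^[ -_]$/', $ch)) {
        $inRange .= $ch;              // U+0020 through U+005F
    } elseif (preg_match('/^[\pS\pP]$/u', $ch)) {
        $outside .= $ch;              // ASCII symbols and punctuation the range misses
    }
}

echo $inRange, "\n"; // space, digits, capitals, and most punctuation, including "
echo $outside, "\n"; // `{|}~
```

Two things stand out: the range contains the double quote itself, so the class can run straight through a closing quote, and it leaves out the five ASCII symbols listed by unichars above.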
/"[\pL\p{Nd}\s\pS\pP]+?"/u
U+0401 ‹Ё› \N{CYRILLIC CAPITAL LETTER IO}
U+0451 ‹ё› \N{CYRILLIC SMALL LETTER IO}
NFD("\N{CYRILLIC CAPITAL LETTER IO}") => "\N{CYRILLIC CAPITAL LETTER IE}\N{COMBINING DIAERESIS}"
NFD("\N{CYRILLIC SMALL LETTER IO}") => "\N{CYRILLIC SMALL LETTER IE}\N{COMBINING DIAERESIS}"
% uniprops "COMBINING DIAERESIS"
U+0308 ‹◌̈› \N{COMBINING DIAERESIS}
\w \pM \p{Mn}
All Any Assigned InCombiningDiacriticalMarks Case_Ignorable CI Combining_Diacritical_Marks Dia Diacritic M Mn Gr_Ext Grapheme_Extend Graph GrExt ID_Continue IDC Inherited Zinh Mark Nonspacing_Mark Print Qaai Word XID_Continue XIDC
/"[\pL\pM\p{Nd}\s\pS\pP]+?"/u
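To see why \pM matters, here is a minimal sketch that feeds both patterns a quoted word whose ё is stored in decomposed form. The test string is made up for the illustration, and the `\u{...}` escapes assume PHP 7 or later.

```php
<?php
// A quoted "ёж" with ё written in NFD form:
// U+0435 CYRILLIC SMALL LETTER IE followed by U+0308 COMBINING DIAERESIS.
$decomposed = "\"\u{0435}\u{0308}\u{0436}\"";

// Without \pM the combining mark is not in the class, so no match is possible.
$withoutMarks = preg_match('/"[\pL\p{Nd}\s\pS\pP]+?"/u', $decomposed);

// With \pM the mark is accepted and the quoted word matches.
$withMarks = preg_match('/"[\pL\pM\p{Nd}\s\pS\pP]+?"/u', $decomposed);

var_dump($withoutMarks); // int(0)
var_dump($withMarks);    // int(1)
```

Text that has been normalized to NFC would match either pattern, so the difference only shows up when input arrives decomposed.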
/"(?:(?=[\p{Latin}\p{Cyrillic}])[\pL\pM\p{Nd}\s\pS\pP])+?"/u
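Common is needed to get the digits and the assorted punctuation and symbols, and Inherited is needed to handle the combining marks that follow letters. That brings us to this: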
/"(?:(?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])[\pL\pM\p{Nd}\s\pS\pP])+?"/u
/"(?:(?!")(?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])[\pL\pM\p{Nd}\s\pS\pP])+"/u
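A quick way to try the last pattern is to run it over a string that mixes scripts. The sample subject below is invented for the illustration, and the script file is assumed to be saved as UTF-8.

```php
<?php
// The pattern from the line above: each character must not be a quote,
// must belong to one of four scripts, and must be a letter, mark,
// decimal digit, whitespace, symbol, or punctuation character.
$pattern = '/"(?:(?!")(?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])'
         . '[\pL\pM\p{Nd}\s\pS\pP])+"/u';

// Latin, Cyrillic with digits, and Greek quoted strings.
$subject = '"foo" "привет 42" "λόγος"';

preg_match_all($pattern, $subject, $m);
print_r($m[0]); // "foo" and "привет 42"; the Greek string is skipped
```

Because the `(?!")` lookahead keeps the repeated group from ever consuming a quote, the greedy `+` here stops at the first closing quote, which is why this version no longer needs `+?`.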
/
    "               # literal double quote
    (?:
        ### This group specifies a single char with
        ### three separate constraints:

        # Constraint 1: next char must NOT be a double quote
        (?!")

        # Constraint 2: next char must be from one of these four scripts
        (?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])

        # Constraint 3: match one of either Letter, Mark, Decimal Number,
        #               whitespace, Symbol, or Punctuation:
        [\pL\pM\p{Nd}\s\pS\pP]

    )               # end constraint group
    +               # repeat entire group 1 or more times
    "               # and finally match another double-quote
/ux
Written with m{⋯}xu instead:
m{
    "               # literal double quote
    (?:
        ### This group specifies a single char with
        ### three separate constraints:

        # Constraint 1: next char must NOT be a double quote
        (?!")

        # Constraint 2: next char must be from one of these four scripts
        (?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])

        # Constraint 3: match one of either Letter, Mark, Decimal Number,
        #               whitespace, Symbol, or Punctuation:
        [\pL\pM\p{Nd}\s\pS\pP]

    )               # end constraint group
    +               # repeat entire group 1 or more times
    "               # and finally match another double-quote
}ux
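*, +, ? and {n,m} are the "maximal" quantifiers; *?, +?, ?? and {n,m}? are the "minimal" ones; and *+, ++ and {n,m}+ are the "possessive" ones. For completeness I would add ?+, but it does not change its function: think it through. - tchrist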
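The three families can be compared side by side. This is a sketch with simplified patterns, not the question's character class, and the variable names are made up for the illustration.

```php
<?php
$subject = '"foo" and "bar"';

// Maximal: .* takes everything, then backtracks to the last quote.
preg_match('/".*"/', $subject, $maximal);

// Minimal: .*? grows one character at a time until a quote follows.
preg_match('/".*?"/', $subject, $minimal);

// Possessive: .*+ takes everything and never gives any of it back,
// so the closing quote can never match and the whole pattern fails.
$possessive = preg_match('/".*+"/', $subject);

// Possessive works well when the class itself excludes the terminator.
preg_match('/"[^"]*+"/', $subject, $safe);

echo $maximal[0], "\n"; // "foo" and "bar"
echo $minimal[0], "\n"; // "foo"
var_dump($possessive);  // int(0)
echo $safe[0], "\n";    // "foo"
```

The possessive form is mainly a performance tool: it changes what matches only when backtracking into the quantified part would otherwise have been needed.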