How to find all spaces, except those inside quotes?

Question

7

I need to split a string on spaces, but phrases inside quotes should stay unsplit. For example:

  word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5

After preg_split, this should produce the array:

array(
 [0] => 'word1',
 [1] => 'word2',
 [2] => 'this is a phrase',
 [3] => 'word3',
 [4] => 'word4',
 [5] => 'this is a second phrase',
 [6] => 'word5'
)

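How should I write the regular expression to do this?

P.S. There is a related question, but I don't think it works in my case. The accepted answer there provides a regex that finds words rather than whitespace.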

- altern

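That related question looks like exactly what you want to do, based on the examples you both gave. Have you tried the accepted answer? What happened? - richsage

Yes, I tried it. I'm using PHP, not .NET, so I can't use inline filtering of the regex results. And, as I said, \w+|"[\w\s]*" doesn't work for me either. - altern

5 Answers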

0

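Assuming your quotes are well formed, i.e. they come in pairs, you can explode the string on the quote character and loop over every second field. For example: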

$str = "word1 word2 \"this is a phrase\" word3 word4 \"this is a second phrase\" word5 word6 \"lastword\"";
print $str ."\n";
$s = explode('"',$str);
for($i=1;$i<count($s);$i+=2){
    if ( strpos($s[$i] ," ")!==FALSE) {
        print "Spaces found: $s[$i]\n";
    }
}

Output

$ php test.php
Spaces found: this is a phrase
Spaces found: this is a second phrase

No need for a complex regex.

- ghostdog74

Of course I could do this without a regex, but that is not my case. - altern

0

Using the regex from the other question you linked, this is pretty easy, isn't it?

<?php

$string = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';

preg_match_all( '/(\w+|"[\w\s]*")+/' , $string , $matches );

print_r( $matches[1] );

?>

Output:

Array
(
     [0] => word1
     [1] => word2
     [2] => "this is a phrase"
     [3] => word3
     [4] => word4
     [5] => "this is a second phrase"
     [6] => word5
)

- edds

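What about special characters (e.g. the '&' symbol)? Should they be found as well? And '&' is not the only symbol that would go unhandled. Also, different symbols need different handling. For example, if I encounter curly braces, I need to include them in the search results. - altern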

1

@altern, well, I'm sure edds won't mind if you adjust his example to your own needs... - Bart Kiers

0

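Anyone want to benchmark tokenizing versus regex? My guess is that the explode() function is a little too heavy for any speed benefit. Nonetheless, here is another method. (Edited because I forgot the else case for storing the quoted string.)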

$str = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';

// initialize storage array
$arr = array();
// initialize count
$count = 0;
// split on quote
$tok = strtok($str, '"');
while ($tok !== false) {
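    // even-numbered tokens lie outside quotes, odd-numbered ones inside
    // (assumes the string does not begin with a quote, since strtok skips empty tokens)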
    $arr = ($count % 2 == 0) ? 
                               array_merge($arr, explode(' ', trim($tok))) :
                               array_merge($arr, array(trim($tok)));
    $tok = strtok('"');
    ++$count;
}

// output results
var_dump($arr);

- Corey Ballou

0

$test = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';
preg_match_all( '/([^"\s]+)|("([^"]+)")/', $test, $matches);
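
The snippet above stops at the match call, so here is a minimal sketch of how the tokens might be collected from it. In that pattern, group 1 holds a bare word and group 3 holds the text between double quotes. The helper name tokenize_quoted is made up for illustration and is not part of the answer.

```php
<?php
// Hypothetical helper (not from the answer): collect tokens using the pattern above.
function tokenize_quoted(string $input): array
{
    // PREG_SET_ORDER yields one array per match instead of one array per group.
    preg_match_all('/([^"\s]+)|("([^"]+)")/', $input, $matches, PREG_SET_ORDER);

    $tokens = array();
    foreach ($matches as $m) {
        // Group 3 is filled only when the quoted alternative matched;
        // otherwise the bare word sits in group 1.
        $tokens[] = (isset($m[3]) && $m[3] !== '') ? $m[3] : $m[1];
    }
    return $tokens;
}

$test = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';
print_r(tokenize_quoted($test));
```

For the sample string this yields the seven tokens from the question, with the double quotes already stripped. Single quotes are not treated specially by this pattern.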

- Amarghosh

- altern · Accepted Answer

Found a solution with the help of user MizardX in the #regex IRC channel (irc.freenode.net). It even supports single quotes.

$str= 'word1 word2 \'this is a phrase\' word3 word4 "this is a second phrase" word5 word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';

$regexp = '/\G(?:"[^"]*"|\'[^\']*\'|[^"\'\s]+)*\K\s+/';

$arr = preg_split($regexp, $str);

print_r($arr);

The result is:

Array (
    [0] => word1
    [1] => word2
    [2] => 'this is a phrase'
    [3] => word3
    [4] => word4
    [5] => "this is a second phrase"
    [6] => word5
    [7] => word1
    [8] => word2
    [9] => "this is a phrase"
    [10] => word3
    [11] => word4
    [12] => "this is a second phrase"
    [13] => word5  
)

Note: the only drawback is that this regex works only with PCRE 7.
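
For readers unfamiliar with \G and \K, the pattern can be read piece by piece. The sketch below is the same regex rewritten with the x (extended) modifier so that each part can carry a comment. The shortened input string is only an illustration.

```php
<?php
// The same pattern, spread out with the x modifier so each part can be annotated.
$regexp = '/
    \G                  # start where the previous match ended (or at offset 0)
    (?:                 # consume any run of tokens that directly precede the whitespace:
        "[^"]*"         #   a double-quoted phrase,
      | \'[^\']*\'      #   a single-quoted phrase,
      | [^"\'\s]+       #   or a bare word
    )*
    \K                  # forget everything matched so far
    \s+                 # so only this whitespace is reported as the delimiter
/x';

$str = 'word1 "this is a phrase" word2 \'another phrase\' word3';
print_r(preg_split($regexp, $str));
```

Because \G pins every match to the end of the previous one, the scan never starts in the middle of a quoted phrase, and because of \K the quoted phrase itself is not part of the delimiter that preg_split removes.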

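It turned out that my production server has no PCRE 7 support; only PCRE 6 is installed there. A regex that works on it, although less flexible than the PCRE 7 one above, is the following (with \G and \K removed):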

/(?:"[^"]*"|\'[^\']*\'|[^"\'\s]+)+/

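For the given input, the result is the same as above. Note that this pattern matches the tokens themselves rather than the whitespace between them, so it belongs in preg_match_all rather than preg_split.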
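
A minimal sketch of how the fallback pattern might be used, assuming preg_match_all is the intended call. The quote-stripping step at the end is an optional addition (not part of the answer) for anyone who wants the exact array shown in the question.

```php
<?php
$str = 'word1 word2 \'this is a phrase\' word3 word4 "this is a second phrase" word5';

// The fallback pattern matches the tokens themselves, not the whitespace between them.
$regexp = '/(?:"[^"]*"|\'[^\']*\'|[^"\'\s]+)+/';
preg_match_all($regexp, $str, $matches);
$tokens = $matches[0];

// Optional step (an assumption about the desired output): drop one pair of
// matching surrounding quotes so the result looks like the array in the question.
$unquoted = array();
foreach ($tokens as $token) {
    $unquoted[] = preg_replace('/^(["\'])(.*)\1$/s', '$2', $token);
}

print_r($tokens);
print_r($unquoted);
```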