Using preg_match with UTF-8 in PHP

Question

48

I'm trying to search a UTF-8 encoded string using preg_match.

preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];

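This should print 1, because "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems the subject isn't being treated as a UTF-8 encoded string, even though I'm passing the "u" modifier in the regular expression.

I have the following settings in my php.ini, and other UTF-8 functions work: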

mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off

Any ideas?

- JW.

1

See also: https://dev59.com/LXI95IYBdhLWcg3wtwVe - Artefacto

9 Answers

29

Try adding (*UTF8) before the regex:

preg_match('(*UTF8)/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);

Thanks to a commenter at https://www.php.net/manual/function.preg-match.php#95828 for this bit of magic.

- Natxet

4

Interesting, though I think the initial / needs to come before (*UTF8). It doesn't work on my system, but it may work on others. What does it output when you run echo $a_matches[0][1];? - JW.

2

I use it like this on PHP 5.4.29 and it works great: preg_match_all('/(*UTF8)[^A-Za-z0-9\s]/', $txt, $matches); - Novalis

5

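This doesn't work on either PHP 5.6 or PHP 7 on Ubuntu 16.04. Putting (*UTF8) before the delimiter is an error, and putting it after has no effect. I suspect it depends on how and where you got PHP, and in particular on how libpcre* was compiled. - user2609094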

2

It doesn't change the offset for me, but it's an interesting thing to know. The original documentation for this "feature" is at http://www.pcre.org/pcre.txt - BurninLeo

24

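It looks like this is a "feature", see http://bugs.php.net/bug.php?id=37391

The 'u' option only has meaning for PCRE; PHP itself is unaware of it.

From PHP's point of view, strings are byte sequences, so returning a byte offset seems logical (I'm not saying "correct").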

- user187291

4

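OK, so they don't provide an mb_preg_replace function. - JW.

Note that the same "rule" about UTF-8 handling also applies to the fifth parameter, $offset, which is counted in bytes as well. Example: var_dump(preg_match('/#/u', "\xc3\xa4#",$matches,0,2)); - AthanasiusKirchner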
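
To illustrate the comment above, here is a minimal sketch that turns a character offset into the byte offset preg_match expects as its fifth argument. The helper name charToByteOffset is hypothetical, and the sketch assumes the mbstring extension is loaded and that mbstring.func_overload is not in effect (a value of 7, as in the question's php.ini, replaces strlen and substr with their mb_ counterparts).

```php
<?php
// Hypothetical helper: convert a character offset into the byte offset
// that preg_match() expects for its $offset argument.
function charToByteOffset(string $subject, int $charOffset): int
{
    // mb_substr() cuts by characters; strlen() counts the bytes of that prefix.
    return strlen(mb_substr($subject, 0, $charOffset, 'UTF-8'));
}

$subject    = "\xC3\xA4#";                    // "ä#": "ä" is 2 bytes, "#" is 1 byte
$byteOffset = charToByteOffset($subject, 1);  // character 1 is "#"

$found = preg_match('/#/u', $subject, $matches, PREG_OFFSET_CAPTURE, $byteOffset);
var_dump($byteOffset, $found, $matches[0][1]); // the captured offset is in bytes too
```

The captured offset stays relative to the start of the subject, not to the starting offset.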

1

PHP is aware of the u modifier; it is listed in the manual, see "u (PCRE_UTF8)": http://php.net/manual/en/reference.pcre.pattern.modifiers.php - Walt Sorensen

9

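Sorry for reviving an old thread, but someone may find the following code useful. It can stand in for both preg_match and preg_match_all, and for UTF-8 encoded strings it returns the correct matches with offsets counted in characters.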

     mb_internal_encoding('UTF-8');

     /**
     * Returns array of matches in same format as preg_match or preg_match_all
     * @param bool   $matchAll If true, execute preg_match_all, otherwise preg_match
     * @param string $pattern  The pattern to search for, as a string.
     * @param string $subject  The input string.
     * @param int    $offset   The place from which to start the search (in bytes).
     * @return array
     */
    function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
    {
        $matchInfo = array();
        $method    = 'preg_match';
        $flag      = PREG_OFFSET_CAPTURE;
        if ($matchAll) {
            $method .= '_all';
        }
        $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
        $result = array();
        if ($n !== 0 && !empty($matchInfo)) {
            if (!$matchAll) {
                $matchInfo = array($matchInfo);
            }
            foreach ($matchInfo as $matches) {
                $positions = array();
                foreach ($matches as $match) {
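                    $matchedText = $match[0];
                    // preg_* reports the position of the match in bytes
                    $byteOffset  = $match[1];
                    $positions[] = array(
                        $matchedText,
                        // count the characters in the prefix that ends at the byte offset
                        mb_strlen(mb_strcut($subject, 0, $byteOffset))
                    );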
                }
                $result[] = $positions;
            }
            if (!$matchAll) {
                $result = $result[0];
            }
        }
        return $result;
    }

    $s1 = 'Попробуем русскую строку для теста';
    $s2 = 'Try english string for test';

    var_dump(pregMatchCapture(true, '/обу/', $s1));
    var_dump(pregMatchCapture(false, '/обу/', $s1));

    var_dump(pregMatchCapture(true, '/lish/', $s2));
    var_dump(pregMatchCapture(false, '/lish/', $s2));

Output of my example:

    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(6) "обу"
          [1]=>
          int(4)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(6) "обу"
        [1]=>
        int(4)
      }
    }
    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(4) "lish"
          [1]=>
          int(7)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(4) "lish"
        [1]=>
        int(7)
      }
    }

- Guy Fawkes

2

Could you explain what your code does rather than just pasting a pile of code? How does this answer the question? - nhahtdh

2

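It does exactly what the comments describe and returns the correct string offsets, which is the subject of the question. I don't know why my answer is at -2. It works for me. - Guy Fawkes

That's why you should include an explanation of what your code does. People don't understand what you are trying to do here. - nhahtdh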

1

I've edited my answer and added tests. - Guy Fawkes

1

An old "necro" comment, but still useful! Thanks @GuyFawkes, this helped with a code mess I'm working on. Cheers, jz. - J.Z.

3

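You can compute the real UTF-8 offset by cutting the string at the offset returned by preg_match with the byte-counting substr, then measuring that prefix with the character-counting mb_strlen.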

$utf8Offset = mb_strlen(substr($text, 0, $offsetFromPregMatch), 'UTF-8');
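
As a rough sketch, the one-liner above could be wrapped in a small helper and checked against the string from the question. The function name byteToCharOffset is hypothetical; the code assumes mbstring is loaded and mbstring.func_overload is not in effect, so that substr still cuts by bytes.

```php
<?php
// Hypothetical wrapper around the one-liner above.
function byteToCharOffset(string $text, int $byteOffset): int
{
    // substr() cuts by bytes; mb_strlen() counts the characters in that prefix.
    return mb_strlen(substr($text, 0, $byteOffset), 'UTF-8');
}

$text = "\xC2\xA1Hola!"; // "¡Hola!"
preg_match('/H/u', $text, $m, PREG_OFFSET_CAPTURE);

echo $m[0][1], "\n";                          // byte offset: 2
echo byteToCharOffset($text, $m[0][1]), "\n"; // character offset: 1
```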

- fracz

1

If all you want is the multibyte-safe position of H, try mb_strpos().

mb_internal_encoding('UTF-8');
$str = "\xC2\xA1Hola!";
$pos = mb_strpos($str, 'H');
echo $str."\n";
echo $pos."\n";
echo mb_substr($str,$pos,1)."\n";

Output:

¡Hola!
1
H

- velcrow

That was just a simplified example, but this may be useful to others. - JW.

1

I wrote a small class that converts the offsets returned by preg_match into proper UTF-8 character offsets:

final class NonUtfToUtfOffset
{
    /** @var int[] */
    private $utfMap = [];

    public function __construct(string $content)
    {
        $contentLength = mb_strlen($content);

        for ($offset = 0; $offset < $contentLength; $offset ++) {
            $char = mb_substr($content, $offset, 1);
            $nonUtfLength = strlen($char);

            for ($charOffset = 0; $charOffset < $nonUtfLength; $charOffset ++) {
                $this->utfMap[] = $offset;
            }
        }
    }

    public function convertOffset(int $nonUtfOffset): int
    {
        return $this->utfMap[$nonUtfOffset];
    }
}

You can use it like this:

$content = 'aą bać d';
$offsetConverter = new NonUtfToUtfOffset($content);

preg_match_all('#(bać)#ui', $content, $m, PREG_OFFSET_CAPTURE);

foreach ($m[1] as [$word, $offset]) {
    echo "bad: " . mb_substr($content, $offset, mb_strlen($word))."\n";
    echo "good: " . mb_substr($content, $offsetConverter->convertOffset($offset), mb_strlen($word))."\n";
}

https://3v4l.org/8Y32J

- bronek89

1

You might want to take a look at the T-Regx library.

pattern('/Hola/u')->match("\xC2\xA1Hola!")->first(function (Match $match) 
{
    echo $match->offset();     // characters
    echo $match->byteOffset(); // bytes
});

Here $match->offset() is the UTF-8 safe offset.

- Danon

0

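My problem was solved by using plain substr instead of the expected mb_substr (PHP 7.4).

Combining mb_substr with preg_match_all / PREG_OFFSET_CAPTURE (with or without the /u modifier) gave wrong positions whenever the text contained the euro sign (€).

iconv and utf8_encode didn't help either, and I couldn't use htmlentities.

Simply going back to plain substr fixed it, and it handles € and other characters correctly.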
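
A small sketch of why this works, using a made-up sample string: the offsets from PREG_OFFSET_CAPTURE are byte offsets, so they line up with the byte-based substr and strlen, while mb_substr reads the same number as a character index. It assumes mbstring is loaded and mbstring.func_overload is not in effect.

```php
<?php
$text = "Price: 5 € or 7 €"; // hypothetical sample; "€" is 3 bytes in UTF-8
preg_match_all('/\d+ €/u', $text, $m, PREG_OFFSET_CAPTURE);

$results = [];
foreach ($m[0] as [$match, $byteOffset]) {
    // Byte offset with byte length: consistent with what preg_* reported.
    $viaSubstr = substr($text, $byteOffset, strlen($match));
    // The same byte offset passed to mb_substr() is read as a character index.
    $viaMbSubstr = mb_substr($text, $byteOffset, mb_strlen($match, 'UTF-8'), 'UTF-8');
    $results[] = [$viaSubstr === $match, $viaMbSubstr === $match];
}
var_dump($results);
```

The first match happens to come out right both ways because no multibyte character precedes it; only the second match, which follows a "€", is cut wrongly by mb_substr.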

- Mike

- Gumbo · Accepted Answer

52

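Although the u modifier makes both the pattern and the subject be interpreted as UTF-8, the captured offsets are still counted in bytes.

You can use mb_strlen to get the length in UTF-8 characters rather than bytes: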

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo mb_strlen(substr($str, 0, $a_matches[0][1]));
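
The same idea can be extended to preg_match_all. The sketch below uses a made-up sample string and, as a design choice of this example rather than anything from the answer, counts only the characters between consecutive matches instead of re-measuring the whole prefix each time. It assumes mbstring is loaded and mbstring.func_overload is not in effect.

```php
<?php
$str = "¡Hola! ¿Qué tal, Héctor?"; // hypothetical sample string
preg_match_all('/\p{Lu}/u', $str, $m, PREG_OFFSET_CAPTURE);

$charOffsets = [];
$lastByte = 0; // byte offset of the previous match
$lastChar = 0; // character offset of the previous match
foreach ($m[0] as [$letter, $byteOffset]) {
    // Matches start on character boundaries, so this slice is valid UTF-8.
    $lastChar += mb_strlen(substr($str, $lastByte, $byteOffset - $lastByte), 'UTF-8');
    $lastByte = $byteOffset;
    $charOffsets[] = [$letter, $lastChar];
}
print_r($charOffsets);
```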

- Gumbo

3

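"The u modifier is only for interpreting the pattern as UTF-8, not the subject." This is not correct. Compare, for example, preg_split('//', .) with preg_split('//u', .). Since "x is interpreted as UTF-8" is somewhat vague, see this for the actual effects of Unicode mode. - Artefacto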

2

According to http://nl1.php.net/manual/en/reference.pcre.pattern.modifiers.php#103348, the *u* modifier affects both the pattern and the input. - Lode

1

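@tomalak and those who come after: of course PHP doesn't handle Unicode if you use old functions like substr, strlen, etc., because they work on bytes. But with the mbstring extension (enabled by default in many distributions and servers), it has handled Unicode fully for a long time. This was a choice made to preserve backward compatibility. - Daniel-KM

I haven't had any problems using UTF-8 in PHP in the 4-5 years since I started converting all my old websites to Unicode. - TheStoryCoder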

4

"Dude, it's 2019 now and PHP is still terrible at Unicode." Please confirm. - Pathros