将多字节字符串截断为 n 个字符

8

我正在尝试让这个字符串过滤器方法正常工作:

public function truncate($string, $chars = 50, $terminator = ' …');

I'd expect this

$in  = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890";
$out = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …";

并且还有这个

$in  = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ";
$out = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …";

这是$chars减去$terminator字符串的字符数。

此外,过滤器应该在$chars限制下第一个单词边界处截断,例如:

$in  = "Answer to the Ultimate Question of Life, the Universe, and Everything.";
$out = "Answer to the Ultimate Question of Life, the …";

我非常确定这些步骤可以实现

  • 从最大字符数中减去终止符的字符数
  • 验证字符串是否长于计算出的限制,如果是则返回未更改的字符串
  • 在计算出的限制以下找到字符串中的最后一个空格字符以获取单词边界
  • 在最后一个空格或计算出的限制处截断字符串,如果没有找到最后一个空格,则截取到计算出的限制
  • 将终止符附加到字符串
  • 返回字符串

然而,我已经尝试了各种组合的str*mb_*函数,但结果都不正确。这不可能那么难,所以我显然漏掉了什么。能否有人分享一个可行的实现方法,或者指导我如何做才能理解。

谢谢

P.S. 是的,在此之前我已经检查过https://stackoverflow.com/search?q=truncate+string+php :)


你可能会发现s($str)->truncateSafely(50)很有用,它可以在这个独立库中找到。 - caw
4个回答

5

刚刚发现PHP已经有了多字节截断函数:

虽然它不遵守单词边界,但还是很方便的!


3

试试这个:

function truncate($string, $chars = 50, $terminator = ' …') {
    $cutPos = $chars - mb_strlen($terminator);
    $boundaryPos = mb_strrpos(mb_substr($string, 0, mb_strpos($string, ' ', $cutPos)), ' ');
    return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
}

但是您需要确保内部编码设置正确。


哎呀,我试图让它从ISO-8859-1工作。现在改成UTF-8了。谢谢Gumbo。我接受这个作为正确答案,因为它包含了我所缺少的唯一东西。 - Gordon
可能存在一个漏洞。我得到了警告:mb_strpos():偏移量不包含在字符串中.... - snowflake

0

我通常不喜欢只是编写整个答案来回答这样的问题。但是我刚醒来,我想也许你的问题会让我心情愉快地去编程一整天。

我没有尝试运行这个代码,但它应该可以工作,或者至少能让你完成90%的工作。

function truncate( $string, $chars = 50, $terminate = ' ...' )
{
    $chars -= mb_strlen($terminate);
    if ( $chars <= 0 )
        return $terminate;

    $string = mb_substr($string, 0, $chars);
    $space = mb_strrpos($string, ' ');

    if ($space < mb_strlen($string) / 2)
        return $string . $terminate;
    else
        return mb_substr($string, 0, $space) . $terminate;
}

0

简而言之;

  • 足够短的字符串不应该加上省略号。
  • 换行符也应该作为限定断点。
  • 正则表达式一旦被分解和解释,就不会太可怕。

我认为有一些关于这个问题和当前答案的重要事情需要指出。我将演示一下答案的比较,以及基于Gordon的样本数据和一些其他情况的正则表达式答案,以揭示一些不同的结果。

首先,澄清输入值的质量。Gordon说函数需要支持多字节并且尊重单词边界。样本数据没有暴露非空格、非单词字符(例如标点符号)在确定截断位置时所需的处理方式,因此我们必须假设针对空格字符是足够的——而且明智的是,因为大多数“阅读更多”字符串在截断时不会担心尊重标点符号。

其次,有相当常见的情况需要对包含换行符的大段文本应用省略号。

第三,让我们随意地同意一些基本的数据标准化,例如:

  • 字符串已经修剪掉所有前导/尾随空白字符
  • $chars的值将始终大于$terminatormb_strlen()

(演示)

函数:

function truncateGumbo($string, $chars = 50, $terminator = ' …') {
    $cutPos = $chars - mb_strlen($terminator);
    $boundaryPos = mb_strrpos(mb_substr($string, 0, mb_strpos($string, ' ', $cutPos)), ' ');
    return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
}

function truncateGordon($string, $chars = 50, $terminator = ' …') {
    return mb_strimwidth($string, 0, $chars, $terminator);
}

function truncateSoapBox($string, $chars = 50, $terminate = ' …')
{
    $chars -= mb_strlen($terminate);
    if ( $chars <= 0 )
        return $terminate;

    $string = mb_substr($string, 0, $chars);
    $space = mb_strrpos($string, ' ');

    if ($space < mb_strlen($string) / 2)
        return $string . $terminate;
    else
        return mb_substr($string, 0, $space) . $terminate;
}

function truncateMickmackusa($string, $max = 50, $terminator = ' …') {
    $trunc = $max - mb_strlen($terminator, 'UTF-8');
    return preg_replace("~(?=.{{$max}})(?:\S{{$trunc}}|.{0,$trunc}(?=\s))\K.+~us", $terminator, $string);
}

测试用例:

$tests = [
    [
        'testCase' => "Answer to the Ultimate Question of Life, the Universe, and Everything.",
        // 50th char ---------------------------------------------------^
        'expected' => "Answer to the Ultimate Question of Life, the …",
    ],
    [
        'testCase' => "A single line of text to be followed by another\nline of text",
        // 50th char ----------------------------------------------------^
        'expected' => "A single line of text to be followed by another …",
    ],
    [
        'testCase' => "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ",
        // 50th char ---------------------------------------------------^
        'expected' => "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …",
    ],
    [
        'testCase' => "123456789 123456789 123456789 123456789 123456789",
        // 50th char doesn't exist -------------------------------------^
        'expected' => "123456789 123456789 123456789 123456789 123456789",
    ],
    [
        'testCase' => "Hello worldly world",
        // 50th char doesn't exist -------------------------------------^
        'expected' => "Hello worldly world",
    ],
    [
        'testCase' => "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890",
        // 50th char ---------------------------------------------------^
        'expected' => "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …",
    ],
];

执行:

foreach ($tests as ['testCase' => $testCase, 'expected' => $expected]) {
    echo "\tSample Input:\t\t$testCase\n";
    echo "\n\ttruncateGumbo:\t\t" , truncateGumbo($testCase);
    echo "\n\ttruncateGordon:\t\t" , truncateGordon($testCase);
    echo "\n\ttruncateSoapBox:\t" , truncateSoapBox($testCase);
    echo "\n\ttruncateMickmackusa:\t" , truncateMickmackusa($testCase);
    echo "\n\tExpected Result:\t{$expected}";
    echo "\n-----------------------------------------------------\n";
}

输出:

    Sample Input:           Answer to the Ultimate Question of Life, the Universe, and Everything.

    truncateGumbo:          Answer to the Ultimate Question of Life, the …
    truncateGordon:         Answer to the Ultimate Question of Life, the Uni …
    truncateSoapBox:        Answer to the Ultimate Question of Life, the …
    truncateMickmackusa:    Answer to the Ultimate Question of Life, the …
    Expected Result:        Answer to the Ultimate Question of Life, the …
-----------------------------------------------------
    Sample Input:           A single line of text to be followed by another
line of text

    truncateGumbo:          A single line of text to be followed by …
    truncateGordon:         A single line of text to be followed by another
 …
    truncateSoapBox:        A single line of text to be followed by …
    truncateMickmackusa:    A single line of text to be followed by another …
    Expected Result:        A single line of text to be followed by another …
-----------------------------------------------------
    Sample Input:           âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ

    truncateGumbo:          âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
    truncateGordon:         âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
    truncateSoapBox:        âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
    truncateMickmackusa:    âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
    Expected Result:        âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
-----------------------------------------------------
    Sample Input:           123456789 123456789 123456789 123456789 123456789

    truncateGumbo:          123456789 123456789 123456789 123456789 12345678 …
    truncateGordon:         123456789 123456789 123456789 123456789 123456789
    truncateSoapBox:        123456789 123456789 123456789 123456789 …
    truncateMickmackusa:    123456789 123456789 123456789 123456789 123456789
    Expected Result:        123456789 123456789 123456789 123456789 123456789
-----------------------------------------------------
    Sample Input:           Hello worldly world

    truncateGumbo:          
Warning: mb_strpos(): Offset not contained in string in /in/ibFH5 on line 4
Hello worldly world …
    truncateGordon:         Hello worldly world
    truncateSoapBox:        Hello worldly …
    truncateMickmackusa:    Hello worldly world
    Expected Result:        Hello worldly world
-----------------------------------------------------
    Sample Input:           abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890

    truncateGumbo:          abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateGordon:         abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateSoapBox:        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateMickmackusa:    abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    Expected Result:        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
-----------------------------------------------------

我的模式解释:

尽管看起来相当丑陋,但大部分混乱的模式语法都是将数字值作为动态量词插入的问题。

我也可以这样写:

'~(?:\S{' . $trunc . '}|(?=.{' . $max . '}).{0,' . $trunc . '}(?=\s))\K.+~us'

为简单起见,我将用“48”代替“$trunc”,用“50”代替“$max”。
~                 #opening pattern delimiter
(?=.{50})         #lookahead to ensure that the string has a minimum of 50 characters
(?:               #start of non-capturing group -- to maintain pattern logic only
  \S{48}          #the string starts with at least 48 non-white-space characters
  |               #or
  .{0,48}(?=\s)   #the string starts with upto 48 characters followed by a whitespace
)                 #end of non-capturing group
\K                #restart the fullstring match (aka "forget" the previously matched characters)
.+                #match the remaining characters (these characters will be replaced)
~                 #closing pattern delimiter
us                #pattern modifiers: unicode/multibyte flag & dot matches newlines flag

抱歉 @Gordon,这是一篇很长的文章,但我觉得分享和比较是非常有价值的。 - mickmackusa

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接