使用PHP的substr()和strip_tags()函数保留格式并且不破坏HTML。

46

我有多个HTML字符串需要截取到100个字符(剥离内容而不是原始内容),而不剥离标签并且不破坏HTML。

原始HTML字符串(288个字符):

I have various HTML strings to cut to 100 characters (of the stripped content, not the original) without stripping tags and without breaking HTML.

$content = "<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the air
<span>everywhere</span>, it's a HTML taggy kind of day.</strong></div>";

标准剪裁:将内容裁剪为100个字符,保留HTML换行符,去除的内容约为40个字符:

$content = substr($content, 0, 100)."..."; /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove... */

去除了HTML标签: 可以正确计算字符数,但显然失去了格式:

$content = substr(strip_tags($content)), 0, 100)."..."; /* output:
With a span over here and a nested div over there and a lot of other nested
texts and tags in the ai... */

部分解决方案:使用HTML Tidy或净化器来关闭标签,可以输出干净的HTML代码,但是会导致100个字符的HTML代码未显示内容。

$content = substr($content, 0, 100)."...";
$tidy = new tidy; $tidy->parseString($content); $tidy->cleanRepair(); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove</div></div>... */

挑战:输出干净的HTML和n个字符(不包括HTML元素的字符计数):

$content = cutHTML($content, 100); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the
ai</strong></div>...";

类似的问题

12个回答

46

并不是很棒,但可以正常运作。

function html_cut($text, $max_length)
{
    $tags   = array();
    $result = "";

    $is_open   = false;
    $grab_open = false;
    $is_close  = false;
    $in_double_quotes = false;
    $in_single_quotes = false;
    $tag = "";

    $i = 0;
    $stripped = 0;

    $stripped_text = strip_tags($text);

    while ($i < strlen($text) && $stripped < strlen($stripped_text) && $stripped < $max_length)
    {
        $symbol  = $text{$i};
        $result .= $symbol;

        switch ($symbol)
        {
           case '<':
                $is_open   = true;
                $grab_open = true;
                break;

           case '"':
               if ($in_double_quotes)
                   $in_double_quotes = false;
               else
                   $in_double_quotes = true;

            break;

            case "'":
              if ($in_single_quotes)
                  $in_single_quotes = false;
              else
                  $in_single_quotes = true;

            break;

            case '/':
                if ($is_open && !$in_double_quotes && !$in_single_quotes)
                {
                    $is_close  = true;
                    $is_open   = false;
                    $grab_open = false;
                }

                break;

            case ' ':
                if ($is_open)
                    $grab_open = false;
                else
                    $stripped++;

                break;

            case '>':
                if ($is_open)
                {
                    $is_open   = false;
                    $grab_open = false;
                    array_push($tags, $tag);
                    $tag = "";
                }
                else if ($is_close)
                {
                    $is_close = false;
                    array_pop($tags);
                    $tag = "";
                }

                break;

            default:
                if ($grab_open || $is_close)
                    $tag .= $symbol;

                if (!$is_open && !$is_close)
                    $stripped++;
        }

        $i++;
    }

    while ($tags)
        $result .= "</".array_pop($tags).">";

    return $result;
}

使用示例:

$content = html_cut($content, 100);

1
我一直收到类似于“email us at<a href=&quot...... <a onclick="ShowAll()" style="cursor:pointer;" class="morelink">Read More</a>”这样的内容。我猜数据库中的内容可能有问题,但是有人有想法吗?谢谢! - Tommy B.
2
UTF-8支持怎么样? - SayJeyHi
仍然删除所有的<a>链接。 - Metacowboy
https://gist.github.com/getmanzooronline/61e341cb8de3d98ec12b - user3563059

20

我并不是声称自己发明了这个方法,但是在CakePHP中有一个非常完整的Text::truncate()方法能够实现你的要求:

function truncate($text, $length = 100, $ending = '...', $exact = true, $considerHtml = false) {
    if (is_array($ending)) {
        extract($ending);
    }
    if ($considerHtml) {
        if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }
        $totalLength = mb_strlen($ending);
        $openTags = array();
        $truncate = '';
        preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
        foreach ($tags as $tag) {
            if (!preg_match('/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s', $tag[2])) {
                if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
                    array_unshift($openTags, $tag[2]);
                } else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
                    $pos = array_search($closeTag[1], $openTags);
                    if ($pos !== false) {
                        array_splice($openTags, $pos, 1);
                    }
                }
            }
            $truncate .= $tag[1];

            $contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
            if ($contentLength + $totalLength > $length) {
                $left = $length - $totalLength;
                $entitiesLength = 0;
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
                    foreach ($entities[0] as $entity) {
                        if ($entity[1] + 1 - $entitiesLength <= $left) {
                            $left--;
                            $entitiesLength += mb_strlen($entity[0]);
                        } else {
                            break;
                        }
                    }
                }

                $truncate .= mb_substr($tag[3], 0 , $left + $entitiesLength);
                break;
            } else {
                $truncate .= $tag[3];
                $totalLength += $contentLength;
            }
            if ($totalLength >= $length) {
                break;
            }
        }

    } else {
        if (mb_strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = mb_substr($text, 0, $length - strlen($ending));
        }
    }
    if (!$exact) {
        $spacepos = mb_strrpos($truncate, ' ');
        if (isset($spacepos)) {
            if ($considerHtml) {
                $bits = mb_substr($truncate, $spacepos);
                preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
                if (!empty($droppedTags)) {
                    foreach ($droppedTags as $closingTag) {
                        if (!in_array($closingTag[1], $openTags)) {
                            array_unshift($openTags, $closingTag[1]);
                        }
                    }
                }
            }
            $truncate = mb_substr($truncate, 0, $spacepos);
        }
    }

    $truncate .= $ending;

    if ($considerHtml) {
        foreach ($openTags as $tag) {
            $truncate .= '</'.$tag.'>';
        }
    }

    return $truncate;
}

1
如果你有一些偏移量怎么办?比如说获取前100个字符的HTML,然后下次从101-200取?这是可能的吗? - Afnan Bashir
1
谢谢你提供Cake Framework的参考。我已经复制了这个类,并且只重构了特定于异常类型的依赖项,以便尽快让现有遗留项目中的算法工作起来。 - kanevbgbe

13
使用PHP的DOMDocument类来规范化HTML片段:
$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');      
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
echo($dom->saveXml($body->item(0)));

这个问题类似于一个之前的问题,我在这里复制粘贴了一个解决方案。如果HTML是由用户提交的,你还需要过滤掉潜在的Javascript攻击向量,比如onmouseover="do_something_evil()"<a href="javascript:more_evil();">...</a>。像HTML Purifier这样的工具旨在捕捉和解决这些问题,比我能发布的任何代码都更全面。

6
我写了另一个函数来实现它,支持UTF-8:
/**
 * Limit string without break html tags.
 * Supports UTF8
 * 
 * @param string $value
 * @param int $limit Default 100
 */
function str_limit_html($value, $limit = 100)
{

    if (mb_strwidth($value, 'UTF-8') <= $limit) {
        return $value;
    }

    // Strip text with HTML tags, sum html len tags too.
    // Is there another way to do it?
    do {
        $len          = mb_strwidth($value, 'UTF-8');
        $len_stripped = mb_strwidth(strip_tags($value), 'UTF-8');
        $len_tags     = $len - $len_stripped;

        $value = mb_strimwidth($value, 0, $limit + $len_tags, '', 'UTF-8');
    } while ($len_stripped > $limit);

    // Load as HTML ignoring errors
    $dom = new DOMDocument();
    @$dom->loadHTML('<?xml encoding="utf-8" ?>'.$value, LIBXML_HTML_NODEFDTD);

    // Fix the html errors
    $value = $dom->saveHtml($dom->getElementsByTagName('body')->item(0));

    // Remove body tag
    $value = mb_strimwidth($value, 6, mb_strwidth($value, 'UTF-8') - 13, '', 'UTF-8'); // <body> and </body>
    // Remove empty tags
    return preg_replace('/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/', '', $value);
}

查看演示.

我建议在函数开头使用html_entity_decode,这样可以保留UTF-8字符:

 $value = html_entity_decode($value);

5

2
+1 对于 HTML 解析器,并倾向于一个解决方案,如果输出文本不包括 HTML。一个快速的例子可能会很好。 - Peter Craig

3

1
+1 给 Tidy HTML(同时作为部分解决方案添加),但它不能处理输出文本的100个字符。 - Peter Craig

2

以下是我在我的项目中使用的一个函数。基于DOMDocument,适用于HTML5,比我尝试过的其他解决方案快大约2倍(至少在我的电脑上,在限制为400的500个文本节点字符字符串上使用顶部答案中的html_cut($text, $max_length)时,0.22毫秒对0.43毫秒)。

function cut_html ($html, $limit) {
    $dom = new DOMDocument();
    $dom->loadHTML(mb_convert_encoding("<div>{$html}</div>", "HTML-ENTITIES", "UTF-8"), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    cut_html_recursive($dom->documentElement, $limit);
    return substr($dom->saveHTML($dom->documentElement), 5, -6);
}

function cut_html_recursive ($element, $limit) {
    if($limit > 0) {
        if($element->nodeType == 3) {
            $limit -= strlen($element->nodeValue);
            if($limit < 0) {
                $element->nodeValue = substr($element->nodeValue, 0, strlen($element->nodeValue) + $limit);
            }
        }
        else {
            for($i = 0; $i < $element->childNodes->length; $i++) {
                if($limit > 0) {
                    $limit = cut_html_recursive($element->childNodes->item($i), $limit);
                }
                else {
                    $element->removeChild($element->childNodes->item($i));
                    $i--;
                }
            }
        }
    }
    return $limit;
}

2
无论您在开头提到的100个问题如何,您在挑战中指出了以下内容:
  • 输出strip_tags的字符计数(HTML实际显示文本中的字符数)
  • 保留HTML格式
  • 任何未完成的HTML标记
我的建议是:基本上,我会逐个字符解析并进行计数。我确保不计算任何HTML标签中的字符。同时,我会在结束时检查,以确保停止时不在单词的中间。一旦停止,我会回溯到第一个可用的空格或>作为终止点。
$position = 0;
$length = strlen($content)-1;

// process the content putting each 100 character section into an array
while($position < $length)
{
    $next_position = get_position($content, $position, 100);
    $data[] = substr($content, $position, $next_position);
    $position = $next_position;
}

// show the array
print_r($data);

function get_position($content, $position, $chars = 100)
{
    $count = 0;
    // count to 100 characters skipping over all of the HTML
    while($count <> $chars){
        $char = substr($content, $position, 1); 
        if($char == '<'){
            do{
                $position++;
                $char = substr($content, $position, 1);
            } while($char !== '>');
            $position++;
            $char = substr($content, $position, 1);
        }
        $count++;
        $position++;
    }
echo $count."\n";
    // find out where there is a logical break before 100 characters
    $data = substr($content, 0, $position);

    $space = strrpos($data, " ");
    $tag = strrpos($data, ">");

    // return the position of the logical break
    if($space > $tag)
    {
        return $space;
    } else {
        return $tag;
    }  
}

这也将计算返回代码等,考虑到它们会占用空间,我没有将它们删除。


1
感谢您的努力和考虑,最好不要切断最后一个单词,这样可以加1分。我猜您可以传递一个值,将字符串切割在前一个逻辑断点或下一个逻辑断点处。如果这很重要,下一个可能会超过字符限制。感谢您的努力! - Peter Craig

1

这是我的尝试,对于切割器。也许你们能够发现一些错误。我发现其他解析器的问题在于它们不能正确地关闭标签,而且会在单词(blah)中间进行切割。

function cutHTML($string, $length, $patternsReplace = false) {
    $i = 0;
    $count = 0;
    $isParagraphCut = false;
    $htmlOpen = false;
    $openTag = false;
    $tagsStack = array();

    while ($i < strlen($string)) {
        $char = substr($string, $i, 1);
        if ($count >= $length) {
            $isParagraphCut = true;
            break;
        }

        if ($htmlOpen) {
            if ($char === ">") {
                $htmlOpen = false;
            }
        } else {
            if ($char === "<") {
                $j = $i;
                $char = substr($string, $j, 1);

                while ($j < strlen($string)) {
                    if($char === '/'){
                        $i++;
                        break;
                    }
                    elseif ($char === ' ') {
                        $tagsStack[] = substr($string, $i, $j);
                    }
                    $j++;
                }
                $htmlOpen = true;
            }
        }

        if (!$htmlOpen && $char != ">") {
            $count++;
        }

        $i++;
    }

    if ($isParagraphCut) {
        $j = $i;
        while ($j > 0) {
            $char = substr($string, $j, 1);
            if ($char === " " || $char === ";" || $char === "." || $char === "," || $char === "<" || $char === "(" || $char === "[") {
                break;
            } else if ($char === ">") {
                $j++;
                break;
            }
            $j--;
        }
        $string = substr($string, 0, $j);
        foreach($tagsStack as $tag){
            $tag = strtolower($tag);
            if($tag !== "img" && $tag !== "br"){
                $string .= "</$tag>";
            }
        }
        $string .= "...";
    }

    if ($patternsReplace) {
        foreach ($patternsReplace as $value) {
            if (isset($value['pattern']) && isset($value["replace"])) {
                $string = preg_replace($value["pattern"], $value["replace"], $string);
            }
        }
    }
    return $string;
}

差不多了!但是当我使用那段代码时,仍然会得到:email us at<a href=&quot...... <a onclick="ShowAll()" style="cursor:pointer;" class="morelink">Read More</a>(与好答案相同)。有什么想法吗?谢谢! - Tommy B.
不确定,但看起来你从某个数据库中提取了文本,并且引号从 " 更改为 "e;,尝试使用 htmlspecialchars-decode(http://php.net/manual/en/function.htmlspecialchars-decode.php)函数。 - Jacob

0

请尝试以下方法:

<?php echo strip_tags(mb_strimwidth($VARIABLE_HERE, 0, 160, "...")); ?>

这将剥离 HTML(strip_tags)并限制字符数(mb_strimwidth)为160个字符


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接