How do I use a regular expression to remove a tag and its contents?

Question


php regex

11

$str = 'some text <MY_TAG><em>contents</em></MY_TAG> more text';

My question is: how do I retrieve the content <em>contents</em> that sits between <MY_TAG> .. </MY_TAG>?

and

How do I remove <MY_TAG> and its contents from $str?

I am using PHP.

Thanks.

- user187580

4

I wonder how many times a day the following answer gets linked to: https://dev59.com/X3I-5IYBdhLWcg3wq6do#1732454 - Nicole

HTML parser, blah blah blah... you know the drill. - Ignacio Vazquez-Abrams

5 Answers

12

If MY_TAG cannot be nested, try this to get the matches:

preg_match_all('/<MY_TAG>(.*?)<\/MY_TAG>/s', $str, $matches);

To remove them, use preg_replace instead.
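As a minimal sketch of both calls, assuming a sample $str modeled on the one in the question (the variable name $stripped is illustrative only):

```php
<?php
// Assumed sample input, modeled on the question's $str.
$str = 'some text <MY_TAG><em>contents</em></MY_TAG> more text';

// Capture group 1 collects whatever sits between the opening and closing tag.
preg_match_all('/<MY_TAG>(.*?)<\/MY_TAG>/s', $str, $matches);
// $matches[1] is ['<em>contents</em>']

// The same pattern with an empty replacement removes the tag and its contents.
$stripped = preg_replace('/<MY_TAG>.*?<\/MY_TAG>/s', '', $str);
// $stripped is 'some text  more text' (the spaces on both sides of the tag remain)
```

The matches live in $matches[1] because index 0 holds the full match including the tags themselves.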

- Gumbo

1

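@user187580: The s flag makes . match newlines as well. See http://php.net/manual/en/reference.pcre.pattern.modifiers.php - Gumbo

If the tag appears more than once in the string, it is best to make this pattern ungreedy. Otherwise a string such as "This is <my_tag>a very</my_tag> important <my_tag>set</my_tag> line" is reduced to "This is  line". - Don

@Don Putting a ? after the * has the same effect. - Gumbo

I skimmed straight to this answer and did not notice the question-mark modifier, oops! - Don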
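The greedy versus non-greedy difference discussed in these comments can be sketched as follows, using the sample string from the comment above. The variable names are illustrative only:

```php
<?php
$str = 'This is <my_tag>a very</my_tag> important <my_tag>set</my_tag> line';

// Greedy: .* runs from the first opening tag to the LAST closing tag.
$greedy = preg_replace('/<my_tag>.*<\/my_tag>/s', '', $str);
// $greedy is 'This is  line'

// Lazy: .*? stops at the first closing tag, so each tag is removed separately.
$lazy = preg_replace('/<my_tag>.*?<\/my_tag>/s', '', $str);
// $lazy is 'This is  important  line'

// The U modifier makes a plain .* lazy, giving the same result as .*? without U.
$ungreedy = preg_replace('/<my_tag>.*<\/my_tag>/sU', '', $str);
// $ungreedy is 'This is  important  line'
```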

2

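You should not use regular expressions for this. A better solution is to load the content into a DOMDocument and work on it through the DOM tree and the standard DOM methods: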

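$document = new DOMDocument();
$document->loadXML('<root/>');
// DOMDocument has no createFragment(); build a fragment and parse the text into it
$fragment = $document->createDocumentFragment();
$fragment->appendXML($myTextWithTags);
$document->documentElement->appendChild($fragment);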

$MY_TAGs = $document->getElementsByTagName('MY_TAG');
foreach($MY_TAGs as $MY_TAG)
{
    $xmlContent = $document->saveXML($MY_TAG);
    /* work on $xmlContent here */

    /* as a further example: */
    $ems = $MY_TAG->getElementsByTagName('em');
    foreach($ems as $em)
    {
        $emphasizedText = $em->nodeValue;
        /* do your operations here */
    }
}
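
For the removal half of the question, the same DOM approach might be sketched as below. This assumes the input is well-formed XML (appendXML fails on arbitrary HTML); the sample $str and the names $inner and $result are illustrative only:

```php
<?php
// Assumed sample input, modeled on the question's $str.
$str = 'some text <MY_TAG><em>contents</em></MY_TAG> more text';

$document = new DOMDocument();
$document->loadXML('<root/>');
$fragment = $document->createDocumentFragment();
$fragment->appendXML($str);
$document->documentElement->appendChild($fragment);

// Copy the live node list first: removing nodes while iterating it skips elements.
$nodes = iterator_to_array($document->getElementsByTagName('MY_TAG'));
$inner = [];
foreach ($nodes as $node) {
    // Serialize the children to get the markup between the tags.
    $content = '';
    foreach ($node->childNodes as $child) {
        $content .= $document->saveXML($child);
    }
    $inner[] = $content;
    $node->parentNode->removeChild($node);
}

// Serialize the children of <root> to get the text back without the wrapper.
$result = '';
foreach ($document->documentElement->childNodes as $child) {
    $result .= $document->saveXML($child);
}
// $inner is ['<em>contents</em>'], $result is 'some text  more text'
```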

- Kris

1

Although the only fully correct approach is not to use regular expressions at all, you can still get what you want if you accept that it will not handle every edge case:

preg_match('/<em[^>]*?>.*?<\/em>/i', $str, $match);
// Use this only if you aren't worried about nested tags.
// It will handle tags with attributes

and

$str = preg_replace('/<MY_TAG[^>]*?>.*?<\/MY_TAG>/i', "", $str);
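
A short sketch of how [^>]*? lets these patterns cope with attributes, assuming a sample string with a made-up class attribute (the names $str, $match, and $stripped are illustrative):

```php
<?php
// Assumed sample input; the class attribute is hypothetical.
$str = 'some text <MY_TAG class="x"><em>contents</em></MY_TAG> more text';

// [^>]*? skips over any attributes up to the closing > of the opening tag.
preg_match('/<em[^>]*?>.*?<\/em>/i', $str, $match);
// $match[0] is '<em>contents</em>'

$stripped = preg_replace('/<MY_TAG[^>]*?>.*?<\/MY_TAG>/i', '', $str);
// $stripped is 'some text  more text'
```

Without the s modifier the . does not match newlines, so these patterns only match tags whose contents sit on a single line.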

- Nicole

1

I tested this function; it also works with nested tags, and you pass true/false to exclude/include your tags. Found here: https://www.php.net/manual/en/function.strip-tags.php

<?php
function strip_tags_content($text, $tags = '', $invert = FALSE) {

  preg_match_all('/<(.+?)[\s]*\/?[\s]*>/si', trim($tags), $tags);
  $tags = array_unique($tags[1]);
   
  if(is_array($tags) AND count($tags) > 0) {
    if($invert == FALSE) {
      return preg_replace('@<(?!(?:'. implode('|', $tags) .')\b)(\w+)\b.*?>.*?</\1>@si', '', $text);
    }
    else {
      return preg_replace('@<('. implode('|', $tags) .')\b.*?>.*?</\1>@si', '', $text);
    }
  }
  elseif($invert == FALSE) {
    return preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $text);
  }
  return $text;
}




// Sample text:
$text = '<b>sample</b> text with <div>tags</div>';

// Result for:
echo strip_tags_content($text);
// text with

// Result for:
echo strip_tags_content($text, '<b>');
// <b>sample</b> text with

// Result for:
echo strip_tags_content($text, '<b>', TRUE);
// text with <div>tags</div>
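
One caveat about nesting can be checked with a short sketch. With no $tags argument the function above reduces to a single preg_replace call, reproduced here; the sample strings are assumptions. Differently named nested tags are removed cleanly, but when a tag is nested inside another tag of the same name, the lazy match stops at the first closing tag and leaves the outer one behind:

```php
<?php
// The pattern used by strip_tags_content() when no $tags are given.
$pattern = '@<(\w+)\b.*?>.*?</\1>@si';

// <i> nested inside <b>: the backreference \1 pairs <b> with </b>.
$mixed = preg_replace($pattern, '', 'keep <b>bold <i>italic</i> bold</b> this');
// $mixed is 'keep  this'

// <div> nested inside <div>: the match ends at the inner </div>.
$same = preg_replace($pattern, '', 'keep <div>outer <div>inner</div> tail</div> this');
// $same is 'keep  tail</div> this'
```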

- proseosoc


- squarecandy · Accepted Answer

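In the end I used the following for the removal:

$str = preg_replace('~<MY_TAG(.*?)</MY_TAG>~si', "", $str);

Using ~ instead of / as the delimiter avoids the errors caused by the forward slash in the closing tag, which seemed to cause problems even when escaped. Leaving the > off the opening tag allows attributes or other characters while still matching the tag and all of its contents.

This only works where nesting is not an issue.

The si modifiers mean s = . also matches newlines, i = case-insensitive. The ? after * is what makes the match non-greedy. The U modifier is deliberately not combined with it: U inverts the greediness of every quantifier, so under U the .*? in this pattern would become greedy again.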
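
How the U modifier interacts with a ? quantifier suffix can be sketched on an assumed sample string with two tags (the attribute and variable names are illustrative). In PCRE, U inverts greediness, so quantifiers are lazy by default and become greedy when followed by ?:

```php
<?php
// Assumed sample input with two separate tags; the attribute is hypothetical.
$str = 'This is <MY_TAG a="1">a very</MY_TAG> important <my_tag>set</my_tag> line';

// Without U, .*? is lazy: each tag is removed on its own.
$lazy = preg_replace('~<MY_TAG(.*?)</MY_TAG>~si', '', $str);
// $lazy is 'This is  important  line'

// With U, .*? is greedy: the match spans from the first tag to the last.
$inverted = preg_replace('~<MY_TAG(.*?)</MY_TAG>~Usi', '', $str);
// $inverted is 'This is  line'
```

On a string containing only one tag the two patterns give the same result, which is why the difference is easy to miss.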