PHP regular expression: removing tags from an HTML document

Question


7

Suppose I have the following text:

..(content).............
<A HREF="http://foo.com/content" >blah blah blah </A>
...(continue content)...

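I want to remove the link and strip the <A> tags (while keeping the text in between). How can I do this with a regular expression, given that the URLs all differ?

Thanks a lot.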

- Señor Reginold Francis

Possible duplicate of "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" - Brad Mace

1

Possible duplicate of "RegEx match open tags except XHTML self-contained tags". - Paŭlo Ebermann

8 Answers

16

Avoid regular expressions whenever you can, especially when processing XML. In this case you can use strip_tags() or simplexml, depending on your string.
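As an illustration of the strip_tags() suggestion (this sketch is not part of the original answer), the snippet below runs the function on the sample text from the question. The variable names are made up for the example. Note that strip_tags() removes every tag, not just anchors, unless tags are listed in its second parameter.

```php
<?php
// Sample input modelled on the text in the question.
$text = "..(content)...\n<A HREF=\"http://foo.com/content\" >blah blah blah </A>\n...(continue content)...";

// strip_tags() removes all tags and keeps the text between them.
$plain = strip_tags($text);
echo $plain, PHP_EOL;

// The optional second argument lists tags that should be kept.
$kept = strip_tags('<p>Hello <b>World</b></p>', '<b>');
echo $kept, PHP_EOL;
```

If only the anchors should go while other markup stays, the allowed-tags list would have to name every tag you want to keep, which may or may not be practical for your input.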

- soulmerge

4

<?php
// Example: extract the inner text from all anchors in a string
include('simple_html_dom.php');

$html = str_get_html('<A HREF="http://foo.com/content" >blah blah blah </A><A HREF="http://foo.com/content" >blah blah blah </A>');

// Print the text of each anchor (the property name is lowercase: innertext)
foreach ($html->find('a') as $e) {
    echo $e->innertext;
}
?>

Take a look at the PHP Simple HTML DOM Parser.

- karim79

3

Not pretty, but it does the job:

// Case-insensitive variants, so that <A HREF=...> and </A> are matched too
$data = str_ireplace('</a>', '', $data);
$data = preg_replace('/<a[^>]+href[^>]+>/i', '', $data);

- Rufinus

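strip_tags works well when the HTML is well-formed. I had a problem with an HTML file whose attributes were missing their quotes, and this approach worked well. Thanks! - FrancescoR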

1

strip_tags() can also be used.

See the example here.

- MIV1987

1

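Welcome to Stack Overflow! While this may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. - senderle

@senderle, I mostly agree with you, but this time it is not "any" external page. It is the official PHP.net page describing the strip_tags function, and copying the code example here is not necessary ;) This answer already contains the function name and a link to its reference. - Wh1T3h4Ck5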

1

$pattern = '/href="([^"]*)"/';
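On its own, this pattern only captures the value of a double-quoted href attribute; it does not remove anything. A minimal sketch of how it might be used to list the URLs follows. The sample markup and variable names are assumptions for the example, and the `i` flag is added here (it is not in the answer) so that the uppercase HREF from the question also matches.

```php
<?php
// Hypothetical sample input: one uppercase anchor as in the question, one lowercase.
$html = '<A HREF="http://foo.com/content" >blah blah blah </A> <a href="http://bar.com/">bar</a>';

// Pattern from the answer, plus the i flag (an assumption) for case-insensitive matching.
$pattern = '/href="([^"]*)"/i';

// preg_match_all() fills $matches; index 1 holds the first capture group (the URLs).
preg_match_all($pattern, $html, $matches);
$urls = $matches[1];

print_r($urls);
```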

- Paulo Peres Junior

0

I use this to replace anchors with a text string...

function replaceAnchorsWithText($data) {
    $regex  = '/<a\s+';                   // Start of anchor tag
    $regex .= '[^>]*?';                   // Any attributes that may or may not exist before href
    $regex .= 'href\s*=\s*[\'"]\s*(?P<link>[^\'"\s]+)\s*[\'"]'; // Grab the link
    $regex .= '[^>]*>\s*';                // Any attributes or spaces before the end of the opening tag
    $regex .= '(?P<name>.*?)';            // Grab the name (may contain spaces)
    $regex .= '\s*<\/a>/is';              // Closing anchor tag (case insensitive, dot matches newlines)

    // The callback builds what will replace each link (modify to your liking)
    return preg_replace_callback($regex, function ($match) {
        return "{$match['name']}({$match['link']})";
    }, $data);
}

- SoN9ne

-2

Use str_replace.

- nandocurty

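How is he supposed to do that when the href strings differ? - Rufinus

(I'm not the downvoter, but since it looks like they won't explain why they downvoted, which isn't very helpful, let me guess at the reason...) With str_replace you cannot specify a "pattern", which is a problem because the URL can change. Even if it did not change, you would still need two calls to str_replace: one for the opening tag and one for the closing tag, since you want to keep the content in between. - Pascal MARTIN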


- nickf · Accepted Answer

This will remove all tags:

preg_replace("/<.*?>/", "", $string);

This will remove only the <a> tags:

preg_replace("/<\\/?a(\\s+.*?>|>)/i", "", $string);
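To see the difference between the two patterns, here is a small sketch that runs both on a sample string. The sample markup and variable names are assumptions for the example, and the anchor pattern uses the case-insensitive `i` flag so that the uppercase `<A HREF=...>` from the question is matched.

```php
<?php
// Hypothetical sample: a bold tag that should survive, and an anchor as in the question.
$string = '<b>bold</b> <A HREF="http://foo.com/content" >blah blah blah </A>';

// Removes every tag.
$noTags = preg_replace('/<.*?>/', '', $string);

// Removes only opening and closing <a> tags; other tags are left alone.
$noAnchors = preg_replace('/<\/?a(\s+.*?>|>)/i', '', $string);

echo $noTags, PHP_EOL;
echo $noAnchors, PHP_EOL;
```

preg_replace() returns the modified string rather than changing its argument, so the result needs to be assigned to a variable, as done above.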