How to replace text URLs and exclude URLs in HTML tags?

Question

13

I need your help.

I want to turn this:

sometext sometext http://www.somedomain.com/index.html sometext sometext

into this:

sometext sometext <a href="http://www.somedomain.com/index.html">www.somedomain.com/index.html</a> sometext sometext

I did it using this regular expression:

preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#ie", "'<a href=\"$1\" target=\"_blank\">$1</a>$4'", $text);

The problem is that it also replaces the URL of an img, for example:

sometext sometext <img src="http://domain.com/image.jpg"> sometext sometext

is turned into:

sometext sometext <img src="<a href="http://domain.com/image.jpg">domain.com/image.jpg</a>"> sometext sometext

Please help.

- Andri

Possible duplicate of Can you provide some examples of why it is hard to parse XML and HTML with a regex? - Brad Mace

Possible duplicate of RegEx match open tags except XHTML self-contained tags - Paŭlo Ebermann

7 Answers

4

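You should not do this with regular expressions alone. Use a proper HTML DOM parser such as PHP's DOM library instead. You can then iterate over the nodes, check whether each one is a text node, and run the regular expression search and replace on the text nodes only. Something like this should do it: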

$pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
$doc = new DOMDocument();
$doc->loadHTML($str);
// for every element in the document
foreach ($doc->getElementsByTagName('*') as $elem) {
    // for every child node in each element
    foreach ($elem->childNodes as $node) {
        if ($node->nodeType === XML_TEXT_NODE) {
            // split the text content to get an array of 1+2*n elements for n URLs in it
            $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
            $n = count($parts);
            if ($n > 1) {
                $parentNode = $node->parentNode;
                // insert for each pair of non-URL/URL parts one DOMText and DOMElement node before the original DOMText node
                for ($i=1; $i<$n; $i+=2) {
                    $a = $doc->createElement('a');
                    $a->setAttribute('href', $parts[$i]);
                    $a->setAttribute('target', '_blank');
                    $a->appendChild($doc->createTextNode($parts[$i]));
                    $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
                    $parentNode->insertBefore($a, $node);
                }
                // insert the last part before the original DOMText node
                $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
                // remove the original DOMText node
                $node->parentNode->removeChild($node);
            }
        }
    }
}

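Okay, since the DOMNodeLists returned by getElementsByTagName and childNodes are live, every change in the DOM is reflected in those lists, so you cannot use foreach here: it would also iterate over the newly added nodes. Instead, you need to use for loops, keep track of the nodes you added so that you can advance the index pointers, and ideally adjust the pre-calculated array boundaries accordingly.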
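The live behaviour described above can be seen in a few lines. This is only a minimal sketch: the tiny XML snippet and the variable names are made up for illustration and are not part of the code above.

```php
<?php
// Minimal sketch: a DOMNodeList obtained from childNodes is live.
$doc = new DOMDocument();
$doc->loadXML('<p>one</p>');
$p = $doc->documentElement;

$children = $p->childNodes;      // fetched once, before any change
$before = $children->length;     // counts only the text node "one"

$p->appendChild($doc->createElement('b'));
$after = $children->length;      // the same list object now also sees <b>

echo "$before -> $after\n";
```

Because the list grows while it is being traversed, a loop that inserts nodes has to move its index past whatever it inserted.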

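But since tracking the indexes like that is quite difficult in such a somewhat complex algorithm (you would need an index pointer and an array boundary for each of the three for loops), using a recursive algorithm is more convenient: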

function mapOntoTextNodes(DOMNode $node, $callback) {
    if ($node->nodeType === XML_TEXT_NODE) {
        return $callback($node);
    }
    for ($i=0, $n=$node->childNodes->length; $i<$n; ++$i) {
        $nodesChanged = 0;
        switch ($node->childNodes->item($i)->nodeType) {
            case XML_ELEMENT_NODE:
                $nodesChanged = mapOntoTextNodes($node->childNodes->item($i), $callback);
                break;
            case XML_TEXT_NODE:
                $nodesChanged = $callback($node->childNodes->item($i));
                break;
        }
        if ($nodesChanged !== 0) {
            $n += $nodesChanged;
            $i += $nodesChanged;
        }
    }
    // changes inside a child element never shift the parent's own child list
    return 0;
}
function foo(DOMText $node) {
    $pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
    $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    $n = count($parts);
    if ($n > 1) {
        $parentNode = $node->parentNode;
        $doc = $node->ownerDocument;
        for ($i=1; $i<$n; $i+=2) {
            $a = $doc->createElement('a');
            $a->setAttribute('href', $parts[$i]);
            $a->setAttribute('target', '_blank');
            $a->appendChild($doc->createTextNode($parts[$i]));
            $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
            $parentNode->insertBefore($a, $node);
        }
        $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
        $parentNode->removeChild($node);
    }
    return $n-1;
}

$str = '<div>sometext http://www.somedomain.com/index.html sometext <img src="http//domain.com/image.jpg"> sometext sometext</div>';
$doc = new DOMDocument();
$doc->loadHTML($str);
$elems = $doc->getElementsByTagName('body');
mapOntoTextNodes($elems->item(0), 'foo');

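Here mapOntoTextNodes is used to map a given callback function onto every DOMText node in a DOM document. You can pass either the whole DOMDocument node or just a specific DOMNode (in this case only the BODY node).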

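The function foo is then used to find and replace the plain URLs in the content of a DOMText node. It splits the content string into non-URL / URL parts with preg_split while capturing the delimiter, which yields an array of 1 + 2·n items for n URLs. The non-URL parts are turned into new DOMText nodes and the URL parts into new A elements, all of which are inserted before the original DOMText node, and the original node is removed at the end. Since mapOntoTextNodes walks the tree recursively, it is enough to call it once on a specific DOMNode.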
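To make the 1 + 2·n layout concrete, here is a small sketch of what preg_split returns with PREG_SPLIT_DELIM_CAPTURE. The shortened pattern and the sample string are assumptions chosen for readability; they are not the pattern used in foo.

```php
<?php
// Simplified URL pattern (an assumption for brevity, not the pattern from the answer).
$pattern = '~(https?://\S+)~i';
$text = 'a http://x.com b http://y.org c';

// PREG_SPLIT_DELIM_CAPTURE keeps the captured URLs between the non-URL parts.
$parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// Even indexes hold plain text, odd indexes hold the captured URLs.
foreach ($parts as $i => $part) {
    echo ($i % 2 ? 'URL : ' : 'text: '), var_export($part, true), "\n";
}
```

With two URLs in the input the array has five items, which is why the loop in foo starts at index 1 and steps by 2.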

- Gumbo

Thanks for the answer, but I need to use a regex because it is lighter and faster than using multiple functions. - Andri

6

@Andri: But with regular expressions you may get unexpected results, since HTML is an irregular language. - Gumbo

1

Thanks for the reply, but it still doesn't work. I fixed it using this function:

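function livelinked ($text){
        // the /e modifier only ever applied to preg_replace and was removed in PHP 7, so only /i is used
        preg_match_all("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)|^(jpg)#i", $text, $ccs);
        $old = array();
        $new = array();
        foreach ($ccs[3] as $cc) {
           // strpos() can return 0 for a match at the start, so compare with === false
           if (strpos($cc,"jpg")===false  && strpos($cc,"gif")===false && strpos($cc,"png")===false ) {
              $old[] = "http://".$cc;
              $new[] = '<a href="http://'.$cc.'" target="_blank">'.$cc.'</a>';
           }
        }
        return str_replace($old,$new,$text);
}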

- Andri

0

Match whitespace (\s) at the start and at the end of the URL string. This ensures that

"http://url.com"

is not matched, while

http://url.com

is matched.
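One way to sketch this idea is shown below. The pattern is an assumption, not something given in the answer: it uses lookarounds instead of consuming the whitespace, so that a URL at the very start or end of the text still qualifies.

```php
<?php
// Assumed pattern: the URL must not touch a non-whitespace character on either side.
$pattern = '~(?<!\S)((?:https?|ftp)://\S+)(?!\S)~i';

$plain  = 'visit http://url.com today';
$quoted = '<img src="http://url.com">';

$plainHit  = preg_match($pattern, $plain, $m);  // URL stands alone
$quotedHit = preg_match($pattern, $quoted);     // URL is wrapped in quotes

echo $plainHit, ' ', $quotedHit, "\n";
```

Note that a URL followed directly by punctuation, such as a full stop at the end of a sentence, would be captured together with that punctuation by this sketch.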

- stone

0

DomDocument is more mature and runs much faster, so this is just an alternative in case someone wants to use PHP Simple HTML DOM Parser:

<?php
require_once('simple_html_dom.php');

$html = str_get_html('sometext sometext http://www.somedomain.com/index.html sometext sometext
<a href="http://www.somedomain.com/index.html">http://www.somedomain.com/index.html</a>
sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext');

foreach ($html->find('text') as $element)
{
    // you can add any tag into the array to exclude from replace
    if (!in_array($element->parent()->tag, array('a')))
        $element->innertext = preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i", '<a href="$1" target="_blank">$1</a>$4', $element->innertext);
}

echo $html;

- István Ujj-Mészáros

1

Suggested third-party alternatives to SimpleHtmlDom that actually use DOM instead of string parsing: phpQuery, Zend_Dom, QueryPath and FluentDom. - Gordon

0

You can try my code from this question:

echo preg_replace('/<a href="([^"]*)([^<\/]*)<\/a>/i', "$1", 'sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext');

If you want to convert other tags, that is easy:

echo preg_replace('/<img src="([^"]*)([^\/><]*)>/i', "$1", 'sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext');

- shybovycha

0

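If you want to keep using a regex (and in this case a regex is quite appropriate), you can make the regex match only URLs that "stand alone". Using the word boundary escape sequence (\b), you can make the regex match only where http is immediately preceded by whitespace or the beginning of the text: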

preg_replace("#\b((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i", '<a href="$1" target="_blank">$1</a>$4', $text);
            // ^^ thar she blows

So "http://..." won't match, but http:// as its own word will.

- kevingessner

1

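It also would not match any URL at the end of a sentence, for example one followed by a full stop, or URLs in a comma-separated enumeration. Needless to say, quotes are not even required around HTML attribute values. - Gordon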

1

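The description of the word boundary is also incorrect. Used like this, \b only asserts that http, https or ftp is not immediately preceded by a letter, digit or underscore. It will match before the h in "http or =http, so it does not prevent matches in attribute values as you claim. - Alan Moore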
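The point about \b can be checked directly. The snippet below is a small sketch with made-up sample strings, using a reduced pattern that keeps only the \b and the scheme.

```php
<?php
// Reduced pattern: only the word boundary and the scheme are kept for the demonstration.
$pattern = '~\bhttp://~';

// A quote or an equals sign is a non-word character, so a boundary exists before "h".
$inAttribute = preg_match($pattern, '<img src="http://domain.com/image.jpg">');
$afterEquals = preg_match($pattern, 'src=http://domain.com/image.jpg');

// A letter directly before "h" is a word character, so there is no boundary.
$afterLetter = preg_match($pattern, 'xhttp://domain.com');

echo $inAttribute, ' ', $afterEquals, ' ', $afterLetter, "\n";
```

Only the last case is rejected, which is why \b on its own does not keep the pattern out of attribute values.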

- Gordon · Accepted Answer

A streamlined version of Gumbo's answer above:

$html = <<< HTML
<html>
<body>
<p>
    This is a text with a <a href="http://example.com/1">link</a>
    and another <a href="http://example.com/2">http://example.com/2</a>
    and also another http://example.com with the latter being the
    only one that should be replaced. There is also images in this
    text, like <img src="http://example.com/foo"/> but these should
    not be replaced either. In fact, only URLs in text that is no
    a descendant of an anchor element should be converted to a link.
</p>
</body>
</html>
HTML;

Let's use an XPath that fetches only text nodes that actually contain http://, https:// or ftp:// and that are not themselves text nodes inside an anchor element.

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and (
        contains(.,"http://") or
        contains(.,"https://") or
        contains(.,"ftp://") )]'
);

The XPath above will give us a TextNode with the following data:

 and also another http://example.com with the latter being the
    only one that should be replaced. There is also images in this
    text, like

Since PHP 5.3 we could also use PHP functions inside the XPath to select the nodes with the regex pattern, instead of the three calls to contains.
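A rough sketch of that variant follows, assuming DOMXPath::registerPhpFunctions together with the php:functionString XPath extension function. The sample markup and the shortened pattern are made up for illustration.

```php
<?php
$dom = new DOMDocument;
$dom->loadXML('<p>see http://example.com and <a href="http://a.com">http://a.com</a> plain</p>');

$xPath = new DOMXPath($dom);
// The php: prefix has to be bound to this namespace before PHP functions can be called.
$xPath->registerNamespace('php', 'http://php.net/xpath');
// Expose only preg_match to the XPath expression.
$xPath->registerPhpFunctions('preg_match');

// functionString passes the string value of the context node (.) as the subject.
$texts = $xPath->query(
    '//text()[not(ancestor::a) and
        php:functionString("preg_match", "~(?:https?|ftp)://~i", .) = 1]'
);

echo $texts->length, "\n";
```

The result is compared with 1 explicitly rather than used as a bare predicate, so the test does not depend on how the return value of preg_match is converted to an XPath type.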

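Instead of splitting the text nodes apart in the standards-compliant way, we will use a document fragment and simply replace the entire text node with the fragment. "Non-standard" in this case only means that the method we will be using is not part of the W3C specification of the DOM API.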

foreach ($texts as $text) {
    $fragment = $dom->createDocumentFragment();
    $fragment->appendXML(
        preg_replace(
            "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i",
            '<a href="$1">$1</a>',
            $text->data
        )
    );
    $text->parentNode->replaceChild($fragment, $text);
}
echo $dom->saveXML($dom->documentElement);

This will then output:

<html><body>
<p>
    This is a text with a <a href="http://example.com/1">link</a>
    and another <a href="http://example.com/2">http://example.com/2</a>
    and also another <a href="http://example.com">http://example.com</a> with the latter being the
    only one that should be replaced. There is also images in this
    text, like <img src="http://example.com/foo"/> but these should
    not be replaced either. In fact, only URLs in text that is no
    a descendant of an anchor element should be converted to a link.
</p>
</body></html>