Regex and PHP - isolate the src attribute from an img tag

Question

42

Using PHP, how can I isolate the contents of the src attribute from $foo? The end result I'm looking for is just "http://example.com/img/image.jpg".

$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" width="100" height="100" />';

- Jeff

1

@meagar - Within this limited scope, using a regex is valid (although not necessarily the most efficient approach). - John Parker

10

Don't use regular expressions to parse HTML. (Not being sarcastic!) - Mark Byers

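I misspoke in the title of my original post and shouldn't have mentioned regex. I really like karim79's solution, but it requires adding a non-standard class. - Jeff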

Does this answer your question? How do you parse and process HTML/XML in PHP? - TylerH

11 Answers

40

Code

<?php
    $foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" width="100" height="100" />';
    $array = array();
    preg_match( '/src="([^"]*)"/i', $foo, $array ) ;
    print_r( $array[1] ) ;

Output

http://example.com/img/image.jpg

- St.Woland

Watch out for & entity references and numeric character references in the result! - bobince

1

Whatever works for you! =) Here is an alternative syntax: /src="(.*?)"/i. - Alix Axel

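HTML also allows single quotes, as long as they match. In addition, the "alternative syntax" can match more characters than intended. Finally, img attributes can have whitespace at the beginning and end. - XedinUnknown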

It should be: /[sS][rR][cC]\s*=\s*['"]([^'"]+)['"]/i - jewelnguyen8

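@jewel Why write case-insensitive character classes and then also add the case-insensitive pattern modifier at the end? It is redundant and only clutters the pattern. - mickmackusa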
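
The comments above raise three points: the quotes may be single or double as long as they match, there may be whitespace around the =, and the captured value may still contain entity references. A minimal sketch that combines them, assuming PHP 7.1+ (for the nullable return type); the helper name extractSrc is hypothetical and does not come from any of the answers:

```php
<?php
// Hypothetical helper, for illustration only.
function extractSrc(string $tag): ?string
{
    // \s before "src" avoids matching attributes such as data-src.
    // (["\']) captures the opening quote and \1 demands the same closing quote.
    if (!preg_match('/\ssrc\s*=\s*(["\'])(.*?)\1/is', $tag, $m)) {
        return null; // no quoted src attribute found
    }
    // Trim stray whitespace, then decode &amp; and numeric character references.
    return html_entity_decode(trim($m[2]), ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo extractSrc("<img alt='x' src = 'a.jpg?x=1&amp;y=2'>");
```

This is still a regex over a single tag, so the usual caveats about parsing HTML with regular expressions apply; unquoted attribute values, for instance, are not handled.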

9

I got this code:

$dom = new DOMDocument();
$dom->loadHTML($img);
echo $dom->getElementsByTagName('img')->item(0)->getAttribute('src');

Assuming there is only one image :P

- AntonioCS

7

// Create DOM from string
$html = str_get_html('<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" width="100" height="100" />');

// echo the src attribute
echo $html->find('img', 0)->src;

http://simplehtmldom.sourceforge.net/

- karim79

4

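I'm very late to this question, but I have a simple solution that hasn't been mentioned yet. If you have simplexml enabled, load the string with simplexml_load_string and then convert it via json_encode and json_decode.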

$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" width="100" height="100" />';

$parsedFoo = json_decode(json_encode(simplexml_load_string($foo)), true);
var_dump($parsedFoo['@attributes']['src']); // output: "http://example.com/img/image.jpg"

$parsedFoo comes out as

array(1) {
  ["@attributes"]=>
  array(6) {
    ["class"]=>
    string(12) "foo bar test"
    ["title"]=>
    string(10) "test image"
    ["src"]=>
    string(32) "http://example.com/img/image.jpg"
    ["alt"]=>
    string(10) "test image"
    ["width"]=>
    string(3) "100"
    ["height"]=>
    string(3) "100"
  }
}

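I've been using this for several months to parse XML and HTML, and it has worked well. I haven't run into any problems so far, although I haven't used it on large files (I'd expect the json_encode and json_decode round trip to get slower as the input grows). It is a bit convoluted, but it is certainly the simplest way to read HTML attributes.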

- Josh

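Last week I found a small problem. If an XML node has both attributes and a value, only the value is accessible with this method. I ended up having to write a simple parser that converts the simplexml to an array and preserves all the data. - Josh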
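
For reference, SimpleXMLElement also exposes attributes through array-style access, which avoids the JSON round trip and the attribute-versus-value issue described in the comment. A minimal sketch, assuming the simplexml extension is enabled and the tag is well-formed XML (self-closed, as in the question); the helper name srcFromXmlTag is hypothetical:

```php
<?php
// Hypothetical helper, for illustration only.
function srcFromXmlTag(string $tag): ?string
{
    // Collect parser errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $element = simplexml_load_string($tag);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // simplexml_load_string() returns false when the input is not well-formed XML.
    if ($element === false || !isset($element['src'])) {
        return null;
    }
    // Attributes are SimpleXMLElement objects; cast to get the plain string.
    return (string) $element['src'];
}

$foo = '<img class="foo bar test" src="http://example.com/img/image.jpg" alt="test image" />';
echo srcFromXmlTag($foo);
```

Because this is an XML parser, ordinary HTML such as an unclosed `<img ...>` is rejected, so it only suits input known to be XML-compatible.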

2

This is what I ended up doing, though I'm not sure how efficient it is:

$imgsplit = explode('"',$data);
foreach ($imgsplit as $item) {
    if (strpos($item, 'http') !== FALSE) {
        $image = $item;
        break;
    }
}

- Jeff

1

If the image URL is relative to the document, e.g. "../../img/something.jpg", this approach will run into problems. - tomfumb

1

<?php
    $html = '
        <img border="0" src="/images/image1.jpg" alt="Image" width="100" height="100" />
        <img border="0" src="/images/image2.jpg" alt="Image" width="100" height="100" />
        <img border="0" src="/images/image3.jpg" alt="Image" width="100" height="100" />
        ';
    
    $get_Img_Src = '/<img[^>]*src=([\'"])(?<src>.+?)\1[^>]*>/i'; //for get img src path only...
    
    preg_match_all($get_Img_Src, $html, $result); 
    if (!empty($result)) {
        echo $result['src'][0];
        echo $result['src'][1];
    }

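If you need both the image path and the alt text, use the regex below instead of the one above. Note that named groups are also numbered, so the quote in front of the alt value is group 3, not group 2:

<img[^>]*src=(['"])(?<src>.+?)\1[^>]*alt=(['"])(?<alt>.+?)\3[^>]*>

    $get_Img_Src = '/<img[^>]*src=([\'"])(?<src>.+?)\1[^>]*alt=([\'"])(?<alt>.+?)\3[^>]*>/i'; // gets the img src path and the alt text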
    
    preg_match_all($get_Img_Src, $html, $result); 
    if (!empty($result)) {
        echo $result['src'][0];
        echo $result['src'][1];
        echo $result['alt'][0];
        echo $result['alt'][1];
    }

I got the idea for this solution from here: PHP extract link from href tag.

If you only want to extract URLs for a specific domain, try the following regex

// e.g. if you need to extract only the URLs of "test.com",
// you can do it with the regex below

<a[^>]+href=([\'"])(?<href>(https?:\/\/)?test\.com.*?)\1[^>]*>

- Harsh Patel

Parsing valid HTML with regex is an unnecessary risk. - mickmackusa

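Yes, but if we want to validate form data or manipulate an HTML string, we can use a regex for the extraction. I have used the regex above in my own project, which is why I shared a regex-only solution for extracting the src path. - Harsh Patel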

I shared the solution for learning purposes only. - Harsh Patel
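
In the spirit of the comments above, the domain filter can also be sketched without a regex, using the built-in DOMDocument class and parse_url. This is only an illustration under stated assumptions: the helper name linksForHost is hypothetical, and the host comparison is an exact, case-insensitive match, which is stricter than the regex (it skips scheme-less values such as "test.com/page" and subdomains).

```php
<?php
// Hypothetical helper, for illustration only.
function linksForHost(string $html, string $host): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely perfectly valid; keep parser warnings internal.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $matches = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // parse_url() yields null for relative URLs that carry no host.
        $linkHost = parse_url($href, PHP_URL_HOST);
        if (is_string($linkHost) && strcasecmp($linkHost, $host) === 0) {
            $matches[] = $href;
        }
    }
    return $matches;
}

$html = '<a href="https://test.com/a">one</a> <a href="https://example.org/b">two</a>';
print_r(linksForHost($html, 'test.com'));
```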

1

You can solve this with the following function:

function getTextBetween($start, $end, $text)
{
    $start_from = strpos($text, $start);
    if ($start_from === false) {
        return null; // start delimiter not found
    }
    $start_pos = $start_from + strlen($start);
    $end_pos = strpos($text, $end, $start_pos);
    if ($end_pos === false) {
        return null; // end delimiter not found
    }
    // substr() expects a length as its third argument, not an end position
    return substr($text, $start_pos, $end_pos - $start_pos);
}
$foo = '<img class="foo bar test" title="test image" 
src="http://example.com/img/image.jpg" alt="test image"
width="100" height="100" />';
$img_src = getTextBetween('src="', '"', $foo);

- Joel A. Villarreal Bertoldi

0

I use preg_match_all to capture all the images in an HTML document:

// [^>]* (rather than .*) keeps each match inside a single tag
preg_match_all("~<img[^>]*src\s*=\s*[\"']([^\"']+)[\"'][^>]*>~i", $body, $matches);

This one allows a more relaxed declaration syntax, with either kind of quote and optional whitespace.

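The regex reads like: <img (any attributes, such as style or border) src (possible whitespace) = (possible whitespace) (' or ") (any non-quote characters) (' or ") (anything up to >) (>).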

- Mike

0

Try this pattern (the x modifier is needed so that the spaces in the pattern are ignored):

'/< \s* img [^\>]* src \s* = \s* [\"\']? ( [^\"\'\s>]* )/x'

- user256058

If img is capitalized or the title contains ">", this won't work. Using an HTML parser would be more robust. - Mark Byers

- John Parker · Accepted Answer

If you don't want to use regex (or any non-standard PHP components), a reasonable solution using the built-in DOMDocument class would be as follows:

<?php
    $doc = new DOMDocument();
    $doc->loadHTML('<img src="http://example.com/img/image.jpg" ... />');
    $imageTags = $doc->getElementsByTagName('img');

    foreach($imageTags as $tag) {
        echo $tag->getAttribute('src');
    }
?>
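
A slightly more defensive variant of the same idea, as a sketch rather than a definitive implementation: it keeps libxml's warnings about imperfect markup internal, skips img tags without a src, and returns the values instead of echoing them. The function name collectImageSources is hypothetical.

```php
<?php
// Hypothetical helper, for illustration only.
function collectImageSources(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $tag) {
        if ($tag->hasAttribute('src')) {
            // getAttribute() returns the value with entity references already decoded.
            $sources[] = $tag->getAttribute('src');
        }
    }
    return $sources;
}

print_r(collectImageSources('<p><img src="a.jpg"><img alt="no source"><img src="b.png?x=1&amp;y=2"></p>'));
```

Because the DOM parser decodes entity references itself, the concern raised in the comments about &amp; in regex results does not arise here.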