Node JS: grab the first image from an HTML string

Question

5

I'm trying to grab the first image from an HTML string like this:

...

  <table border="0" cellpadding="2" cellspacing="7" style="vertical-align:top;"><tr><td width="80" align="center" valign="top"><font style="font-size:85%;font-family:arial,sans-serif"><a href="http://news.google.com/news/url?sa=t&amp;fd=R&amp;ct2=us&amp;usg=AFQjCNFfn6RXQ3v898sGY_-sFLGCJ4EV5Q&amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;cid=52778551504048&amp;ei=zfK5U7D4JoLi1Ab0wIHwDw&amp;url=http://online.wsj.com/articles/obamas-letters-to-corinthian-1404684555"><img src="//t3.gstatic.com/images?q=tbn:ANd9GcQVyQsQJvKMgXHEX9riJuZKWav5U1nI-jdB-i1HwFYQ-7jGvGrbk9N_k0XEDMVH-HAbLxP1wrU" alt="" border="1" width="80" height="80" /><br /><font size="-2">Wall Street Journal</font></a></font></td><td valign="top" class="j"><font style="font-size:85%;font-family:arial,sans-serif"><br /><div style="padding-top:0.8em;"><img alt="" height="1" width="1" /></div><div class="lh"><a href="http://news.google.com/news/url?sa=t&amp;fd=R&amp;ct2=us&amp;usg=AFQjCNFfn6RXQ3v898sGY_-sFLGCJ4EV5Q&amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;cid=52778551504048&amp;ei=zfK5U7D4JoLi1Ab0wIHwDw&amp;url=http://online.wsj.com/articles/obamas-letters-to-corinthian-1404684555"><b><b>Obama&#39;s</b> Letters to Corinthian</b></a><br /><font size="-1"><b><font color="#6f6f6f">Wall Street Journal</font></b></font><br /><font size="-1">The <b>Obama</b> Administration has targeted for-profit colleges as if they are enemy combatants. And now it has succeeded in putting out of business Santa Ana-based Corinthian Colleges for a dilatory response to document requests. Does the White House plan&nbsp;...</font><br /><font size="-1" class="p"></font><br /><font class="p" size="-1"><a class="p" href="http://news.google.com/news/more?ncl=dPkBozywrsIXKoM&amp;authuser=0&amp;ned=us"><nobr><b>and more&nbsp;&raquo;</b></nobr></a></font></div></font></td></tr></table>

This is the image's tag:

<img src="//t3.gstatic.com/images?q=tbn:ANd9GcQVyQsQJvKMgXHEX9riJuZKWav5U1nI-jdB-i1HwFYQ-7jGvGrbk9N_k0XEDMVH-HAbLxP1wrU" alt="" border="1" width="80" height="80">

Every image has this kind of URL: //tx.gstatic.com, where x is a number, I think between 0 and 3.

This is what I've been trying, without success, and I don't understand why it fails.

      var re = /<img[^>]+src="?([^"\s]+)"?\s*\/>/g;
      var results = re.exec(HTMLSTRING);
      var img="";
      if(results!=null && results.length!=0) img = results[0];

- Usi Usi

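Why is what happening? Please explain exactly what the problem is. - Amadan

results[0] is empty; I think the regex is not valid. - Usi Usi

You're trying to match all the way to the closing />, but after the end of the src value you only allow whitespace characters. It looks like you don't actually need to match all the way to the closing >. - cookie monster

And, by the way, if results isn't null, its .length won't be 0, though if you want the src value it seems you need index [1]. Also, your approach depends on the order of the attributes, on the tag name being lowercase, and on double quotes being used rather than single quotes. Just pointing that out. - cookie monster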

https://dev59.com/X3I-5IYBdhLWcg3wq6do - rgajrawala

2 Answers

0

You can use the jQuery NPM module and do the following:

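var JSDOM = require('jsdom').JSDOM;
// In Node, jQuery 3.x exports a factory that needs a window object (here supplied by jsdom).
var jQuery = require('jquery')(new JSDOM('').window);

try {
    // Wrap the string in a <div> so that find() also sees a top-level <img>.
    var src = jQuery('<div>').html('YOUR_HTML_STRING').find('img').first().attr('src');
    // Capture group 1 holds the server digit from //t0 ... //t3.
    console.log('Output:\nSrc: ' + src + '\nNum: ' + src.match(/\/\/t([0-3])/)[1]);
} catch (e) {
    console.log('Could not find <img>!');
}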

- rgajrawala

- Amadan · Accepted Answer

The regular expression you provided is indeed not general enough to capture your <img> tag.

There are two options:

Make a better regular expression. This way lies madness. But in this case, it is sufficient to add the possibility of other attributes after src:
```
var re = /<img[^>]+src="?([^"\s]+)"?[^>]*\/>/g;
var results = re.exec(HTMLSTRING);
var img="";
if(results) img = results[1];
```
Note [^>]* replacing your \s*, and also note results[1] instead of results[0] if you want the source and not the tag itself.
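
To make the difference concrete, here is a small sketch that runs both patterns against a shortened stand-in for the tag in the question. The URL and the surrounding <a> element are made up for illustration, and the g flag is left out because only the first match is needed.

```javascript
// Shortened stand-in for the <img> tag in the question (the URL is made up).
var html = '<a href="#"><img src="//t3.gstatic.com/images?q=example" alt="" border="1" width="80" height="80" /></a>';

// Original pattern: only whitespace may sit between the src value and "/>".
var original = /<img[^>]+src="?([^"\s]+)"?\s*\/>/;
// Adjusted pattern: any further attributes may follow the src value.
var adjusted = /<img[^>]+src="?([^"\s]+)"?[^>]*\/>/;

var originalMatch = original.exec(html);
var adjustedMatch = adjusted.exec(html);

console.log(originalMatch);    // null, because alt="" follows the src value
console.log(adjustedMatch[1]); // capture group 1 is the src value
```

Both patterns still require the tag to end in />, so a tag written as <img ... > without the slash would not match either of them.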

Use a DOM parser to handle DOM. This is the easy path.

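var JSDOM = require("jsdom").JSDOM;
// jsdom.env() was removed in jsdom 10; the JSDOM constructor parses the string synchronously.
var window = new JSDOM(HTMLSTRING).window;
var imgs = window.document.getElementsByTagName('img');
for (var i = 0; i < imgs.length; i++) {
  // getAttribute returns the raw attribute value, or null when src is absent.
  var src = imgs[i].getAttribute('src');
  if (src) console.log(src);
}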
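
If pulling in a DOM parser is not possible, the regex from the first option can be loosened a little further. The following is only a sketch: firstImgSrc is a hypothetical helper name, not something from either answer. It accepts a src in any attribute position, with double quotes, single quotes, or no quotes, ignores the case of the tag name, and does not require a closing slash. It still breaks on a > inside an attribute value or on an <img> inside an HTML comment, which is why the DOM parser remains the safer route.

```javascript
// Hypothetical helper: returns the src of the first <img> that has one, or null.
function firstImgSrc(html) {
  // "<img", then any attributes (never crossing ">"), then whitespace and src=,
  // then a double-quoted, single-quoted, or unquoted value.
  var re = /<img\b[^>]*?\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/i;
  var m = re.exec(html);
  if (!m) return null;
  // Exactly one of the three capture groups participates in a match.
  if (m[1] !== undefined) return m[1];
  if (m[2] !== undefined) return m[2];
  return m[3];
}

// The first <img> has no src, so the second one is reported.
var sample = '<img alt="" height="1" width="1" />' +
             '<IMG border="1" SRC=\'//t3.gstatic.com/images?q=example\' width="80">';
console.log(firstImgSrc(sample));
```

Because [^>]*? cannot cross a >, a tag without a src attribute is skipped and the search moves on to the next <img>.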