Get "property og" meta tags from a URL using PHP

8
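I want to build a posting feature like the one Facebook uses (you paste a link into a text box, click the post button, and it publishes the title, description, and image). I realized it is best to extract the meta tags that have an og property, such as "og:title" and "og:image", because the ordinary tags sometimes contain line breaks and other problems that cause errors.
Is there a way to get the content of these tags with PHP, without using AJAX or some other custom parser? The starting point would be: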
<?php

$url = $_POST['link'];

?>

We get the URL from the previous page via POST, but what do we do with it next?

5 Answers

9
The solution is as follows:
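libxml_use_internal_errors(true); // collect warnings from malformed markup instead of printing them
$c = file_get_contents("http://url/here"); // returns false if the fetch fails
$d = new DOMDocument();
$d->loadHTML($c);
$xp = new DOMXPath($d);
// Print the content attribute of every og:title tag
foreach ($xp->query("//meta[@property='og:title']") as $el) {
    echo $el->getAttribute("content");
}
// Print the content attribute of every og:description tag
foreach ($xp->query("//meta[@property='og:description']") as $el) {
    echo $el->getAttribute("content");
}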

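This is XPath, so do it in one go instead of repeating yourself: //meta[@property='og:title' or @property='og:description']/@content - hakre
If the document is invalid this will throw an exception. I would use a simple regular expression to get it rather than parsing the whole document. - Rosmarine Popcorn
This worked perfectly, exactly what I was looking for! - user3025039
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jon Winstanley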
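hakre's comment above suggests a single XPath query instead of two loops. A minimal sketch of that idea follows; the sample markup is hypothetical and stands in for a fetched page, and the variable names are illustrative only.

```php
<?php
// Hypothetical markup standing in for a fetched page.
$html = <<<'EOT'
<html><head>
<meta property="og:title" content="Example title">
<meta property="og:description" content="Example description">
<meta name="description" content="Not an Open Graph tag">
</head><body></body></html>
EOT;

libxml_use_internal_errors(true); // collect markup warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// One query selects the content attribute of both tags, in document order.
$query = "//meta[@property='og:title' or @property='og:description']/@content";
$values = [];
foreach ($xpath->query($query) as $attr) {
    $values[] = $attr->nodeValue; // each result is an attribute node
}
print_r($values);
```

Because the query returns attribute nodes rather than elements, the values are read with nodeValue and no getAttribute() call is needed.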

5

Use code similar to the following:

libxml_use_internal_errors(true); // Yeah if you are so worried about using @ with warnings
// $html must already hold the page markup as a string
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta[starts-with(@property, \'og:\')]';
$metas = $xpath->query($query);
$rmetas = []; // initialise so the variable exists even when no og: tags are found
foreach ($metas as $meta) {
    $property = $meta->getAttribute('property');
    $content = $meta->getAttribute('content');
    $rmetas[$property] = $content;
}
var_dump($rmetas);

I found this under "How to get a web page's Open Graph protocol with PHP". Searching helps, and so does Google. http://www.google.co.uk/search?q=meta+property+og+tags

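Where should I put the URL variable? In $html or in $doc? - Jakov
I think $html ;) $doc is the object-oriented part. - MrJ
It keeps giving me "Undefined variable: rmetas in C:\xampp\htdocs\linkedit\index.php on line 73 NULL". Line 73 is var_dump($rmetas). - Jakov
Never mind, I found it. The best approach is similar to the one you posted and can be found in Artefacto's post (the 4th answer) at this [link](http://stackoverflow.com/questions/2273555/codeigniter-a-class-library-to-help-get-meta-tags-from-a-web-page). The only catch is that you have to change name='keywords' to property='og:title', and then it works perfectly. - Jakov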
1
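In case anyone else comes across your question, feel free to click the accept answer button or post your solution as an answer for future reference :) - MrJ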
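On the question in the comments about where the URL goes: the markup has to be downloaded into a string first, and that string is what loadHTML() receives. The sketch below wraps the answer's approach in a function. The name extractOgTags and the sample markup are hypothetical, and the array is initialised up front, so a page without og: tags gives an empty array rather than an undefined variable.

```php
<?php
/**
 * Collect every <meta property="og:*"> tag from an HTML string.
 * extractOgTags is a hypothetical helper name, not part of any library.
 */
function extractOgTags(string $html): array
{
    $tags = []; // defined even when nothing matches
    if ($html === '') {
        return $tags; // loadHTML() rejects an empty string
    }
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query("//meta[starts-with(@property, 'og:')]") as $meta) {
        $tags[$meta->getAttribute('property')] = $meta->getAttribute('content');
    }
    return $tags;
}

// Sample markup standing in for a downloaded page. In practice $html could come
// from file_get_contents($url), which returns false when the fetch fails.
$html = '<html><head>'
    . '<meta property="og:title" content="Example title">'
    . '<meta property="og:image" content="http://example.com/a.jpg">'
    . '</head><body></body></html>';

var_dump(extractOgTags($html));
var_dump(extractOgTags('<html><head><title>No tags</title></head></html>'));
```

If a property appears more than once on a page, this sketch keeps only the last value, since later tags overwrite earlier keys.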

3

This doesn't always work. It returned no tags for my Twitter page. - oknate

1
We use Apache Tika (a command-line utility) from PHP, with the -j option for JSON conversion:

http://tika.apache.org/

<?php
    // shell_exec() returns the command's output, so keep it in a variable
    $json = shell_exec( 'java -jar tika-app-1.4.jar -j http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying' );
    echo $json;
?>

This is sample output for a random Guardian article:
{
   "Content-Encoding":"UTF-8",
   "Content-Length":205599,
   "Content-Type":"text/html; charset\u003dUTF-8",
   "DC.date.issued":"2013-07-21",
   "X-UA-Compatible":"IE\u003dEdge,chrome\u003d1",
   "application-name":"The Guardian",
   "article:author":"http://www.guardian.co.uk/profile/nicholaswatt",
   "article:modified_time":"2013-07-21T22:42:21+01:00",
   "article:published_time":"2013-07-21T22:00:03+01:00",
   "article:section":"Politics",
   "article:tag":[
      "Lynton Crosby",
      "Health policy",
      "NHS",
      "Health",
      "Healthcare industry",
      "Society",
      "Public services policy",
      "Lobbying",
      "Conservatives",
      "David Cameron",
      "Politics",
      "UK news",
      "Business"
   ],
   "content-id":"/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
   "dc:title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
   "description":"Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
   "fb:app_id":180444840287,
   "keywords":"Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
   "msapplication-TileColor":"#004983",
   "msapplication-TileImage":"http://static.guim.co.uk/static/a314d63c616d4a06f5ec28ab4fa878a11a692a2a/common/images/favicons/windows_tile_144_b.png",
   "news_keywords":"Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
   "og:description":"Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
   "og:image":"https://static-secure.guim.co.uk/sys-images/Guardian/Pix/pixies/2013/7/21/1374433351329/Lynton-Crosby-008.jpg",
   "og:site_name":"the Guardian",
   "og:title":"Tory strategist Lynton Crosby in new lobbying row",
   "og:type":"article",
   "og:url":"http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
   "resourceName":"tory-strategist-lynton-crosby-lobbying",
   "title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
   "twitter:app:id:googleplay":"com.guardian",
   "twitter:app:id:iphone":409128287,
   "twitter:app:name:googleplay":"The Guardian",
   "twitter:app:name:iphone":"The Guardian",
   "twitter:app:url:googleplay":"guardian://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
   "twitter:card":"summary_large_image",
   "twitter:site":"@guardian"
}
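Once the JSON is available as a string in PHP, the og: entries can be picked out with json_decode(). This is a minimal sketch using a trimmed copy of the sample output above; the variable names are illustrative, and in practice the string would be whatever shell_exec() returned.

```php
<?php
// Trimmed copy of the sample output; a real run would use the shell_exec() result.
$json = '{"og:title":"Tory strategist Lynton Crosby in new lobbying row",'
      . '"og:type":"article",'
      . '"title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",'
      . '"twitter:card":"summary_large_image"}';

$meta = json_decode($json, true); // true gives an associative array
$og = [];
if (is_array($meta)) { // json_decode() returns null on invalid input
    foreach ($meta as $key => $value) {
        if (strpos($key, 'og:') === 0) { // keep only keys starting with "og:"
            $og[$key] = $value;
        }
    }
}
print_r($og);
```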

0

Try this.. it works for me..

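// $linkHtml is presumably a document object from an HTML parser that supports
// CSS-style selectors through find(); the answer does not show how it is created
foreach($linkHtml->find('head meta[property=og:url]') as $url)
{
    echo $url->content.'<br>';
}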
