Get the title and meta tags from an external website

68
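I'm trying to figure out how to fetch the
<title>A common title</title>
<meta name="keywords" content="Keywords blabla" />
<meta name="description" content="This is the description" />
even when they are arranged in any order. I've heard of the PHP Simple HTML DOM Parser, but I'd rather not use it. Is there a solution other than the PHP Simple HTML DOM Parser?
Will preg_match fail to do the job if the HTML is invalid?
Can cURL do something like this together with preg_match?
Facebook does something similar, but it relies on pages properly using:
<meta property="og:description" content="Description blabla" />

I'd like a feature where, when someone posts a link, it fetches that link's title and meta tags. If there are no meta tags, it should ignore them or let the user set them (but I'll handle that myself later).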

22 Answers

3
We use Apache Tika from PHP (through its command-line utility), with the -j option for JSON output:

http://tika.apache.org/

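<?php
    // Run Tika's command-line app; -j prints the document metadata as JSON.
    $url  = 'http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying';
    $json = shell_exec( 'java -jar tika-app-1.4.jar -j ' . escapeshellarg( $url ) );
    $meta = json_decode( $json, true ); // associative array of the metadata
?>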

Here is the sample output for a random Guardian article:
{
   "Content-Encoding":"UTF-8",
   "Content-Length":205599,
   "Content-Type":"text/html; charset\u003dUTF-8",
   "DC.date.issued":"2013-07-21",
   "X-UA-Compatible":"IE\u003dEdge,chrome\u003d1",
   "application-name":"The Guardian",
   "article:author":"http://www.guardian.co.uk/profile/nicholaswatt",
   "article:modified_time":"2013-07-21T22:42:21+01:00",
   "article:published_time":"2013-07-21T22:00:03+01:00",
   "article:section":"Politics",
   "article:tag":[
      "Lynton Crosby",
      "Health policy",
      "NHS",
      "Health",
      "Healthcare industry",
      "Society",
      "Public services policy",
      "Lobbying",
      "Conservatives",
      "David Cameron",
      "Politics",
      "UK news",
      "Business"
   ],
   "content-id":"/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
   "dc:title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
   "description":"Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
   "fb:app_id":180444840287,
   "keywords":"Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
   "msapplication-TileColor":"#004983",
   "msapplication-TileImage":"http://static.guim.co.uk/static/a314d63c616d4a06f5ec28ab4fa878a11a692a2a/common/images/favicons/windows_tile_144_b.png",
   "news_keywords":"Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
   "og:description":"Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
   "og:image":"https://static-secure.guim.co.uk/sys-images/Guardian/Pix/pixies/2013/7/21/1374433351329/Lynton-Crosby-008.jpg",
   "og:site_name":"the Guardian",
   "og:title":"Tory strategist Lynton Crosby in new lobbying row",
   "og:type":"article",
   "og:url":"http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
   "resourceName":"tory-strategist-lynton-crosby-lobbying",
   "title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
   "twitter:app:id:googleplay":"com.guardian",
   "twitter:app:id:iphone":409128287,
   "twitter:app:name:googleplay":"The Guardian",
   "twitter:app:name:iphone":"The Guardian",
   "twitter:app:url:googleplay":"guardian://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
   "twitter:card":"summary_large_image",
   "twitter:site":"@guardian"
}
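
Because Tika returns plain JSON, picking out the fields the question asks about comes down to decoding it. A minimal sketch, assuming the key names shown in the sample output above; the function name pick_page_meta and the fallback order are illustrative choices and not part of Tika:

```php
<?php
// Hypothetical helper: reduce Tika's JSON metadata to title/description/keywords.
function pick_page_meta(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data)) {
        // Not valid JSON (for example, Java is missing and the command failed).
        return ['title' => null, 'description' => null, 'keywords' => null];
    }

    // Return the value of the first key that is present, or null when none is.
    $first = function (array $keys) use ($data) {
        foreach ($keys as $key) {
            if (isset($data[$key]) && $data[$key] !== '') {
                return $data[$key];
            }
        }
        return null;
    };

    return [
        'title'       => $first(['og:title', 'title', 'dc:title']),
        'description' => $first(['og:description', 'description']),
        'keywords'    => $first(['keywords', 'news_keywords']),
    ];
}

// Shortened copy of the sample output above.
$sample = '{"og:title":"Tory strategist Lynton Crosby in new lobbying row",'
        . '"title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",'
        . '"og:type":"article"}';
print_r(pick_page_meta($sample));
```

Preferring the og: keys is only one possible policy; swap the order if the plain title and description should win.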

Shared or basic hosting plans don't support running Java on their servers :) - Ravinder Payal

2

2

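Get meta tags from a URL, PHP function example:

// Renamed from get_meta_tags: PHP already has a built-in function with that
// name, and redeclaring it is a fatal error.
function fetch_page_meta ($url){
         // load_content() is this answer's own helper (not shown here); it is expected
         // to return an array with "content" (the HTML) and "msg" (a status message).
         $html = load_content ($url,false,"");
         // Non-greedy, case-insensitive patterns; "s" lets the title span several lines.
         preg_match ('/<title[^>]*>(.*?)<\/title>/is', $html["content"], $title);
         preg_match ('/<meta name="description" content="(.*?)"\s*\/?>/i', $html["content"], $description);
         preg_match ('/<meta name="keywords" content="(.*?)"\s*\/?>/i', $html["content"], $keywords);
         $res["content"] = array(
             "title"       => isset($title[1]) ? $title[1] : null,
             "description" => isset($description[1]) ? $description[1] : null,
             "keywords"    => isset($keywords[1]) ? $keywords[1] : null,
         );
         $res["msg"] = $html["msg"];
         return $res;
}

Example:

print_r (fetch_page_meta ("bing.com") );

Getting meta tags in PHP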


1
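Don't forget that the name and content attributes may appear in a different order. - MacMac
Also don't forget that single quotes can be used in place of double quotes. - SlickRemix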
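
Both comments can be addressed without a DOM parser by matching each meta tag first and then reading its attributes separately, so attribute order and quote style stop mattering. A minimal sketch; the function name extract_meta_regex is hypothetical, and a regex like this is still fragile on badly broken HTML (for example a ">" inside an attribute value):

```php
<?php
// Hypothetical helper: regex-based extraction that tolerates any attribute
// order and both single- and double-quoted values.
function extract_meta_regex(string $html): array
{
    $result = ['title' => null, 'meta' => []];

    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        $result['title'] = trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
    }

    // Step 1: find every <meta ...> tag.
    preg_match_all('/<meta\b[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        // Step 2: collect its attributes; \2 matches the same quote that opened the value.
        preg_match_all('/([\w:-]+)\s*=\s*(["\'])(.*?)\2/s', $tag, $attrs, PREG_SET_ORDER);
        $pairs = [];
        foreach ($attrs as $attr) {
            $pairs[strtolower($attr[1])] = html_entity_decode($attr[3], ENT_QUOTES, 'UTF-8');
        }
        // Step 3: accept name="..." as well as Open Graph's property="...".
        $key = $pairs['name'] ?? $pairs['property'] ?? null;
        if ($key !== null && isset($pairs['content'])) {
            $result['meta'][strtolower($key)] = $pairs['content'];
        }
    }
    return $result;
}

// The tags from the question, with the order and the quotes shuffled.
$html = "<title>A common title</title>\n"
      . "<meta content='This is the description' name=\"description\" />\n"
      . "<meta name=\"keywords\" content=\"Keywords blabla\" />\n"
      . "<meta property=\"og:description\" content=\"Description blabla\" />";
print_r(extract_meta_regex($html));
```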

1
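Nowadays most websites add meta tags that provide information about the site or about a specific article page, as news and blog sites do.
I created a Meta API that returns the metadata you need, such as OpenGraph, Schema.Org, and so on.
Check it out - https://api.sakiv.com/docs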

1
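If you are using PHP, look through the PEAR packages on pear.php.net and see whether any of them is useful to you. I have used the RSS package effectively, and it saves a lot of time, provided you can work out how to implement the code from their examples.
Specifically, take a look at Sax 3 and see whether it fits your needs. Sax 3 is no longer updated, but it may be sufficient.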

1

As already stated, this does the job:

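$url='https://dev59.com/iHA65IYBdhLWcg3wogOe#4640613';
// The built-in get_meta_tags() only reads <meta name="..." content="..."> tags,
// so $meta['title'] is set only if the page has a <meta name="title"> tag.
$meta=get_meta_tags($url);
echo $title=$meta['title'];

//php - Get Title and Meta Tags of External site - Stack Overflow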
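
PHP's built-in get_meta_tags() accepts a local file as well as a URL, which makes its behavior easy to check. A small sketch using a made-up page written to a temporary file; it illustrates that the returned keys are lower-cased and that only tags with a name attribute are returned, so neither the title element nor property-based Open Graph tags show up:

```php
<?php
// Made-up sample page, written to a temporary file for illustration.
$html = <<<HTML
<html><head>
<title>A common title</title>
<meta name="Keywords" content="Keywords blabla" />
<meta name="description" content="This is the description" />
<meta property="og:description" content="Description blabla" />
</head><body></body></html>
HTML;

$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($file, $html);

// get_meta_tags() parses <meta name="..." content="..."> tags up to </head>.
$tags = get_meta_tags($file);
unlink($file);

print_r($tags);
```

For the title, or for og: tags, one of the DOM- or regex-based approaches in the other answers is still needed.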

1

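I got this working with a different approach and wanted to share it. It takes less code than the other answers and can be found here. I added a few things so that it automatically loads the meta information of the page you are on rather than some fixed page. I wanted this to copy the default page title and description into the og tags automatically.

For some reason, though, whatever I tried (different scripts), the page loaded very slowly online but displayed instantly on WAMP. I'm not sure why, so I'll probably go with a switch case, since the site isn't very large.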

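<?php
$url = 'http://sitename.com'.$_SERVER['REQUEST_URI'];

// Read the whole page in one call instead of looping over fgets().
$content = @file_get_contents($url);
if ($content === false) {
    $content = ''; // the request failed; fall back to an empty page
}

// Case-insensitive, non-greedy match; "s" lets the title span several lines.
preg_match('/<title[^>]*>(.*?)<\/title>/is', $content, $match);
$title = isset($match[1]) ? trim($match[1]) : '';

// get_meta_tags() returns the meta names as lower-cased array keys.
$metatagarray = get_meta_tags($url);
$description = isset($metatagarray["description"]) ? $metatagarray["description"] : '';

// Escape both values once so they are safe to print into HTML and attributes.
$title = htmlspecialchars($title, ENT_QUOTES);
$description = htmlspecialchars($description, ENT_QUOTES);

echo "<div><strong>Title:</strong> $title</div>";
echo "<div><strong>Description:</strong> $description</div>";
?>

In the HTML head:
<meta property="og:title" content="<?php echo $title; ?>" />
<meta property="og:description" content="<?php echo $description; ?>" />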
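
The switch-case fallback mentioned in this answer needs no HTTP request to your own site at all. A minimal sketch, assuming a small site; the function name page_meta_for and all paths, titles, and descriptions are made-up placeholders:

```php
<?php
// Hypothetical lookup: map the requested path to its title and description.
function page_meta_for(string $requestUri): array
{
    // Ignore the query string so /about?ref=x still matches /about.
    $path = parse_url($requestUri, PHP_URL_PATH) ?: '/';

    switch ($path) {
        case '/':
            return ['title' => 'Home', 'description' => 'Welcome to the site.'];
        case '/about':
            return ['title' => 'About', 'description' => 'Who we are.'];
        default:
            return ['title' => 'Site name', 'description' => 'Default description.'];
    }
}

$meta = page_meta_for($_SERVER['REQUEST_URI'] ?? '/');

// Escape once so the values are safe inside the og: meta attributes.
$title = htmlspecialchars($meta['title'], ENT_QUOTES, 'UTF-8');
$description = htmlspecialchars($meta['description'], ENT_QUOTES, 'UTF-8');
```

The resulting $title and $description can be echoed into the og:title and og:description tags in the same way as before.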

1
<?php 

// ------------------------------------------------------ 

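function curl_get_contents($url) {

    $timeout = 5; 
    $useragent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0'; 

    $ch = curl_init(); 
    curl_setopt($ch, CURLOPT_URL, $url); 
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent); 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);       // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);       // follow HTTP redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);   // limit for connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout * 2);      // limit for the whole transfer
    $data = curl_exec($ch);                               // false on failure
    curl_close($ch); 

    return $data; 
}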

// ------------------------------------------------------ 

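function fetch_meta_tags($url) { 

    $html = curl_get_contents($url); 
    $mdata = array(); 
    $title = null;

    // Download failed or the page is empty: nothing to parse.
    if ($html === false || $html === '') {
        return array($url, $title, $mdata);
    }

    // Real-world HTML is often invalid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // A page may have no <title> at all, so check before reading it.
    $titlenode = $doc->getElementsByTagName('title'); 
    if ($titlenode->length > 0) {
        $title = trim($titlenode->item(0)->nodeValue);
    }

    $metanodes = $doc->getElementsByTagName('meta'); 
    foreach($metanodes as $node) { 
        $key = $node->getAttribute('name'); 
        $val = $node->getAttribute('content'); 
        if (!empty($key)) { $mdata[$key] = $val; } 
    }

    $res = array($url, $title, $mdata); 

    return $res;
}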

// ------------------------------------------------------ 

?>
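
The loop in fetch_meta_tags() keys on the name attribute only, so Open Graph tags such as the og:description example in the question, which use property instead, are skipped. A sketch of one way to collect both kinds from an HTML string with DOMXPath; the function name parse_meta_from_html is hypothetical:

```php
<?php
// Hypothetical helper: collect <meta> tags keyed by their name or property.
function parse_meta_from_html(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $loaded = $html !== '' && $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $meta = [];
    if (!$loaded) {
        return $meta;
    }

    $xpath = new DOMXPath($doc);
    // Select every <meta> that has content plus either a name or a property.
    foreach ($xpath->query('//meta[@content][@name or @property]') as $node) {
        $key = $node->getAttribute('name') ?: $node->getAttribute('property');
        $meta[strtolower($key)] = $node->getAttribute('content');
    }
    return $meta;
}

$html = '<html><head>'
      . '<meta name="description" content="This is the description" />'
      . '<meta property="og:description" content="Description blabla" />'
      . '</head><body></body></html>';
print_r(parse_meta_from_html($html));
```

Because it works on a string, the HTML returned by curl_get_contents() can be passed in directly.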

1

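An improvement on @shamittomar's answer that gets the meta tags (or one specified meta tag) from HTML source.

It can be improved further... The difference from PHP's default get_meta_tags is that it also works when the string contains Unicode.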

function getMetaTags($html, $name = null)
{
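    $doc = new DOMDocument();
    // loadHTML() reports invalid markup through warnings, not exceptions,
    // so they are silenced with @ and the boolean result is checked instead.
    if ($html === null || $html === '' || !@$doc->loadHTML($html)) {
        return !empty($name) ? false : [];
    }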

    $metas = $doc->getElementsByTagName('meta');

    $data = [];
    for ($i = 0; $i < $metas->length; $i++)
    {
        $meta = $metas->item($i);

        if (!empty($meta->getAttribute('name'))) {
            // will ignore repeating meta tags !!
            $data[$meta->getAttribute('name')] = $meta->getAttribute('content');
        }
    }

    if (!empty($name)) {
        return !empty($data[$name]) ? $data[$name] : false;
    }

    return $data;
}
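
On the Unicode point: when the markup does not declare a charset, DOMDocument::loadHTML() may treat the input as ISO-8859-1, which garbles UTF-8 text. One common workaround, sketched here on the assumption that the input is UTF-8 and the mbstring extension is available, is to convert non-ASCII characters to numeric entities before parsing. The helper name load_utf8_html is hypothetical:

```php
<?php
// Hypothetical helper: parse UTF-8 HTML without relying on charset detection.
function load_utf8_html(string $html): DOMDocument
{
    // Turn every character above ASCII into a numeric entity (e.g. &#26085;),
    // so the parser sees plain ASCII and decodes the entities itself.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$doc = load_utf8_html('<html><head><meta name="description" content="日本語の説明"></head></html>');
$content = $doc->getElementsByTagName('meta')->item(0)->getAttribute('content');
echo $content, "\n";
```

The resulting document can then be walked in the same way as in getMetaTags().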

1

I created this small Composer package based on the top answer: https://github.com/diversen/get-meta-tags

composer require diversen/get-meta-tags

Then:

use diversen\meta;

$m = new meta();

// Simple usage, gets title, description, and keywords by default
$ary = $m->getMeta('https://github.com/diversen/get-meta-tags');
print_r($ary);

// With more params
$ary = $m->getMeta('https://github.com/diversen/get-meta-tags', array ('description' ,'keywords'), $timeout = 10);
print_r($ary);

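It requires cURL and DOMDocument, like the top answer, and is built the same way, but it has an option for setting the cURL timeout (and for fetching all kinds of meta tags).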


1
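By the way: starting class names with a capital letter is good (and common) programming practice, so class meta .. should be class Meta ..... You may also want to follow the widely adopted patterns here. - Marcin Orlowski
@MarcinOrlowski You are right. Best practices and all that. Bad habits. :) - dennis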
