Getting title and meta tags from external website

Question

68

I am trying to figure out how to get the following:

<title>A common title</title>
<meta name="keywords" content="Keywords blabla" />
<meta name="description" content="This is the description" />

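The tags may appear in any order. I have heard of the PHP Simple HTML DOM Parser, but I don't want to use it. Is there any solution other than the PHP Simple HTML DOM Parser?

Would preg_match fail to do the job if the HTML is invalid?

Can cURL be combined with preg_match to do something like this?

Facebook does something similar, but it does so properly by using: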

<meta property="og:description" content="Description blabla" />

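I want a feature where, when someone posts a link, the title and meta tags of that link are retrieved. If there are no meta tags, they are ignored or the user can set them manually (but I will handle that part myself later).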

- MacMac

22 Answers

42

<?php
// Assuming the above tags are at www.example.com
$tags = get_meta_tags('http://www.example.com/');

// Notice how the keys are all lowercase now, and
// how . was replaced by _ in the key.
echo $tags['author'];       // name
echo $tags['keywords'];     // php documentation
echo $tags['description'];  // a php manual
echo $tags['geo_position']; // 49.33;-86.59
?>

- Bob Jeey

5

This does not provide the page title, though. - Steel Brain

3

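It is a function for meta tags, so how could it return something unrelated to them? - Nishant Ghodke

Hi everyone, any idea why this example doesn't work with public Facebook fan pages? I can't read the meta tags from any Facebook page. For example: https://www.facebook.com/MeineNanny When I check it with https://metatags.io/ I can see the meta description, but when I try to read it with PHP I can't get it! Any ideas? - Mitch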

11

get_meta_tags() will help you with all the meta tag information except the title. To get the title, just use a regular expression.

$url = 'http://some.url.com';
preg_match("/<title>(.+)<\/title>/siU", file_get_contents($url), $matches);
$title = $matches[1];

Hope that helps.

- Lloyd Moore

8

get_meta_tags() cannot retrieve the title.

Only meta tags that have a name attribute, such as

<meta name="description" content="the description">

will be parsed.

- Harald
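
The point above can be illustrated with a minimal sketch. It assumes a hypothetical sample page written to a temporary file (get_meta_tags() accepts a local filename as well as a URL), so no network access is needed. The tag contents are invented for the example.

```php
<?php
// Hypothetical sample page: one name-based tag and one property-based (Open Graph) tag.
$html = <<<HTML
<html>
<head>
<title>A common title</title>
<meta name="description" content="the description">
<meta property="og:description" content="OG description">
</head>
<body></body>
</html>
HTML;

// get_meta_tags() reads from a file or URL, so write the sample to a temp file first.
$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($file, $html);

$tags = get_meta_tags($file);
unlink($file);

// Only the tag with a name attribute is returned; the <title> element and the
// property-based tag are not part of the result.
print_r($tags);
```

For property-based tags such as og:description, a DOM-based approach like the ones in the other answers is needed.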

6

PHP's native function: get_meta_tags()

http://php.net/manual/zh/function.get-meta-tags.php

This function reads the meta tag information from the HTML at the given URL and returns it as an array.

- Addo Solutions

6

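Shouldn't we be using OG?

The chosen answer is good, but it doesn't work when the site is redirected (very common!), and it doesn't return OG tags, which are the new industry standard. Here is a little function that is more usable in 2018. It collects the OG tags along with the standard description and keywords meta tags: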

function getSiteOG( $url, $specificTags=0 ){
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($url));
    $res['title'] = $doc->getElementsByTagName('title')->item(0)->nodeValue;

    foreach ($doc->getElementsByTagName('meta') as $m){
        $tag = $m->getAttribute('name') ?: $m->getAttribute('property');
        if(in_array($tag,['description','keywords']) || strpos($tag,'og:')===0) $res[str_replace('og:','',$tag)] = $m->getAttribute('content');
    }
    return $specificTags? array_intersect_key( $res, array_flip($specificTags) ) : $res;
}

How to use it:

/////////////
//SAMPLE USAGE:
print_r(getSiteOG("http://www.stackoverflow.com")); //note the incorrect url

/////////////
//OUTPUT:
Array
(
    [title] => Stack Overflow - Where Developers Learn, Share, & Build Careers
    [description] => Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers.
    [type] => website
    [url] => https://stackoverflow.com/
    [site_name] => Stack Overflow
    [image] => https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded
)

- cronoklee

5

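Unfortunately, the built-in PHP function get_meta_tags() requires the name attribute, and some sites, such as Twitter, leave it off in favor of the property attribute. This function uses a mix of regex and DOMDocument to return a keyed array of meta tags from a web page. It checks for the name attribute first, then the property attribute. It has been tested on Instagram, Pinterest and Twitter.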

/**
 * Extract metatags from a webpage
 */
function extract_tags_from_url($url) {
  $tags = array();

  $ch = curl_init();
  curl_setopt($ch, CURLOPT_HEADER, 0);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

  $contents = curl_exec($ch);
  curl_close($ch);

  if (empty($contents)) {
    return $tags;
  }

  if (preg_match_all('/<meta([^>]+)content="([^>]+)>/', $contents, $matches)) {
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . implode($matches[0]));
    $tags = array();
    foreach($doc->getElementsByTagName('meta') as $metaTag) {
      if($metaTag->getAttribute('name') != "") {
        $tags[$metaTag->getAttribute('name')] = $metaTag->getAttribute('content');
      }
      elseif ($metaTag->getAttribute('property') != "") {
        $tags[$metaTag->getAttribute('property')] = $metaTag->getAttribute('content');
      }
    }
  }

  return $tags;
}

- oknate

5

A simple function that shows how to retrieve the og: tags, the title and the description. Adapt it to your own needs.

function read_og_tags_as_json($url){


    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $HTML_DOCUMENT = curl_exec($ch);
    curl_close($ch);

    $doc = new DOMDocument();
    $doc->loadHTML($HTML_DOCUMENT);

    // fetch <title>
    $res['title'] = $doc->getElementsByTagName('title')->item(0)->nodeValue;

    // fetch og:tags
    foreach( $doc->getElementsByTagName('meta') as $m ){

          // if had property
          if( $m->getAttribute('property') ){

              $prop = $m->getAttribute('property');

              // here search only og:tags
              if( preg_match("/og:/i", $prop) ){

                  // get results on an array -> nice for templating
                  $res['og_tags'][] =
                  array( 'property' => $m->getAttribute('property'),
                          'content' => $m->getAttribute('content') );
              }

          }
          // end if had property

          // fetch <meta name="description" ... >
          if( $m->getAttribute('name') == 'description' ){

            $res['description'] = $m->getAttribute('content');

          }


    }
    // end foreach

    // render JSON
    echo json_encode($res, JSON_PRETTY_PRINT |
    JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);

}

Result for this page (there could be more information):

{
    "title": "php - Getting title and meta tags from external website - Stack Overflow",
    "og_tags": [
        {
            "property": "og:type",
            "content": "website"
        },
        {
            "property": "og:url",
            "content": "https://dev59.com/iHA65IYBdhLWcg3wogOe"
        },
        {
            "property": "og:site_name",
            "content": "Stack Overflow"
        },
        {
            "property": "og:image",
            "content": "https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon@2.png?v=73d79a89bded"
        },
        {
            "property": "og:title",
            "content": "Getting title and meta tags from external website"
        },
        {
            "property": "og:description",
            "content": "I want to try figure out how to get the\n\n&lt;title&gt;A common title&lt;/title&gt;\n&lt;meta name=\"keywords\" content=\"Keywords blabla\" /&gt;\n&lt;meta name=\"description\" content=\"This is the descript..."
        }
    ]
}

- SNS - Web et Informatique

1

To decode emojis when rendering the JSON, use: $doc->loadHTML(mb_convert_encoding($HTML_DOCUMENT, 'HTML-ENTITIES', 'UTF-8')); - brasofilo
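
Note that passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated as of PHP 8.2. A commonly used alternative, also seen in another answer in this thread, is to prepend an XML encoding declaration so that the parser treats the input as UTF-8. The following is a minimal sketch under that assumption. The sample markup is invented, and the behavior may depend on the libxml2 version in use.

```php
<?php
// Hypothetical UTF-8 input that carries no charset declaration of its own.
$html = '<html><head><title>Café déjà vu</title></head><body></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
// The prepended declaration hints to the parser that the bytes are UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

$title = $doc->getElementsByTagName('title')->item(0)->nodeValue;
echo json_encode(['title' => $title], JSON_UNESCAPED_UNICODE), PHP_EOL;
```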

4

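You are best off using a DOM parser; it is the right way to do it. In the long run it will save you more time than learning other approaches. Parsing HTML with regular expressions is unreliable and cannot cope with special cases.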

- Joshua

1

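+1, and if you just use the built-in DOM extension rather than the Simple HTML DOM Parser, it will probably be a lot faster and you won't clutter your code with a third-party library (although it does add the requirement that the DOM extension, which is enabled by default, is available in your server environment). - Wrikken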
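
The regex-versus-DOM point can be illustrated with a minimal sketch that uses only the built-in DOM extension. The input string is a hypothetical example: a valid title element that happens to carry an attribute, which the title regex from an earlier answer does not anticipate.

```php
<?php
// Hypothetical input: a valid <title> element with an attribute.
$html = '<html><head><title lang="en">A common title</title></head><body></body></html>';

// The regex expects a bare <title> tag, so it finds nothing here.
$regexFound = preg_match('/<title>(.+)<\/title>/siU', $html, $matches);

// The DOM extension parses the markup instead of pattern-matching it.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$node = $doc->getElementsByTagName('title')->item(0);
$domTitle = $node !== null ? trim($node->nodeValue) : null;

var_dump($regexFound); // number of regex matches
var_dump($domTitle);   // title as seen by the DOM parser
```

The regex could be loosened to allow attributes, but each such patch covers one more special case, whereas the parser handles them generally.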

3

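I put this solution together based on the posts by cronoklee and shamittomar so that I can call it from anywhere and get JSON back. The result can easily be parsed into anything.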

<?php
header('Content-type: application/json; charset=UTF-8');

if (!empty($_GET['url']))
{
    file_get_contents_curl($_GET['url']);
}
else
{
    echo "No Valid URL Provided.";
}


function file_get_contents_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    echo json_encode(getSiteOG($data), JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
}

function getSiteOG( $OGdata){
    $doc = new DOMDocument();
    @$doc->loadHTML($OGdata);
    $res['title'] = $doc->getElementsByTagName('title')->item(0)->nodeValue;

    foreach ($doc->getElementsByTagName('meta') as $m){
        $tag = $m->getAttribute('name') ?: $m->getAttribute('property');
        if(in_array($tag,['description','keywords']) || strpos($tag,'og:')===0) $res[str_replace('og:','',$tag)] = utf8_decode($m->getAttribute('content'));

    }

    return $res;
}
?>

- kevin walker

- shamittomar · Accepted Answer

This is how it should be done:

function file_get_contents_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

$html = file_get_contents_curl("http://example.com/");

//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');

//get and display what you need:
$title = $nodes->item(0)->nodeValue;

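// Default to empty strings so the echo lines below still work when a tag is missing.
$description = '';
$keywords = '';

$metas = $doc->getElementsByTagName('meta');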

for ($i = 0; $i < $metas->length; $i++)
{
    $meta = $metas->item($i);
    if($meta->getAttribute('name') == 'description')
        $description = $meta->getAttribute('content');
    if($meta->getAttribute('name') == 'keywords')
        $keywords = $meta->getAttribute('content');
}

echo "Title: $title". '<br/><br/>';
echo "Description: $description". '<br/><br/>';
echo "Keywords: $keywords";
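
The snippet above assumes that the request succeeds and that every tag is present. As a hedged sketch of how the parsing step might be made more defensive, the hypothetical helper below (extract_title_and_meta() is not part of the original answer) takes the HTML string, tolerates a failed fetch (curl_exec() returns false on failure), and returns null for anything that is missing.

```php
<?php
/**
 * Hypothetical helper: extract the title plus the description and keywords
 * meta tags from an HTML string. Missing values come back as null.
 */
function extract_title_and_meta($html)
{
    $result = array('title' => null, 'description' => null, 'keywords' => null);

    // curl_exec() returns false on failure, so bail out on non-string or empty input.
    if (!is_string($html) || trim($html) === '') {
        return $result;
    }

    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    if ($titleNode !== null) {
        $result['title'] = trim($titleNode->nodeValue);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description' || $name === 'keywords') {
            $result[$name] = $meta->getAttribute('content');
        }
    }

    return $result;
}

// Usage with an inline sample instead of a live request.
$sample = '<html><head><title>A common title</title>'
        . '<meta name="keywords" content="Keywords blabla">'
        . '</head><body></body></html>';
print_r(extract_title_and_meta($sample));
print_r(extract_title_and_meta(false)); // simulates a failed fetch
```

In a real script the argument would be the return value of file_get_contents_curl($url) from the answer above.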