Get all URLs in a string with PHP

5
I'm trying to figure out a way to get an array of URLs from a string of text. The text will be formatted somewhat like this:

Here is some random text

http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphones-bezel-a-massive-notification-light/?grcc=88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2=835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033fdeed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~

http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/

Obviously, those links can be anything (and there can be many of them; these are just the ones I'm testing with right now). With a simple URL, my regex works fine.
I am using:
preg_match_all('((https?|ftp|gopher|telnet|file|notes|ms-help):'.
    '((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)',
    $bodyMessage, $matches, PREG_PATTERN_ORDER);

When I run print_r($matches); this is the result I get:
Array ( [0] => Array (
    [0] => http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon=
    [1] => http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= 
    [2] => http://techcrunch.co=
    [3] => http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-ip= 
    [4] => http://techcrunch.com/2012/07/20/last-day-to-purc=
    [5] => http://tec=
)
...

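None of the items in that array is a complete link from the text above.

Does anyone know a good way to get what I need? I've found a bunch of regexes for extracting links in PHP, but none of them works.

Thanks!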

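Edit:

OK, so I'm pulling these links from an email. The script parses the email, grabs the message body, and then tries to extract the links from it. After inspecting the email, it appears that for some reason a space is being added in the middle of each URL. Here is the message body as my PHP script sees it.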

 --00248c711bb99ca36d04c54ba5c6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon= es-bezel-a-massive-notification-light/?grcc=3D88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2= =3D835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033f= deed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~ http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= ets-for-disrupt-sf/ --00248c711bb99ca36d04c54ba5c6 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable 

Any suggestions on how to keep the URLs from breaking?

Edit 2

Following laurent's suggestion, I ran this code:

 $bodyMessage = str_replace("= ", "",$bodyMessage);

However, when I output the result, it doesn't seem to have replaced the "= ".

 --00248c711bb99ca36d04c54ba5c6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon= es-bezel-a-massive-notification-light/?grcc=3D88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2= =3D835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033f= deed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~ http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= ets-for-disrupt-sf/ --00248c711bb99ca36d04c54ba5c6 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable 

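Looks fine to me: http://ideone.com/ulJ4a. - mellamokb
Hmm, interesting... I just edited my question... the links come from an email, which I then parse to get the message body... it looks like the email is putting a space in the middle of the links! Any suggestions? - Bill
Those instances of = look a lot like some kind of chunked encoding that your code doesn't seem to be handling correctly. - mellamokb
I would simply replace every "= " with an empty string before processing the text. - laurent
Please accept the answer if it was useful, @Bill. - Eswar Rajesh Pinapala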
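
To illustrate mellamokb's point: the dumped body declares Content-Transfer-Encoding: quoted-printable, where a trailing "=" marks a soft line break and "=3D" encodes a literal "=". If the raw body really contains "=" followed by a line break (an assumption; the dump above only shows it as "= "), that would also explain why str_replace("= ", ...) finds nothing. A minimal sketch using PHP's built-in quoted_printable_decode() follows; the function name extractUrlsFromQuotedPrintable, the sample body, and the example.com URL are made up for illustration.

```php
<?php
// Hypothetical helper: decode a quoted-printable body, then collect the URLs.
function extractUrlsFromQuotedPrintable(string $rawBody): array
{
    // Removes soft line breaks ("=" + CRLF) and turns "=3D" back into "=".
    $decoded = quoted_printable_decode($rawBody);

    // Stop each URL at whitespace, double quotes, or angle brackets.
    preg_match_all('~https?://[^\s"<>]+~i', $decoded, $matches);

    return $matches[0];
}

// Made-up sample that mimics the wrapped lines in the question.
$sampleBody = "http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick=\r\n"
            . "ets-for-disrupt-sf/\r\n"
            . "http://example.com/?grcc=3D88888&grcc2=\r\n"
            . "=3D8356\r\n";

print_r(extractUrlsFromQuotedPrintable($sampleBody));
```

Only the text/plain part's body should be passed in; the MIME boundary lines and part headers are not quoted-printable content.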
4 Answers

9
    /**
     * Get all URLs from a string (the string may itself be a URL).
     *
     * @param string $string
     * @return array
     */
    function getUrls($string) {
        // Match http:// or https:// up to the next double quote or space
        $regex = '/https?\:\/\/[^\" ]+/i';
        preg_match_all($regex, $string, $matches);
        return $matches[0];
    }

1
You should also add the newline to the negated character class: $regex = '/https?\:\/\/[^\" \n]+/i'; - unloco
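
A small sketch of why unloco's comment matters, assuming two URLs on consecutive lines: a class that excludes only the quote and the space still matches a newline, so both URLs merge into one match. Here \s is used in place of a literal space and \n (a slightly broader choice than the comment suggests), and the example.com URLs are placeholders.

```php
<?php
// Two placeholder URLs separated only by a newline.
$text = "http://example.com/a\nhttp://example.com/b";

// Excludes only double quote and space: the newline is matched as part of the URL.
preg_match_all('/https?:\/\/[^" ]+/i', $text, $loose);

// Excludes double quote and all whitespace (\s covers space, tab, CR, LF).
preg_match_all('/https?:\/\/[^"\s]+/i', $text, $strict);

echo count($loose[0]), " match(es) without newline exclusion\n";
echo count($strict[0]), " match(es) with whitespace exclusion\n";
```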

1
With the code below you get an array named $urls_in_string; index 0, $urls_in_string[0], holds all of the URLs.
    $urls_in_string = [];
    $string_with_urls = "World's most popular social networking website is https://www.facebook.com. We have many other such websites like https://twitter.com/home and https://www.linkedin.com/feed/ etc.";
    $reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,6}(\/\S*)?/im";
    preg_match_all($reg_exUrl, $string_with_urls, $urls_in_string);
    print_r($urls_in_string);

// Output
/*
Array
(
    [0] => Array
        (
            [0] => https://www.facebook.com
            [1] => https://twitter.com/home
            [2] => https://www.linkedin.com/feed/
        )

    [1] => Array
        (
            [0] => https
            [1] => https
            [2] => https
        )

    [2] => Array
        (
            [0] => 
            [1] => /home
            [2] => /feed/
        )

)
*/
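
The extra sub-arrays at indexes 1 and 2 come from the two capturing groups in the pattern. If only the full URLs are wanted, one option is to make the groups non-capturing with (?:...) and return index 0 directly. This is a sketch of that variation, not the answer's original code; the function name findUrls is made up here.

```php
<?php
// Hypothetical helper: same idea as above, but with non-capturing groups.
function findUrls(string $text): array
{
    // (?:...) groups without capturing, so only index 0 (the full matches) is filled.
    $pattern = '~(?:https?|ftps?)://[a-z0-9.-]+\.[a-z]{2,6}(?:/\S*)?~i';
    preg_match_all($pattern, $text, $matches);

    return $matches[0];
}

$sample = "See https://www.facebook.com. Also https://twitter.com/home and https://www.linkedin.com/feed/ etc.";
print_r(findUrls($sample));
```

Like the original pattern, this limits the top-level domain to 2 to 6 letters, so longer TLDs would be missed.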

0
You can do something like the following:

$url = "http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphones-bezel-a-massive-notification-light/?grcc=88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2=835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033fdeed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~

http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/";

$dataArray = explode("http",$url);

echo "<pre>";print_r($dataArray);

This will return an array like the following:

Array
(
 [0] => 
 [1] => ://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphones-bezel-a-massive-notification-light/?grcc=88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2=835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033fdeed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~


 [2] => ://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/
)

When you use the output above, prepend "http" to each element. I think this will help you.
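
A rough sketch of the remaining steps, assuming each URL ends at the first whitespace character: drop the piece before the first "http", put the prefix back, and cut off trailing text. The function name urlsFromExplode and the sample text are invented for illustration.

```php
<?php
// Hypothetical helper completing the explode() idea.
function urlsFromExplode(string $text): array
{
    $urls = [];

    // Skip element 0: it is whatever came before the first "http".
    foreach (array_slice(explode('http', $text), 1) as $part) {
        // Keep only the text up to the first whitespace and restore the prefix.
        $candidate = 'http' . preg_split('/\s+/', $part, 2)[0];

        // Discard pieces that are not really URLs (e.g. the word "http" in prose).
        if (preg_match('~^https?://\S~', $candidate)) {
            $urls[] = $candidate;
        }
    }

    return $urls;
}

$sample = "Intro text http://example.com/a\n\nhttps://example.org/b?x=1 trailing";
print_r(urlsFromExplode($sample));
```

One limitation of this approach: a URL that itself contains "http" (for example in a redirect parameter) is split into two pieces, so a regex is usually the safer choice.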

Happy Coding

0
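Please use the following regex.
// preg_* functions need pattern delimiters (# here); the u modifier makes the non-ASCII quote characters in the last class match as UTF-8.
$regex = "#(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))#u";

Hope this helps.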
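
A short usage sketch for this pattern with preg_match_all(). It assumes the pattern is wrapped in # delimiters with the u modifier (the preg_* functions require delimiters) and that the input is valid UTF-8; the sample sentence is made up.

```php
<?php
// The pattern from this answer, wrapped in # delimiters for preg_match_all().
$regex = "#(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))#u";

// Made-up sample with trailing punctuation after each link.
$text = "Visit http://example.com/page, or www.example.org.";

preg_match_all($regex, $text, $matches);

// $matches[0] holds the full matches; trailing "," and "." are left out.
print_r($matches[0]);
```

Note that this pattern also matches scheme-less links such as www.example.org, so a scheme may need to be added before the results are used as URLs.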
