PHP parallel curl requests


I'm building a simple application that reads JSON data from 15 different URLs. I have a specific requirement to do this on the server side, so I'm using file_get_contents($url).

Since I'm using file_get_contents($url), I wrote a simple script like this:

$websites = array(
    $url1,
    $url2,
    $url3,
     ...
    $url15
);

foreach ($websites as $website) {
    $data[] = file_get_contents($website);
}

This turned out to be very slow, because each request has to finish before the next one starts.


Googling "curl parallel requests" returns plenty of results. - matino
PHP is a single-threaded language with no built-in support for concurrency. You could write a script that fetches a single URL (passed as an argument) and run 15 instances of it. - GordonM
Thanks for all your input. :) - user1205408
In case anyone stumbles on this page: GordonM's comment above is incorrect; PHP's curl library specifically supports multiple parallel requests. Beyond that, you can build fully multithreaded PHP applications with the pthreads extension, but that is completely unnecessary and overkill for the curl extension. - thomasrutter
4 Answers

If you want to make multiple curl requests, you can use code like the following:
```php
$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);

$curl_arr = array();
$master = curl_multi_init();

for ($i = 0; $i < $node_count; $i++) {
    $url = $nodes[$i];
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

do {
    curl_multi_exec($master, $running);
} while ($running > 0);

$results = array();
for ($i = 0; $i < $node_count; $i++) {
    $results[] = curl_multi_getcontent($curl_arr[$i]);
}
print_r($results);
```
This code sends multiple curl requests at the same time, waits until they have all completed, and then collects every response in one pass.

May I know what $running contains? - ramya br
@ramyabr A boolean (by reference), used to check whether multicurl is still running and fetching data. - Shlizer
Your multi_exec loop works, but it also wastes a huge amount of CPU, pegging 100% of one core until everything has downloaded, because the loop spam-calls curl_multi_exec(), an *async* function, as fast as possible until everything is done. If you change it to do { curl_multi_exec($master,$running); if ($running > 0) { curl_multi_select($mh, 1); } } while ($running > 0); it will use ~1% CPU instead of 100% (an even better loop can still be built, which would be for (;;) { curl_multi_exec($mh, $running); if ($running < 1) break; curl_multi_select($mh, 1); }). - hanshenrik
@DivyeshPrajapati It works great until you check how much CPU it eats; see my comment above ^^ - hanshenrik
@Shlizer That's incorrect: $running contains an integer, the number of curl handles that have not yet finished downloading their entire response (the variable can be *used* as a boolean, since int(0)==false and int(>=1)==true, but the variable itself is an integer, not a boolean, and can hold any number >= 0, e.g. int(5)). - hanshenrik
@hanshenrik I haven't checked that, but it definitely cuts the request time... I once made 10 requests at the same time, each taking 3 seconds, so the total should have been around 25-30 seconds, but with this it dropped to 5-8 seconds. - Divyesh Prajapati
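As a side note on the $running discussion above, here is a small no-network sketch (my own, not from the answer) that uses the file:// protocol, which curl supports, to show that the second argument of curl_multi_exec() is an integer count of unfinished handles:

```php
<?php
// Demonstrates that curl_multi_exec()'s second argument is an integer
// count of still-running handles, not a boolean.
// A file:// URL stands in for a website so this runs without internet access.
$path = tempnam(sys_get_temp_dir(), 'curlmulti');
file_put_contents($path, 'hello');

$mh = curl_multi_init();
$ch = curl_init('file://' . $path);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_multi_add_handle($mh, $ch);

do {
    curl_multi_exec($mh, $running);
    // $running is always an int: the number of handles still downloading
} while ($running > 0);

$body = curl_multi_getcontent($ch);
curl_multi_remove_handle($mh, $ch);
curl_close($ch);
curl_multi_close($mh);
unlink($path);
```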


I don't particularly like the approach of any of the existing answers.

Timo's code: it may sleep/select() during CURLM_CALL_MULTI_PERFORM, which is wrong, and it may also fail to sleep when ($still_running > 0 && $exec != CURLM_CALL_MULTI_PERFORM), which can make the code spin at 100% CPU usage (of 1 core) for no reason.

Sudhir's code: it does not sleep while $still_running > 0, and spam-calls the *async* function curl_multi_exec() until everything has downloaded, which makes PHP use 100% CPU (of 1 CPU core) until everything has downloaded; in other words, it fails to sleep while downloading.

Here is an approach with neither of those problems:

$websites = array(
    "http://google.com",
    "http://example.org"
    // $url2,
    // $url3,
    // ...
    // $url15
);
$mh = curl_multi_init();
foreach ($websites as $website) {
    $worker = curl_init($website);
    curl_setopt_array($worker, [
        CURLOPT_RETURNTRANSFER => 1
    ]);
    curl_multi_add_handle($mh, $worker);
}
for (;;) {
    $still_running = null;
    do {
        $err = curl_multi_exec($mh, $still_running);
    } while ($err === CURLM_CALL_MULTI_PERFORM);
    if ($err !== CURLM_OK) {
        // handle curl multi error?
    }
    if ($still_running < 1) {
        // all downloads completed
        break;
    }
    // some haven't finished downloading, sleep until more data arrives:
    curl_multi_select($mh, 1);
}
$results = [];
while (false !== ($info = curl_multi_info_read($mh))) {
    if ($info["result"] !== CURLE_OK) {
        // handle download error?
    }
    $results[curl_getinfo($info["handle"], CURLINFO_EFFECTIVE_URL)] = curl_multi_getcontent($info["handle"]);
    curl_multi_remove_handle($mh, $info["handle"]);
    curl_close($info["handle"]);
}
curl_multi_close($mh);
var_export($results);

Note that one issue shared by all 3 approaches here (my answer, Sudhir's answer, and Timo's answer) is that they open all connections simultaneously. If you need to fetch 1 million websites, these scripts will try to open 1 million connections at the same time. If you only want to download, say, 50 websites at a time, try:
$websites = array(
    "http://google.com",
    "http://example.org"
    // $url2,
    // $url3,
    // ...
    // $url15
);
var_dump(fetch_urls($websites,50));
function fetch_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $return_fault_reason = true): array
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (! is_string($foo)) {
            throw new \InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); // ?
        }
    }
    unset($foo);
    // DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    $work = function () use (&$ret, &$workers, &$mh, $return_fault_reason) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            do {
                $err = curl_multi_exec($mh, $still_running);
            } while ($err === CURLM_CALL_MULTI_PERFORM);
            if ($still_running < count($workers)) {
                // some workers finished, fetch their response and close them
                break;
            }
            $cms = curl_multi_select($mh, 1);
            // var_dump('sr: ' . $still_running . " c: " . count($workers)." cms: ".$cms);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            // echo "NOT FALSE!";
            // var_dump($info);
            {
                if ($info['msg'] !== CURLMSG_DONE) {
                    continue;
                }
                if ($info['result'] !== CURLE_OK) {
                    if ($return_fault_reason) {
                        $ret[$workers[(int) $info['handle']]] = print_r(array(
                            false,
                            $info['result'],
                            "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result'])
                        ), true);
                    }
                } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                    if ($return_fault_reason) {
                        $ret[$workers[(int) $info['handle']]] = print_r(array(
                            false,
                            $err,
                            "curl error " . $err . ": " . curl_strerror($err)
                        ), true);
                    }
                } else {
                    $ret[$workers[(int) $info['handle']]] = curl_multi_getcontent($info['handle']);
                }
                curl_multi_remove_handle($mh, $info['handle']);
                assert(isset($workers[(int) $info['handle']]));
                unset($workers[(int) $info['handle']]);
                curl_close($info['handle']);
            }
        }
        // echo "NO MORE INFO!";
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            // echo "TOO MANY WORKERS!\n";
            $work();
        }
        $neww = curl_init($url);
        if (! $neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of system resources", E_USER_WARNING);
            if ($return_fault_reason) {
                $ret[$url] = array(
                    false,
                    - 1,
                    "curl_init() failed"
                );
            }
            continue;
        }
        $workers[(int) $neww] = $url;
        curl_setopt_array($neww, array(
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));
        curl_multi_add_handle($mh, $neww);
        // curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
    }
    while (count($workers) > 0) {
        // echo "WAITING FOR WORKERS TO BECOME 0!";
        // var_dump(count($workers));
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}

This will download the entire list, with no more than 50 URLs downloading at the same time (but even this approach stores all the results in RAM, so it can still run out of memory; if you want to store the results in a database instead of RAM, modify the curl_multi_getcontent part to store the content in a database rather than in a RAM-persistent variable).
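As a rough illustration of that last point, here is a minimal sketch of writing each body to a database as it arrives. The PDO handle, the ':memory:' SQLite DSN (a real script would use a file path or a server DSN), and the "responses" table are my assumptions, not part of the answer above:

```php
<?php
// Sketch: persist fetched bodies to SQLite instead of keeping them all in RAM.
// The DSN and table name are placeholders, not part of the original code.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE IF NOT EXISTS responses (url TEXT PRIMARY KEY, body BLOB)');
$stmt = $db->prepare('INSERT OR REPLACE INTO responses (url, body) VALUES (?, ?)');

// In the $work closure, instead of
//   $ret[$workers[(int) $info['handle']]] = curl_multi_getcontent($info['handle']);
// write the body straight to the database so it leaves PHP's memory:
function store_response(PDOStatement $stmt, string $url, string $body): void
{
    $stmt->execute([$url, $body]);
}

// Example with a fake response (no network involved):
store_response($stmt, 'http://example.org', '{"hello":"world"}');
```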


May I ask what $return_fault_reason means? - Ali Niaz
@AliNiaz sorry, when copying the code from this answer I forgot that $return_fault_reason is an argument telling the function whether a failed download should just be ignored, or should come with an error message; I've updated the code to take the $return_fault_reason argument into account. - hanshenrik

@hanshenrik's answer is great for CPU optimization, but quite complex.
Calling `curl_multi_select` at the end of the loop body is enough: it forces the loop to sleep until there is activity on any of the curl_multi connections, which prevents the loop from causing 100% CPU usage.
do {
  curl_multi_exec($multiHandle, $stillRunning);
  // just block the loop until there's activity on any curl_multi connection
  curl_multi_select($multiHandle, 1);
} while ($stillRunning > 0);
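For context, a complete minimal version of the loop above might look like the following sketch. The file:// URLs are placeholders standing in for real websites so that it runs without network access:

```php
<?php
// Complete minimal sketch around the do/while loop above.
// file:// URLs are placeholders so the example works offline.
$paths = [];
foreach (['one', 'two'] as $content) {
    $p = tempnam(sys_get_temp_dir(), 'multi');
    file_put_contents($p, $content);
    $paths[] = $p;
}

$multiHandle = curl_multi_init();
$handles = [];
foreach ($paths as $p) {
    $ch = curl_init('file://' . $p);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($multiHandle, $ch);
    $handles[] = $ch;
}

do {
    curl_multi_exec($multiHandle, $stillRunning);
    // block until there's activity on any curl_multi connection
    curl_multi_select($multiHandle, 1);
} while ($stillRunning > 0);

$bodies = [];
foreach ($handles as $ch) {
    $bodies[] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($multiHandle, $ch);
    curl_close($ch);
}
curl_multi_close($multiHandle);
array_map('unlink', $paths);
```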

I'd like to provide a more complete example that avoids 100% CPU usage and avoids crashing on minor errors or unexpected situations.
It also shows you how to get the headers, the body, request info, and do manual redirects.
Disclaimer: this code is intended to be extended and implemented into a library or as a quick starting point, and as such the functions inside it are kept to a minimum.
function mtime(){
    return microtime(true);
}
function ptime($prev){
    $t = microtime(true) - $prev;
    $t = $t * 1000;
    return str_pad($t, 20, 0, STR_PAD_RIGHT);
}

// This function exists to add compatibility for CURLM_CALL_MULTI_PERFORM for old curl versions, on modern curl it will only run once and be the equivalent of calling curl_multi_exec
function curl_multi_exec_full($mh, &$still_running) {
    // In theory curl_multi_exec should never return CURLM_CALL_MULTI_PERFORM (-1) because it has been deprecated
    // In practice it sometimes does
    // So imagine that this just runs curl_multi_exec once and returns its value
    do {
        $state = curl_multi_exec($mh, $still_running);

        // curl_multi_select($mh, $timeout) simply blocks for $timeout seconds while curl_multi_exec() returns CURLM_CALL_MULTI_PERFORM
        // We add it to prevent CPU 100% usage in case this thing misbehaves (especially for old curl on windows)
    } while ($still_running > 0 && $state === CURLM_CALL_MULTI_PERFORM && curl_multi_select($mh, 0.1));
    return $state;
}

// This function replaces curl_multi_select and makes the name make more sense, since all we're doing is waiting for curl, it also forces a minimum sleep time between requests to avoid excessive CPU usage.
function curl_multi_wait($mh, $minTime = 0.001, $maxTime = 1){
    $umin = $minTime*1000000;

    $start_time = microtime(true);

    // it sleeps until there is some activity on any of the descriptors (curl files)
    // it returns the number of descriptors (curl files that can have activity)
    $num_descriptors = curl_multi_select($mh, $maxTime);

    // if the system returns -1, it means that the wait time is unknown, and we have to decide the minimum time to wait
    // but our `$timespan` check below catches this edge case, so this `if` isn't really necessary
    if($num_descriptors === -1){
        usleep($umin);
    }

    // convert the elapsed time to microseconds so it is comparable with $umin
    $timespan = (microtime(true) - $start_time) * 1000000;

    // This thing runs very fast, up to 1000 times for 2 urls, which wastes a lot of CPU
    // This will reduce the runs so that each interval is separated by at least minTime
    if($timespan < $umin){
        usleep($umin - $timespan);
        //print "sleep for ".($umin - $timespan).PHP_EOL;
    }
}


$handles = [
    [
        CURLOPT_URL=>"http://example.com/",
        CURLOPT_HEADER=>false,
        CURLOPT_RETURNTRANSFER=>true,
        CURLOPT_FOLLOWLOCATION=>false,
    ],
    [
        CURLOPT_URL=>"http://www.php.net",
        CURLOPT_HEADER=>false,
        CURLOPT_RETURNTRANSFER=>true,
        CURLOPT_FOLLOWLOCATION=>false,

        // this function is called by curl for each header received
        // This complies with RFC822 and RFC2616, please do not suggest edits to make use of the mb_ string functions, it is incorrect!
        // https://dev59.com/3mox5IYBdhLWcg3wekIr#41135574
        CURLOPT_HEADERFUNCTION=>function($ch, $header)
        {
            print "header from http://www.php.net: ".$header;
            //$header = explode(':', $header, 2);
            //if (count($header) < 2){ // ignore invalid headers
            //    return $len;
            //}

            //$headers[strtolower(trim($header[0]))][] = trim($header[1]);

            return strlen($header);
        }
    ]
];




//create the multiple cURL handle
$mh = curl_multi_init();

$chandles = [];
foreach($handles as $opts) {
    // create cURL resources
    $ch = curl_init();

    // set URL and other appropriate options
    curl_setopt_array($ch, $opts);

    // add the handle
    curl_multi_add_handle($mh, $ch);

    $chandles[] = $ch;
}


//execute the multi handle
$prevRunning = null;
$count = 0;
do {
    $time = mtime();

    // $running contains the number of currently running requests
    $status = curl_multi_exec_full($mh, $running);
    $count++;

    print ptime($time).": curl_multi_exec status=$status running $running".PHP_EOL;

    // One less is running, meaning one has finished
    if($running < $prevRunning){
        print ptime($time).": curl_multi_info_read".PHP_EOL;

        // msg: The CURLMSG_DONE constant. Other return values are currently not available.
        // result: One of the CURLE_* constants. If everything is OK, the CURLE_OK will be the result.
        // handle: Resource of type curl indicates the handle which it concerns.
        while ($read = curl_multi_info_read($mh, $msgs_in_queue)) {

            $info = curl_getinfo($read['handle']);

            if($read['result'] !== CURLE_OK){
                // handle the error somehow
                print "Error: ".$info['url'].PHP_EOL;
            }

            if($read['result'] === CURLE_OK){
                /*
                // This will automatically follow the redirect and still give you control over the previous page
                // TODO: max redirect checks and redirect timeouts
                if(isset($info['redirect_url']) && trim($info['redirect_url'])!==''){

                    print "running redirect: ".$info['redirect_url'].PHP_EOL;
                    $ch3 = curl_init();
                    curl_setopt($ch3, CURLOPT_URL, $info['redirect_url']);
                    curl_setopt($ch3, CURLOPT_HEADER, 0);
                    curl_setopt($ch3, CURLOPT_RETURNTRANSFER, 1);
                    curl_setopt($ch3, CURLOPT_FOLLOWLOCATION, 0);
                    curl_multi_add_handle($mh,$ch3);
                }
                */

                print_r($info);
                $body = curl_multi_getcontent($read['handle']);
                print $body;
            }
        }
    }

    // Still running? keep waiting...
    if ($running > 0) {
        curl_multi_wait($mh);
    }

    $prevRunning = $running;

} while ($running > 0 && $status == CURLM_OK);

//close the handles
foreach($chandles as $ch){
    curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);

print $count.PHP_EOL;

Your code handles CURLM_CALL_MULTI_PERFORM (hence CCMP) incorrectly: you're not supposed to run select() if you get CCMP, you're supposed to call multi_exec() again. Worse, since around 2012 curl never returns CCMP anymore, so your $state === CCMP check will always fail, meaning your exec loop will always exit after the first iteration. - hanshenrik
@hanshenrik When I read the docs (I can't remember where), they said select does nothing except add a wait time during CCMP, which is actually required on Windows, otherwise it would hit 100% CPU usage on old curl; so if I remove the select I'd break it for curl on pre-2012 Windows. I do run select: it's inside the curl_multi_wait function. Note that further down the code counts process completions one by one, which means we don't care whether curl_multi_exec_full completes in one loop or runs select; new curl won't do that. - Timo Huovinen
@hanshenrik Interesting, I didn't know that. To work around a bug in old Windows versions I added curl_multi_select during CCMP, where it acts like a sleep. I'm slightly worried that removing it would make the code less "robust", but it's acceptable. - Timo Huovinen
@hanshenrik Is there any harm in keeping curl_multi_select for CCMP inside curl_multi_exec_full? - Timo Huovinen
CCMP means "there is more data that can be read right now, you should run read() immediately, it will not block" - and then your code goes on to run select() (instead of read()) and waits for more data to arrive instead of going to read() - I assume this can slow the code down (waiting in select() when you should read()) if the next batch of data arrives slowly, or if some buffer is already full and waiting to be read. - hanshenrik
