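I have 50 Sidekiq threads crawling the web, and a few weeks ago the threads started hanging after about 20 minutes of running. When I dump the backtraces, most of the threads are stuck in net/http initialize: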
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/net/http.rb:879:in `initialize'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/net/http.rb:879:in `open'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/net/http.rb:879:in `block in connect'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/timeout.rb:76:in `timeout'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/net/http.rb:878:in `connect'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/net/http.rb:863:in `do_start'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/net/http.rb:858:in `start'
/app/vendor/bundle/ruby/2.1.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:700:in `start'
/app/vendor/bundle/ruby/2.1.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:631:in `connection_for'
/app/vendor/bundle/ruby/2.1.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:994:in `request'
/app/vendor/bundle/ruby/2.1.0/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:257:in `fetch'
/app/vendor/bundle/ruby/2.1.0/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:974:in `response_redirect'
/app/vendor/bundle/ruby/2.1.0/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:298:in `fetch'
/app/vendor/bundle/ruby/2.1.0/gems/mechanize-2.7.2/lib/mechanize.rb:432:in `get'
/app/app/workers/crawl_page.rb:24:in `block in perform'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/timeout.rb:91:in `block in timeout'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/timeout.rb:35:in `block in catch'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/timeout.rb:35:in `catch'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/timeout.rb:35:in `catch'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/timeout.rb:106:in `timeout'
I had assumed that wrapping the whole call in a timeout would keep Sidekiq from getting stuck in net/http, for example: Timeout::timeout(APP_CONFIG['crawl_page_timeout']) { @page = agent.get(url) }
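To make the pattern concrete, here is a minimal sketch of what the worker does. The names slow_fetch and fetch_with_timeout and the limits are hypothetical, with slow_fetch standing in for agent.get(url). It only shows the timeout firing when the blocked call is an interruptible sleep; it does not show whether a call stuck in socket setup can be interrupted the same way.

```ruby
require 'timeout'

# Hypothetical stand-in for agent.get(url): blocks for the given time.
def slow_fetch(seconds)
  sleep(seconds)
  :page
end

# Runs the block under Timeout.timeout; returns the block's value,
# or :timed_out if the limit is exceeded.
def fetch_with_timeout(limit)
  Timeout.timeout(limit) { yield }
rescue Timeout::Error
  :timed_out
end

fast = fetch_with_timeout(1)   { slow_fetch(0.01) }  # finishes in time
slow = fetch_with_timeout(0.1) { slow_fetch(2) }     # exceeds the limit
```

In this sketch the second call is cut off, which is the behavior I expected from the real crawl as well.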
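But then I came across some old posts saying that Ruby's Timeout is not thread-safe: http://blog.headius.com/2008/02/rubys-threadraise-threadkill-timeoutrb.html
Is Ruby's Timeout still unsafe today?
I know plenty of people write web crawlers in Ruby. If Timeout is not safe, how do people deal with net/http getting stuck?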
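Update:
I have switched from Mechanize to HTTPClient (which explicitly states that it is thread-safe). But we still seem to have threads hanging in initialize. This could be because Ruby's Timeout is not working properly, or it could be a Sidekiq issue. Here is the stack trace of the most recently hung Sidekiq thread: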
/app/vendor/bundle/ruby/2.1.0/gems/httpclient-2.4.0/lib/httpclient/session.rb:805:in `initialize'
/app/vendor/bundle/ruby/2.1.0/gems/httpclient-2.4.0/lib/httpclient/session.rb:805:in `new'
/app/vendor/bundle/ruby/2.1.0/gems/httpclient-2.4.0/lib/httpclient/session.rb:805:in `create_socket'
/app/vendor/bundle/ruby/2.1.0/gems/httpclient-2.4.0/lib/httpclient/session.rb:752:in `block in connect'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/timeout.rb:91:in `block in timeout'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/timeout.rb:101:in `call'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/timeout.rb:101:in `timeout'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/timeout.rb:127:in `timeout'
/app/vendor/bundle/ruby/2.1.0/gems/httpclient-2.4.0/lib/httpclient/session.rb:751:in `connect'
/app/vendor/bundle/ruby/2.1.0/gems/httpclient-2.4.0/lib/httpclient/session.rb:609:in `query'
/app/vendor/bundle/ruby/2.1.0/gems/httpclient-2.4.0/lib/httpclient/session.rb:164:in `query'
/app/vendor/bundle/ruby/2.1.0/gems/httpclient-2.4.0/lib/httpclient.rb:1087:in `do_get_block'
/app/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.2.239/lib/new_relic/agent/instrumentation/httpclient.rb:34:in `block in do_get_block_with_newrelic'
/app/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.2.239/lib/new_relic/agent/cross_app_tracing.rb:43:in `tl_trace_http_request'
/app/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.2.239/lib/new_relic/agent/instrumentation/httpclient.rb:33:in `do_get_block_with_newrelic'
/app/vendor/bundle/ruby/2.1.0/gems/httpclient-2.4.0/lib/httpclient.rb:891:in `block in do_request'
/app/vendor/bundle/ruby/2.1.0/gems/httpclient-2.4.0/lib/httpclient.rb:985:in `protect_keep_alive_disconnected'
/app/vendor/bundle/ruby/2.1.0/gems/httpclient-2.4.0/lib/httpclient.rb:890:in `do_request'
/app/vendor/bundle/ruby/2.1.0/gems/httpclient-2.4.0/lib/httpclient.rb:963:in `follow_redirect'
/app/vendor/bundle/ruby/2.1.0/gems/httpclient-2.4.0/lib/httpclient.rb:776:in `request'
/app/vendor/bundle/ruby/2.1.0/gems/httpclient-2.4.0/lib/httpclient.rb:677:in `get'
/app/app/ohm_models/queued_page.rb:20:in `run_crawl'