PhantomJS OSError: [Errno 9] Bad file descriptor

When I use PhantomJS in a Scrapy middleware, I sometimes get the following error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
    response = yield method(request=request, spider=spider)
  File "/home/ttc/ruyi-scrapy/saibolan/saibolan/hz_webdriver_middleware.py", line 47, in process_request
    driver.quit()
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/phantomjs/webdriver.py", line 76, in quit
    self.service.stop()
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/common/service.py", line 149, in stop
    self.send_remote_shutdown_command()
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/phantomjs/service.py", line 67, in send_remote_shutdown_command
    os.close(self._cookie_temp_file_handle)
OSError: [Errno 9] Bad file descriptor

It doesn't actually happen on every request: I crawled 80 pages and the error appeared only about 30 times, always inside the PhantomJS middleware.

import time

from scrapy.exceptions import IgnoreRequest
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.ui import WebDriverWait
# from pyvirtualdisplay import Display  # only needed for the headless display below


class HZPhantomjsMiddleware(object):

    def __init__(self, settings):
        self.phantomjs_driver_path = settings.get('PHANTOMJS_DRIVER_PATH')
        self.cloud_mode = settings.get('CLOUD_MODE')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # A display is required in production; comment these lines out for local debugging.
        # if self.cloud_mode:
        #     display = Display(visible=0, size=(800, 600))
        #     display.start()
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36")
        driver = webdriver.PhantomJS(
            self.phantomjs_driver_path, desired_capabilities=dcap)
        # chrome_options = webdriver.ChromeOptions()
        # prefs = {"profile.managed_default_content_settings.images": 2}
        # chrome_options.add_experimental_option("prefs", prefs)
        # driver = webdriver.Chrome(self.chrome_driver_path, chrome_options=chrome_options)
        driver.get(request.url)
        try:
            WebDriverWait(driver, 15).until(
                ec.presence_of_element_located(
                    (By.XPATH, '//div[@class="txt-box"]|//h4[@class="weui_media_title"]|//div[@class="rich_media_content "]'))
            )
            body = driver.page_source
            time.sleep(1)
            driver.quit()
            return HtmlResponse(request.url, body=body, encoding='utf-8', request=request)
        except Exception:
            driver.quit()
            spider.logger.error('Ignore request, url: {}'.format(request.url))
            raise IgnoreRequest()

I don't know what is causing this error.

How are you running the program? It looks like a filesystem error; perhaps the disk is out of space, or something similar. - Burhan Khalid
@BurhanKhalid I normally run the Scrapy program with "scrapy crawl spider --loglevel=INFO --logfile=1.log". - TtC
2 Answers

As of July 2016, driver.close() and driver.quit() were not sufficient for me. They ended the node process but not the phantomjs child process it spawned.
From the discussion on this GitHub issue, the only solution that worked for me was to run:
import signal

driver.service.process.send_signal(signal.SIGTERM) # kill the specific phantomjs child proc
driver.quit()                                      # quit the node proc
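
Applied to the middleware in the question, this means replacing each driver.quit() call with a teardown that signals the phantomjs child first. A minimal sketch; the shutdown_driver helper name is ours, and the OSError guards are an assumption added so a teardown failure cannot mask the real error:

import signal

def shutdown_driver(driver):
    # Hypothetical helper: terminate the phantomjs child process first,
    # then quit the controlling process; ignore errors that can only
    # occur while tearing down an already-dead process.
    try:
        driver.service.process.send_signal(signal.SIGTERM)
    except OSError:
        pass  # phantomjs process already gone
    try:
        driver.quit()
    except OSError:
        pass  # e.g. [Errno 9] Bad file descriptor during shutdown

Both driver.quit() calls in process_request would then become shutdown_driver(driver).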

The problem is described here: https://github.com/SeleniumHQ/selenium/issues/3216. The suggested workaround (specifying the cookies file explicitly) worked for me:
driver = webdriver.PhantomJS(self.phantomjs_driver_path, desired_capabilities=dcap,
                             service_args=['--cookies-file=/tmp/cookies.txt'])
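
Note that the middleware in the question starts a fresh PhantomJS for every request, so pointing several concurrent drivers at one shared /tmp/cookies.txt may reintroduce contention over a single file. A minimal sketch of giving each driver its own cookies file using the standard tempfile module; the make_driver helper is hypothetical and not part of the reported workaround:

import os
import tempfile

from selenium import webdriver

def make_driver(phantomjs_path, dcap):
    # Hypothetical helper: allocate a unique cookies file per driver so
    # concurrent PhantomJS instances never share one temp file.
    fd, cookies_path = tempfile.mkstemp(prefix='phantomjs-cookies-', suffix='.txt')
    os.close(fd)  # PhantomJS opens the file by path; the handle is not needed
    return webdriver.PhantomJS(
        phantomjs_path,
        desired_capabilities=dcap,
        service_args=['--cookies-file={}'.format(cookies_path)])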
