How to run Scrapy in a while loop

4

I'm working on a project that uses multiple spiders to scrape different websites. I want the spiders to run again when the user answers "yes" to the prompt asking whether to continue.

keyword = input("enter keyword: ")
page_range = input("enter page range: ")

flag = True

while flag:

   process = CrawlerProcess()
   process.crawl(crawler1, keyword, page_range)
   process.crawl(crawler2, keyword, page_range)
   process.crawl(crawler3, keyword, page_range)
   process.start()

   isContinue = input("Do you want to continue? (y/n): ")

   if isContinue == 'n':
      flag = False

But I get an error saying the reactor is not restartable.
Traceback (most recent call last):
  File "/Users/user/Desktop/programs/eshopSpider/eshopSpider.py", line 47, in <module>
    process.start()
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/scrapy/crawler.py", line 327, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/twisted/internet/base.py", line 1317, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/twisted/internet/base.py", line 1299, in startRunning
    ReactorBase.startRunning(cast(ReactorBase, self))
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/twisted/internet/base.py", line 843, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

So I guess the while loop approach won't work. I don't even know where to start...


The problem isn't the while loop but the special event loop Scrapy runs on (called the Reactor in the twisted module): once stopped, it cannot be used again. You should check the twisted documentation to see whether the Reactor can be reset. - furas
2
Found an article on Google called Restarting a Twisted Reactor. It's an older post and I haven't tested it, but maybe it will be useful. It uses del to remove the twisted module from memory and then imports it again. - furas
@furas - I can confirm this works, though it's a bit hacky! It's the only solution I've found, however... - Jossy
4 Answers

4

Method 1:

Scrapy creates a Reactor that cannot be reused after stop, but if you run the Crawler in a separate process, the new process will have to create a new Reactor.

import multiprocessing

def run_crawler(keyword, page_range):
   process = CrawlerProcess()
   process.crawl(crawler1, keyword, page_range)
   process.crawl(crawler2, keyword, page_range)
   process.crawl(crawler3, keyword, page_range)
   process.start()

# --- main ---

keyword = input("enter keyword: ")
page_range = input("enter page range: ")

flag = True

while flag:

   p = multiprocessing.Process(target=run_crawler, args=(keyword, page_range))
   p.start()
   p.join()

   isContinue = input("Do you want to continue? (y/n): ")

   if isContinue == 'n':
      flag = False

It will not work if you use threading instead of multiprocessing, because threads share variables, so a new thread would use the same Reactor as the previous thread.
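
To make that concrete, here is a minimal sketch (not from the original answer, and independent of Scrapy) showing that every thread in a process sees the same reactor object, which is why a reactor stopped in one thread stays stopped for all of them:

import threading

# The Twisted reactor is a module-level singleton, so all threads share it.
from twisted.internet import reactor

seen = []

def record_reactor_id():
    # Every thread ends up looking at exactly the same reactor object.
    seen.append(id(reactor))

threads = [threading.Thread(target=record_reactor_id) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(seen[0] == seen[1])  # True -> once stopped, it is stopped for every thread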


Minimal working code (tested on Linux).

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    #start_urls = ['https://books.toscrape.com/']

    def __init__(self, keyword, page, *args, **kwargs):
        '''generate start_urls list'''
        super().__init__(*args, **kwargs)
        
        self.keyword = keyword
        self.page = int(page)
        self.start_urls = [f'https://books.toscrape.com/catalogue/page-{page}.html']

    def parse(self, response):
        print('[parse] url:', response.url)

        for book in response.css('article.product_pod'):
            title = book.css('h3 a::text').get()
            url = book.css('img::attr(src)').get()
            url = response.urljoin(url)
            yield {'page': self.page, 'keyword': self.keyword, 'title': title, 'image': url}

# --- run without project and save in `output.csv` ---

import multiprocessing
from scrapy.crawler import CrawlerProcess

def run_crawler(keyword, page):
    #from scrapy.crawler import CrawlerProcess

    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        # save in file CSV, JSON or XML
        'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
    })
    c.crawl(MySpider, keyword, page)
    c.crawl(MySpider, keyword, int(page)+1)
    c.crawl(MySpider, keyword, int(page)+2)
    c.start()
    
# --- main ---

if __name__ == '__main__':
    keyword = input("enter keyword: ")
    page    = input("enter page: ")
        
    running = True
    while running:

        p = multiprocessing.Process(target=run_crawler, args=(keyword, page))
        p.start()
        p.join()
        
        answer = input('Repeat [Y/n]? ').strip().lower()
        
        if answer == 'n':
            running = False

Method 2:

Found an article on Google: Restarting a Twisted Reactor.

It is an older post; it uses del to remove the twisted module from memory and then imports it again.

keyword = input("enter keyword: ")
page_range = input("enter page range: ")

flag = True

while flag:

   process = CrawlerProcess()
   process.crawl(crawler1, keyword, page_range)
   process.crawl(crawler2, keyword, page_range)
   process.crawl(crawler3, keyword, page_range)
   process.start()

   isContinue = input("Do you want to continue? (y/n): ")

   if isContinue == 'n':
      flag = False
           
   import sys
   del sys.modules['twisted.internet.reactor']
   from twisted.internet import reactor
   from twisted.internet import default
   default.install()                  

Minimal working code (tested on Linux).

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    #start_urls = ['https://books.toscrape.com/']

    def __init__(self, keyword, page, *args, **kwargs):
        '''generate start_urls list'''
        super().__init__(*args, **kwargs)
        
        self.keyword = keyword
        self.page = int(page)
        self.start_urls = [f'https://books.toscrape.com/catalogue/page-{page}.html']

    def parse(self, response):
        print('[parse] url:', response.url)

        for book in response.css('article.product_pod'):
            title = book.css('h3 a::text').get()
            url = book.css('img::attr(src)').get()
            url = response.urljoin(url)
            yield {'page': self.page, 'keyword': self.keyword, 'title': title, 'image': url}

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

def run_crawler(keyword, page):

    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        # save in file CSV, JSON or XML
        'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
    })
    c.crawl(MySpider, keyword, page)
    c.crawl(MySpider, keyword, int(page)+1)
    c.crawl(MySpider, keyword, int(page)+2)
    c.start()
    
# --- main ---

if __name__ == '__main__':
    keyword = input("enter keyword: ")
    page    = input("enter page: ")
        
    running = True
    while running:
    
        run_crawler(keyword, page)
        
        answer = input('Repeat [Y/n]? ').strip().lower()
        
        if answer == 'n':
            running = False
            
        import sys
        del sys.modules['twisted.internet.reactor']
        from twisted.internet import reactor
        from twisted.internet import default
        default.install()            

Method 3:

It looks like you can use CrawlerRunner instead of CrawlerProcess, but I haven't tested it.

Based on the last example in the documentation, Running multiple spiders in the same process, I created code that runs the while loop inside the reactor (so the reactor never has to be stopped). However, it runs the first Spider, then the second Spider, then asks whether to continue, and then runs the first Spider again followed by the second one. It does not run the Spiders at the same time, but perhaps that can be changed somehow.

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    #start_urls = ['https://books.toscrape.com/']

    def __init__(self, keyword, page, *args, **kwargs):
        '''generate start_urls list'''
        super().__init__(*args, **kwargs)
        
        self.keyword = keyword
        self.page = int(page)
        self.start_urls = [f'https://books.toscrape.com/catalogue/page-{page}.html']

    def parse(self, response):
        print('[parse] url:', response.url)

        for book in response.css('article.product_pod'):
            title = book.css('h3 a::text').get()
            url = book.css('img::attr(src)').get()
            url = response.urljoin(url)
            yield {'page': self.page, 'keyword': self.keyword, 'title': title, 'image': url}

# --- run without project and save in `output.csv` ---

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

@defer.inlineCallbacks
def run_crawler():

    running = True
    while running:

        yield runner.crawl(MySpider, keyword, page)
        yield runner.crawl(MySpider, keyword, int(page)+1)
        yield runner.crawl(MySpider, keyword, int(page)+2)

        answer = input('Repeat [Y/n]? ').strip().lower()
    
        if answer == 'n':
            running = False
            reactor.stop()
            #return

# --- main ---

if __name__ == '__main__':
    keyword = input("enter keyword: ")
    page    = input("enter page: ")

    configure_logging()        
    
    runner = CrawlerRunner({
        'USER_AGENT': 'Mozilla/5.0',
        # save in file CSV, JSON or XML
        'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
    })

    run_crawler()

    reactor.run()     

EDIT:

Now all spiders run at the same time.

@defer.inlineCallbacks
def run_crawler():

    running = True
    while running:
    
        runner.crawl(MySpider, keyword, page)
        runner.crawl(MySpider, keyword, int(page)+1)
        runner.crawl(MySpider, keyword, int(page)+2)
        
        d = runner.join()
        yield d

        answer = input('Repeat [Y/n]? ').strip().lower()
    
        if answer == 'n':
            running = False
            reactor.stop()
            #return

2
You can run crawlers in a loop by installing the reactor at the top level, before any other scrapy or reactor imports, and then deleting the reactor after each crawl. This worked for me.

main.py

import time
from spider_utils import run_crawler

while 1:
    run_crawler('spider1')
    run_crawler('spider2')
    time.sleep(60)

spider_utils.py

from scrapy.utils.reactor import install_reactor
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_crawler(spider_name: str):
    """Run isolated spider and restart reactor to run another spider afterwards."""
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_name)
    process.start()

    import sys
    del sys.modules['twisted.internet.reactor']

1
Hi. I couldn't get the code above to work, but I moved install_reactor to the start of run_crawler and it worked like a charm! I also moved import sys out of the function to the top of the module... - undefined
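
For reference, here is a sketch of the variant described in that comment (an untested assumption based only on the comment's description): install_reactor is called at the start of run_crawler, and import sys sits at module level.

# spider_utils.py (variant from the comment above -- untested sketch)
import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.utils.reactor import install_reactor


def run_crawler(spider_name: str):
    """Run an isolated spider; install the reactor here so each call starts fresh."""
    install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')

    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_name)
    process.start()

    # Drop the stopped reactor so the next run_crawler() call can install a new one.
    del sys.modules['twisted.internet.reactor']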

0
import sys

from twisted.internet import reactor  # only this reactor import is supposed to be here; we will be deleting the reactor after each run, from the main
from twisted.internet import default

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)
d = runner.crawl('your spider class name')
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until all crawling jobs are finished

del sys.modules['twisted.internet.reactor']  # deleting the reactor because we want to run a for loop; it will be imported again at the top
default.install()
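
The comments in the snippet mention wrapping it in a for loop; here is a hedged sketch of what that loop might look like, re-importing the reactor at the top of each iteration as the comments describe (the spider name is still the placeholder from the answer, and the number of iterations is hypothetical):

import sys

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
settings = get_project_settings()

for _ in range(3):  # hypothetical number of runs
    # Re-import the reactor at the top of each iteration; after the previous
    # iteration deleted it and installed a fresh default reactor, this picks
    # up the new, not-yet-started instance.
    from twisted.internet import reactor
    from twisted.internet import default

    runner = CrawlerRunner(settings)
    d = runner.crawl('your spider class name')  # placeholder from the answer above
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # blocks until this crawl finishes

    # Drop the stopped reactor and install a fresh one for the next iteration.
    del sys.modules['twisted.internet.reactor']
    default.install()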

0

You can remove the while loop and use a callback instead.

EDIT: added an example:

def callback_f():
    # stuff #
    calling_f()

def calling_f():
    answer = input("Continue? (y/n)")
    if not answer == 'n':
        callback_f()
        
callback_f()


Thanks for the example! It helped me a lot! - invisibleufo101
Could you elaborate on how to implement this in the example above? - Jossy
