Passing arguments to process.crawl in Scrapy Python

36

I would like to get the same result as this command line: scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json

My script is as follows:

import scrapy
from linkedin_anonymous_spider import LinkedInAnonymousSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

spider = LinkedInAnonymousSpider(None, "James", "Bond")
process = CrawlerProcess(get_project_settings())
process.crawl(spider) ## <-------------- (1)
process.start()

I found that process.crawl() at (1) is creating another LinkedInAnonymousSpider whose first and last are both None (printed out at (2)). If that is the case, creating the spider object is pointless. How can I pass the arguments first and last to process.crawl()?

linkedin_anonymous:

from logging import INFO

import scrapy

class LinkedInAnonymousSpider(scrapy.Spider):
    name = "linkedin_anonymous"
    allowed_domains = ["linkedin.com"]
    start_urls = []

    base_url = "https://www.linkedin.com/pub/dir/?first=%s&last=%s&search=Search"

    def __init__(self, input=None, first=None, last=None):
        self.input = input  # source file name
        self.first = first
        self.last = last

    def start_requests(self):
        print(self.first)  ## <------------- (2)
        if self.first and self.last:  # taking input from command line parameters
            url = self.base_url % (self.first, self.last)
            yield self.make_requests_from_url(url)

    def parse(self, response): ...
4 Answers

69

Pass the spider arguments in the process.crawl method:

process.crawl(spider, input='inputargument', first='James', last='Bond')
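To see why this works without a Scrapy install, here is a minimal stand-in sketching the mechanism: Scrapy's crawl() receives the spider class plus keyword arguments and instantiates the spider itself, forwarding those kwargs into __init__. The FakeProcess class below is purely illustrative, not Scrapy's actual implementation.

```python
class Spider:
    """Minimal stand-in for scrapy.Spider's argument handling."""

    def __init__(self, input=None, first=None, last=None):
        self.input = input
        self.first = first
        self.last = last

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Scrapy builds the spider from the class, passing crawl() kwargs through
        return cls(*args, **kwargs)


class FakeProcess:
    """Illustrative stand-in for CrawlerProcess: it instantiates the class itself."""

    def crawl(self, spider_cls, **kwargs):
        return spider_cls.from_crawler(None, **kwargs)


process = FakeProcess()
spider = process.crawl(Spider, input='inputargument', first='James', last='Bond')
print(spider.first, spider.last)  # → James Bond
```

This is also why passing an already-constructed spider instance (as in the question) is ignored: crawl() expects the class and constructs a fresh spider with the kwargs you give it.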

5
By doing this, won't we lose the ability to pass "-o output.json"? - hAcKnRoCk
2
@hAcKnRoCk How to configure the output file is shown here: https://dev59.com/7J_ha4cB1Zd3GeqPvj-I#42301595 - Anton Rodin
To specify the output file: process = CrawlerProcess(settings={"FEEDS": {"items.json": {"format": "json"}}}) - hafiz031

6
You can simply do this:
from scrapy import cmdline

cmdline.execute("scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json".split())

3
If you have Scrapyd installed and want to schedule the spider, send a POST request with curl to http://localhost:6800/schedule.json, passing the project name, the spider name, and the extra arguments (first and last) as form data.
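A sketch of that curl call, assuming Scrapyd is running on its default port 6800 and the project is deployed under the hypothetical name projectname:

```shell
# Schedule the spider via Scrapyd's HTTP API.
# "projectname" is a placeholder for your actual deployed project name.
curl http://localhost:6800/schedule.json \
     -d project=projectname \
     -d spider=linkedin_anonymous \
     -d first=James \
     -d last=Bond
```

Any -d pair beyond project and spider is forwarded to the spider as a spider argument, just like -a on the command line.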

0

Try this:

import os

first_name = "DemoFirstName"
last_name = "DemoLastName"

os.system(f"""scrapy crawl linkedin_anonymous \
                      -a first={first_name} \
                      -a last={last_name} \
                      -o output.json""")

Don't put any spaces around the =.
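A variant of the same idea that sidesteps the quoting and whitespace pitfalls: build the command as a list and hand it to subprocess instead of os.system. This is a sketch, not part of the original answer; the actual subprocess.run call is commented out so the snippet runs without Scrapy installed.

```python
import subprocess

first_name = "DemoFirstName"
last_name = "DemoLastName"

# A list of arguments avoids shell quoting entirely, so spaces in
# the values cannot break the command.
cmd = [
    "scrapy", "crawl", "linkedin_anonymous",
    "-a", f"first={first_name}",
    "-a", f"last={last_name}",
    "-o", "output.json",
]
print(cmd)
# subprocess.run(cmd, check=True)  # uncomment to actually run the crawl
```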

