How to pass a parameter to a Scrapy pipeline object

7
After scraping some data with a Scrapy spider:
class Test_Spider(Spider):

    name = "test"
    def start_requests(self):
        for i in range(900,902,1):
            ........
            yield item

I pass the data to a pipeline object, which writes it into a SQLite table using SQLAlchemy:
class SQLlitePipeline(object):

    def __init__(self):
        _engine = create_engine("sqlite:///data.db")
        _connection = _engine.connect()
        _metadata = MetaData()
        _stack_items = Table("table1", _metadata,
                             Column("id", Integer, primary_key=True),
                             Column("detail_url", Text))
        _metadata.create_all(_engine)
        self.connection = _connection
        self.stack_items = _stack_items

    def process_item(self, item, spider):
        is_valid = True

I would like to set the table name through a variable instead of hard-coding it as "table1", as it is now. How can I do that?

3 Answers

11

Assuming you pass the parameter through the command line (for example, -s table="table1"), define a from_crawler method:

@classmethod
def from_crawler(cls, crawler):
    # Here, you get whatever value was passed through the "table" parameter
    settings = crawler.settings
    table = settings.get('table')

    # Instantiate the pipeline with your table
    return cls(table)

def __init__(self, table):
    _engine = create_engine("sqlite:///data.db")
    _connection = _engine.connect()
    _metadata = MetaData()
    _stack_items = Table(table, _metadata,
                         Column("id", Integer, primary_key=True),
                         Column("detail_url", Text))
    _metadata.create_all(_engine)
    self.connection = _connection
    self.stack_items = _stack_items
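To see how this wiring fits together without running a full crawl, the pattern can be mimicked with stub objects (FakeSettings and FakeCrawler are hypothetical stand-ins for the objects Scrapy supplies; Scrapy itself is not required for this sketch):

```python
class FakeSettings:
    """Stand-in for crawler.settings, just enough for settings.get()."""
    def __init__(self, values):
        self._values = values

    def get(self, key, default=None):
        return self._values.get(key, default)


class FakeCrawler:
    """Stand-in for the crawler object Scrapy passes to from_crawler."""
    def __init__(self, settings):
        self.settings = FakeSettings(settings)


class TablePipeline:
    def __init__(self, table):
        # Store the table name instead of hard-coding "table1"
        self.table = table

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the running crawler;
        # here we read the "table" setting passed via -s table=...
        return cls(crawler.settings.get("table"))


pipeline = TablePipeline.from_crawler(FakeCrawler({"table": "table1"}))
print(pipeline.table)  # -> table1
```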

Thank you very much. Can the table also be set from inside the spider? - user1592380
Yes, and you should open the connection in open_spider rather than in __init__. (https://doc.scrapy.org/en/latest/topics/item-pipeline.html#open_spider) - lucasnadalutti
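The lifecycle the comment refers to can be sketched as follows. This uses Python's built-in sqlite3 module for brevity so the example is self-contained; the SQLAlchemy setup from the answer above would go into the same open_spider/close_spider hooks:

```python
import sqlite3


class SQLitePipeline:
    def __init__(self, table):
        self.table = table
        self.connection = None

    def open_spider(self, spider):
        # Called once when the crawl starts -- open the connection here,
        # not in __init__.
        self.connection = sqlite3.connect(":memory:")
        self.connection.execute(
            f"CREATE TABLE IF NOT EXISTS {self.table} "
            "(id INTEGER PRIMARY KEY, detail_url TEXT)"
        )

    def process_item(self, item, spider):
        self.connection.execute(
            f"INSERT INTO {self.table} (detail_url) VALUES (?)",
            (item["detail_url"],),
        )
        return item

    def close_spider(self, spider):
        # Called once when the crawl ends -- commit and clean up here.
        self.connection.commit()
        self.connection.close()


pipeline = SQLitePipeline("table1")
pipeline.open_spider(None)
pipeline.process_item({"detail_url": "http://example.com"}, None)
```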

6
A simpler way is to pass the argument as a spider argument (using the spider name "test" from the question):
scrapy crawl test -a table=table1

Then read the value via spider.table:

class TestScrapyPipeline(object):
    def process_item(self, item, spider):
        table = spider.table
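Scrapy stores each -a argument as an attribute on the spider instance, which is why spider.table works. The mechanism can be sketched with a hypothetical FakeSpider stub (no Scrapy needed):

```python
class FakeSpider:
    """Stand-in for a spider: Scrapy sets -a arguments as attributes."""
    def __init__(self, table):
        self.table = table


class TestScrapyPipeline:
    def process_item(self, item, spider):
        # The value passed on the command line via "-a table=..."
        table = spider.table
        # Hypothetical bookkeeping, just to make the flow visible
        item["stored_in"] = table
        return item


item = TestScrapyPipeline().process_item({}, FakeSpider("table1"))
print(item["stored_in"])  # -> table1
```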

5
class SQLlitePipeline(object):

    def __init__(self, table_name):

        _engine = create_engine("sqlite:///data.db")
        _connection = _engine.connect()
        _metadata = MetaData()
        _stack_items = Table(table_name, _metadata,
                             Column("id", Integer, primary_key=True),
                             Column("detail_url", Text))
        _metadata.create_all(_engine)
        self.connection = _connection
        self.stack_items = _stack_items

    @classmethod
    def from_crawler(cls, crawler):
        table_name = getattr(crawler.spider, 'table_name')
        return cls(table_name)

With from_crawler, the pipeline is instantiated from whatever argument was specified, rather than with a hard-coded table name.
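This variant reads the attribute off the spider instead of the settings. A stub sketch of the same wiring (FakeSpider and FakeCrawler are hypothetical; a getattr default is added here so a spider that never sets the attribute does not raise AttributeError):

```python
class FakeSpider:
    # Scrapy would set this from "-a table_name=..."
    table_name = "table1"


class FakeCrawler:
    """Stand-in for the crawler: exposes the spider instance."""
    spider = FakeSpider()


class SQLlitePipeline:
    def __init__(self, table_name):
        self.table_name = table_name

    @classmethod
    def from_crawler(cls, crawler):
        # Fall back to "table1" if the spider did not define the attribute
        return cls(getattr(crawler.spider, "table_name", "table1"))


pipeline = SQLlitePipeline.from_crawler(FakeCrawler())
print(pipeline.table_name)  # -> table1
```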

Content provided by Stack Overflow.