Routing multiple spiders through different pipelines - あぼぼーぼ・ぼーぼぼ

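Scrapy requires every pipeline in use to be declared, and when written the ordinary way, every spider passes its items through all of the declared pipelines. This is a note on an implementation that lets each spider go through only the pipelines designated for it.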

As an example, define two spiders, slack_bot and tweet_bot.

import scrapy


class SlackBotSpider(scrapy.Spider):
    name = 'slack_bot'  # this value is used for the check

    def parse(self, response):
        # Scraping logic omitted; yield the scraped items here.
        pass

import scrapy


class TweetBotSpider(scrapy.Spider):
    name = 'tweet_bot'  # this value is used for the check

    def parse(self, response):
        # Scraping logic omitted; yield the scraped items here.
        pass

Suppose one dedicated pipeline is defined for each of slack_bot and tweet_bot, plus one shared pipeline. Then it is enough to check the value of spider.name inside process_item of each dedicated pipeline, as follows.

class SlackPipeline:
    def process_item(self, item, spider):
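        if spider.name not in ['slack_bot']:
            # Not the target spider: pass the item through untouched.
            return item

        print('SlackPipeline')
        # Return (not yield) so the next pipeline receives the item itself.
        return item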

~~~ omitted ~~~

class TweetPipeline:
    def process_item(self, item, spider):
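        if spider.name not in ['tweet_bot']:
            # Not the target spider: pass the item through untouched.
            return item

        print('TweetPipeline')
        # Return (not yield) so the next pipeline receives the item itself.
        return item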

~~~ omitted ~~~

class SaveFilePipeline:
    def process_item(self, item, spider):
        # Shared pipeline: runs for every spider, so no name check.
        print('SaveFilePipeline')
        return item
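
If many pipelines need the same guard, the name check could be factored out into a decorator. The following is only a minimal sketch and not part of the setup described here: `only_for` is a hypothetical helper name, the `handled_by` key is made up for illustration, and `SimpleNamespace` stands in for a real spider object so that the snippet runs without Scrapy installed.

```python
import functools
from types import SimpleNamespace


def only_for(*spider_names):
    """Hypothetical helper: run process_item only for the named spiders."""
    def decorator(process_item):
        @functools.wraps(process_item)
        def wrapper(self, item, spider):
            if spider.name not in spider_names:
                # Not a target spider: pass the item through untouched.
                return item
            return process_item(self, item, spider)
        return wrapper
    return decorator


class SlackPipeline:
    @only_for('slack_bot')
    def process_item(self, item, spider):
        # Illustrative marker so the effect is visible in the result.
        item['handled_by'] = 'SlackPipeline'
        return item


# Stand-ins for real spiders; only the name attribute matters here.
slack_spider = SimpleNamespace(name='slack_bot')
tweet_spider = SimpleNamespace(name='tweet_bot')

slack_result = SlackPipeline().process_item({}, slack_spider)
tweet_result = SlackPipeline().process_item({}, tweet_spider)
print(slack_result)  # {'handled_by': 'SlackPipeline'}
print(tweet_result)  # {}
```

The behavior is the same as the inline `if` check; the decorator only keeps the list of target spider names next to the method it guards.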

All of the pipelines must be listed in settings.py. The number is the priority: pipelines with lower values run first.

ITEM_PIPELINES = {
    'my_crawler.pipelines.SlackPipeline': 300,
    'my_crawler.pipelines.TweetPipeline': 400,
    'my_crawler.pipelines.SaveFilePipeline': 500,
}

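With this, running slack_bot goes through SlackPipeline -> SaveFilePipeline, and running tweet_bot goes through TweetPipeline -> SaveFilePipeline. (Strictly speaking, process_item() of every pipeline is still called, so it is more accurate to say that the processing is skipped.)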

$ scrapy crawl slack_bot
SlackPipeline
SaveFilePipeline

$ scrapy crawl tweet_bot
TweetPipeline
SaveFilePipeline
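
The point that every process_item() is still called, and that the unwanted ones merely pass the item through, can be checked with a small simulation. This is a simplified stand-in for the pipeline chain, not Scrapy's actual implementation: `run_pipelines`, the `PIPELINES` dict, and the `log` key are assumptions made up for this sketch, and `SimpleNamespace` replaces a real spider.

```python
from types import SimpleNamespace


class SlackPipeline:
    def process_item(self, item, spider):
        if spider.name not in ['slack_bot']:
            return item  # skipped: item passes through unchanged
        item['log'].append('SlackPipeline')
        return item


class TweetPipeline:
    def process_item(self, item, spider):
        if spider.name not in ['tweet_bot']:
            return item  # skipped: item passes through unchanged
        item['log'].append('TweetPipeline')
        return item


class SaveFilePipeline:
    def process_item(self, item, spider):
        item['log'].append('SaveFilePipeline')
        return item


# Same priorities as in ITEM_PIPELINES; lower values run first.
PIPELINES = {SlackPipeline: 300, TweetPipeline: 400, SaveFilePipeline: 500}


def run_pipelines(spider_name):
    """Call every pipeline in priority order, as a rough model of the chain."""
    spider = SimpleNamespace(name=spider_name)
    item = {'log': []}
    called = []
    for cls in sorted(PIPELINES, key=PIPELINES.get):
        called.append(cls.__name__)
        item = cls().process_item(item, spider)
    return called, item['log']


slack_called, slack_log = run_pipelines('slack_bot')
tweet_called, tweet_log = run_pipelines('tweet_bot')
print(slack_called)  # ['SlackPipeline', 'TweetPipeline', 'SaveFilePipeline']
print(slack_log)     # ['SlackPipeline', 'SaveFilePipeline']
print(tweet_log)     # ['TweetPipeline', 'SaveFilePipeline']
```

All three classes appear in `slack_called`, while only the ones that did real work appear in the log, which matches the output of the two crawl commands.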

Reference: https://groups.google.com/d/msg/scrapy-users/msKQ7UaYh_E/ee8WSMPRpq0J