I'm currently scraping a website using Scrapy 0.24. The website has the following URL format:
www.site.com?category={0}&item={1}&page={2}
I have a MySQLStorePipeline that is responsible for storing each scraped item in the database. But I have 80 categories, 10 items and 15 pages, which results in 80 * 10 * 15 = 12000 pages. Each page yields 25 scrapy.Items, which gives 25 * 12000 = 300000 rows in the database.
Right now, every time the pipeline receives an item, it inserts it into the database, which is inefficient. I'm looking for a way to buffer the pipeline items and, for example, execute a bulk insert once 1000 items have been received. How can I achieve that?
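Have the pipeline store items in a list, and insert them in bulk when the list reaches a certain length and again when the spider closes. Here insert_to_database is a method you implement yourself to perform the actual bulk insert.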
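class Pipeline(object):
    def __init__(self):
        super(Pipeline, self).__init__()
        self.items = []  # buffer of items waiting to be written

    def process_item(self, item, spider):
        self.items.append(item)
        # Flush once the buffer reaches the batch size.
        if len(self.items) >= 1000:
            self.insert_current_items()
        return item

    def insert_current_items(self):
        # Swap in a fresh list first so a batch is never written twice.
        items = self.items
        self.items = []
        if items:  # skip the database call when nothing is buffered
            self.insert_to_database(items)

    def close_spider(self, spider):
        # Write whatever is left in the buffer when the spider finishes.
        self.insert_current_items()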
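The insert_to_database step is left open above. One possible way to write it is a single executemany() call per batch, sketched below as a standalone function so it can be run on its own. Everything specific here is an assumption for illustration: the `items` table, its `name` and `price` columns, the dict-like items, and the `placeholder` parameter are all hypothetical, and sqlite3 is only a stand-in so the example runs without a MySQL server. MySQL drivers such as MySQLdb typically use `%s` as the parameter placeholder instead of `?`.

```python
import sqlite3


def insert_to_database(connection, items, placeholder="?"):
    """Bulk-insert buffered items with one executemany() call.

    Assumptions (not from the original post): a table named `items` with
    columns `name` and `price`, and dict-like items carrying those keys.
    """
    sql = "INSERT INTO items (name, price) VALUES ({0}, {0})".format(placeholder)
    rows = [(item["name"], item["price"]) for item in items]
    cursor = connection.cursor()
    try:
        cursor.executemany(sql, rows)
        connection.commit()  # one commit per batch instead of one per item
    except Exception:
        connection.rollback()  # drop the partial batch, then re-raise
        raise
    finally:
        cursor.close()
    return len(rows)


# Stand-in database so the sketch runs anywhere; with MySQL you would pass
# your own connection and placeholder="%s".
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE items (name TEXT, price REAL)")

buffered = [{"name": "item-%d" % i, "price": float(i)} for i in range(1000)]
inserted = insert_to_database(connection, buffered)
row_count = connection.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Inside the pipeline this would become a method that reads the connection from self, for example one opened in open_spider and closed in close_spider.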