A practical, production-ready solution for running scrapers as Celery tasks
Some time ago, I accepted a project to build a portal for commercial users to scrape product details by submitting links to the product listings. The portal was built entirely with Python and Scrapy.
As scraping pages typically takes time, it is often best to run scrapers in the background and keep users notified of their progress from time to time.
If you’re reading this, chances are you are already familiar with Scrapy and/or Celery. In case you’re new to Scrapy, it is an open-source framework for writing scrapers that extract structured data from websites.
Celery, in short, is a commonly used, well-supported, distributed task queue in Python that allows us to run long-running tasks in the background on a pool of dedicated workers. Think of Celery as the mail processing system in your local post office. The local branch (Celery) takes your mail (tasks) and distributes it to the staff (workers) for delivery.
In this article, I’ll be writing about:
- Why and how to run scrapers without the boilerplate code Scrapy generates
- Reading stats after the execution of your scrapers
- Running scrapers in Celery tasks
- How to prevent Celery from terminating a long-running scraping task
- How to avoid the notorious ReactorNotRestartable exception, and an optional section about the technical details behind it
- How to scrape without exposing your IP address, and how to prevent your scrapers from being blocked
(I’ll be using the terms “crawler” and “scraper” interchangeably.)
If you’re following the tutorial from the official documentation, you’ll be instructed to use the boilerplate code Scrapy generates using scrapy startproject my_project_name. While this provides some code structure for you to build your scrapers, not all the bells and whistles are necessary if your scrapers are intended to be run as Celery tasks.
If you’re building a web app that uses Celery tasks to crawl data like I did, the project structure Scrapy generates might not even fit nicely with the current structure of your web app. Thus, it might be better to tailor your scraper to run in a script.
The alternative to using the boilerplate project Scrapy provides is to run your spiders from a script with Scrapy’s Crawler API.
The latest official documentation demonstrates running Scrapy crawlers using CrawlerProcess. CrawlerProcess provides an easy-to-use interface to run crawlers within a script and is also used internally by the Scrapy CLI (the scrapy command in your terminal). Nonetheless, the code example given there is rather limited if you wish to access the underlying data collected by the Spiders.
Scrapy Spiders collect insightful stats and produce meaningful logs. I’m certain you have seen the summary Scrapy logs at the end of a crawl. Look closely and you’ll find useful information, including the number of items scraped.
In my project, I have to inform the users about the number of items scraped upon completion of each task, so being able to access the generated stats is essential. However, accessing the collected stats can be tricky when you run your spiders with CrawlerProcess.
Hence, I recommend using the lower-level Crawler API instead.
In this approach, run_scrapy() creates a Crawler, starts crawling using MySpider, and logs the number of items scraped.
So far, I have covered why and how to run your scrapers within a script using Crawler as opposed to CrawlerProcess, which lets us access the data collected by the Spiders.
There are many ways to define a Celery task. For simplicity, I’ll define it as shown:
You might be curious about why I use multiprocessing.Process() within the Celery task. It prevents the ReactorNotRestartable exception, which I will describe in a later section.
Celery terminates a task that cannot be completed within its time limit, if one is configured. To prevent your Scrapy crawlers from being terminated, you should lengthen the time limit. To do so, simply pass the time limits as keyword arguments as such:
If you’re not familiar with the difference between soft and hard time limits, feel free to check out the time limits section of the official Celery documentation.
In one of the sections above, I use multiprocessing.Process() to run the scraper in a separate process. The rationale is to avoid the ReactorNotRestartable exception.
I’ll write about the technical details that cause the issue in a separate section. For now, let’s focus on the solution.
With Celery’s default settings, however, this alone does not work. To fix it, we need to change the concurrency model Celery employs. By default, Celery uses the prefork model when creating workers. We have to change it to multi-threading.
To do so, add the argument -P threads when starting Celery workers so that they run in multi-threading mode:
celery -A my_project worker -P threads
With Celery workers running in multi-threading mode, combined with multiprocessing.Process(), ReactorNotRestartable shouldn’t be a problem anymore.
[Optional] Technical details
Feel free to skip this part if you prefer not to dive into the details that cause ReactorNotRestartable and the rationale behind the solution.
ReactorNotRestartable is a by-product of these:
- Scrapy is built on Twisted. The reactor is part of Twisted, and it is the core of how scrapers are run. When a crawling process finishes, the reactor is shut down. Once a reactor is shut down, it cannot be restarted.
- Celery uses the prefork model for its workers by default. This means workers are created before any tasks arrive, and they are reused after a task is completed.
Twisted holds its reactor as global state, and that state is retained after each task because workers are reused. Once the reactor is shut down, there is no way to restart it unless a new worker is created.
My approach to this problem is to ensure a new Reactor is created for each task, which is why I use multiprocessing.Process() to create a separate process to run the crawlers. (A new process = a new Reactor)
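The idea can be illustrated with a toy model that uses only the standard library. OneShotLoop is a made-up stand-in for the reactor, not Twisted itself, and the "fork" start method (POSIX only) is an assumption that keeps the sketch short.

```python
import multiprocessing as mp


class OneShotLoop:
    """Toy stand-in for a reactor: it can be run once per process, never again."""

    def __init__(self):
        self._has_run = False

    def run(self):
        if self._has_run:
            raise RuntimeError("loop is not restartable")
        self._has_run = True


LOOP = OneShotLoop()  # module-level state, shared by every task in a process


def crawl(results):
    LOOP.run()  # runs in the child, on the child's own copy of LOOP
    results.put("crawled")


def crawl_in_child():
    # "fork" (POSIX only) is chosen so the sketch runs without a __main__ guard.
    ctx = mp.get_context("fork")
    results = ctx.Queue()
    child = ctx.Process(target=crawl, args=(results,))
    child.start()
    outcome = results.get(timeout=30)
    child.join()
    return outcome
```

Calling crawl_in_child() repeatedly succeeds because the parent never runs the loop itself, so each child starts from an untouched copy. Calling LOOP.run() twice in the same process raises, which mirrors what happens when a reused worker tries to restart the reactor.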
However, it is not possible to fork a child process from a worker that is itself a child process created via the prefork model. That is why I have to instruct Celery to create workers as separate threads instead of processes (-P threads).
Of course, this is not a panacea either, given the limitations threads face under the notorious GIL, but it is more than sufficient in my case.
Modern websites employ techniques to identify scrapers and block them immediately. It is very likely that your IP address will be blocked permanently by a site once you are detected as a robot. To circumvent this, I personally use ScraperAPI. It is a proxy service with a large pool of IP addresses from distributed locations. It comes with a pip package that can be easily installed and used. My experience with it has been quite pleasant. If you decide to give it a try, feel free to use the promo code: SCRAPE1933074
In this article, I discussed how to run Scrapy crawlers in Celery tasks with solutions to some of the catches you may encounter.
The project has been completed now and my clients have been using it happily for many months without any issues. I hope you’ll find this article useful.
I appreciate any comments and suggestions. 😃