The Perfect Web Crawling Stack

Jan 7th, 2021

Here is a freewheeling fantasy list of things I would have in the perfect web crawling/web scraping stack.

1. I would use Scrapy to write spiders.

2. Amazon t3.xlarge EC2 servers, each with 4 vCPUs and 16 GB of RAM, on Amazon’s high-speed data network.

3. Maxed-out concurrency for each spider, since a rotating proxy network like Proxies API would spread the requests across many IP addresses.

4. Scrapy’s signals API to capture the status of each spider and record it in a MySQL database.

5. A simple PHP interface to query the records and to start, pause, and cancel spiders.

6. Scrapyd to run multiple spiders at a time.

7. CSS Selectors to scrape the data.

8. Data exported as JSON and stored on Amazon S3, using Scrapy’s built-in support for both.
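
To make items 3 and 8 a little more concrete, here is a rough sketch of what the relevant part of a Scrapy settings.py might look like. The setting names (CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY, AUTOTHROTTLE_ENABLED, FEEDS, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) are Scrapy's own, and FEEDS assumes Scrapy 2.1 or later. Everything else is an assumption for illustration: the project name, the bucket name, and the concurrency numbers, which are placeholders rather than tuned values.

```python
# settings.py (sketch): Scrapy reads plain module-level names from this file.
import os

BOT_NAME = "perfect_stack"  # hypothetical project name

# Item 3: push concurrency high. These numbers are illustrative only;
# the right values depend on the proxy plan and the target sites.
CONCURRENT_REQUESTS = 256
CONCURRENT_REQUESTS_PER_DOMAIN = 256
DOWNLOAD_DELAY = 0            # no artificial pause between requests
AUTOTHROTTLE_ENABLED = False  # do not let AutoThrottle slow the crawl

# Item 8: export items as JSON straight to S3 through feed exports.
# Credentials come from the environment instead of being hard-coded.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID", "")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "")

FEEDS = {
    # "example-bucket" is a hypothetical bucket name. Scrapy fills in
    # %(name)s with the spider name and %(time)s with a timestamp.
    "s3://example-bucket/%(name)s/%(time)s.json": {
        "format": "json",
        "encoding": "utf8",
    },
}
```

With settings along these lines, each run of a spider would write its own timestamped JSON file under a folder named after the spider. Writing to S3 also needs an AWS client library installed alongside Scrapy (botocore or boto3, depending on the Scrapy version).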

The author is the founder of Proxies API, the rotating proxies service.