Feb 8th, 2021

5 Rules for Writing a Web Crawler That Doesn't Break


1. Use a framework like Scrapy

Don't reinvent the wheel. Frameworks like Scrapy handle many of the complex parts of web scraping for you: concurrency, rate limiting, cookies, link extraction, file pipelines, and broken or inconsistent encodings. If you start from scratch, you will run into that complexity sooner than you expect.
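To make that concrete, here is a sketch of how several of those concerns reduce to a few lines in a Scrapy project's settings.py. The setting names are Scrapy's own; the values and the "downloads" folder are illustrative assumptions, not recommendations.

```python
# settings.py (sketch). All values below are assumptions chosen for illustration.

# Concurrency: how many requests run in parallel, overall and per domain.
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Rate limiting: a fixed base delay, plus AutoThrottle to back off under load.
DOWNLOAD_DELAY = 0.5
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_MAX_DELAY = 30

# Cookies and retries are handled by built-in middleware.
COOKIES_ENABLED = True
RETRY_ENABLED = True
RETRY_TIMES = 3

# File pipeline: downloads files referenced by scraped items.
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "downloads"  # hypothetical local folder
```

Writing the equivalent of each of these by hand is where most home-grown crawlers start to get complicated.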

Figure: Scrapy architecture overview (source: Scrapy 2.0.1 documentation)

2. Learn and use XPath or CSS selectors

Instead of using regular expressions or some other rudimentary custom method to get to the data you want to scrape, use CSS selectors, XPath, or a combination of both. Selectors target the document’s structure rather than its exact text, so your code is more stable and survives many incidental changes in a website’s markup.
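Here is a small sketch of the difference. It uses Python's standard library XML parser, which supports a subset of XPath, so it runs without extra packages; in a Scrapy spider you would typically call response.xpath() or response.css() instead. The page fragments are made up for the example and are kept well-formed so the XML parser accepts them.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment as it looked when the scraper was written.
ORIGINAL = """<html><body>
<div class="product"><h2 class="title">Blue Widget</h2><span class="price">19.99</span></div>
</body></html>"""

# The same data after a cosmetic redesign: extra attributes and new whitespace.
REDESIGNED = """<html><body>
<div class="product">
  <h2 id="t1" class="title">Blue Widget</h2>
  <span class="price">19.99</span>
</div>
</body></html>"""

# A regex tied to the exact text of the opening tag.
TITLE_REGEX = re.compile(r'<h2 class="title">(.*?)</h2>')


def title_by_regex(html):
    """Return the product title using a regex, or None if it no longer matches."""
    match = TITLE_REGEX.search(html)
    return match.group(1) if match else None


def title_by_xpath(html):
    """Return the product title using an XPath expression on the parsed tree."""
    root = ET.fromstring(html)
    node = root.find(".//div[@class='product']/h2[@class='title']")
    return node.text if node is not None else None


for name, page in (("original", ORIGINAL), ("redesigned", REDESIGNED)):
    print(name, "regex:", title_by_regex(page), "xpath:", title_by_xpath(page))
```

The regex stops matching as soon as the tag gains an id attribute, while the XPath expression keeps working because it describes where the element sits, not how its tag is spelled.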

3. Scale using Scrapyd and Rotating Proxies

Scrapyd allows you to run multiple spiders at the same time and manage them easily. Combining it with a rotating proxy lets you scale well beyond what a single sequential crawler can do, without running into usage restrictions or IP blocks.
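As a rough sketch, spiders are scheduled on Scrapyd by POSTing to its schedule.json endpoint. The example below assumes a local Scrapyd on its default port 6800; the project name, spider names, and the category argument are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SCRAPYD_URL = "http://localhost:6800"  # assumption: local Scrapyd, default port


def build_schedule_request(project, spider, **spider_args):
    """Build (but do not send) the POST request that schedules one spider run."""
    payload = {"project": project, "spider": spider, **spider_args}
    body = urlencode(payload).encode("utf-8")
    return Request(f"{SCRAPYD_URL}/schedule.json", data=body, method="POST")


def schedule(project, spider, **spider_args):
    """Send the request and return Scrapyd's JSON reply as a dict."""
    request = build_schedule_request(project, spider, **spider_args)
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical project and spiders; each call queues an independent job.
    for spider_name in ("products", "reviews"):
        print(schedule("shop", spider_name, category="widgets"))
```

Each call queues a separate job, so several spiders can run side by side without any threading code of your own.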

4. Take measures to counter usage restrictions and IP blocks

Rotating proxies, such as Proxies API, are the way to do it. For serious projects of any decent size, frequency, and importance, there is no way around them.
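If you are curious what rotation means mechanically, here is a minimal round-robin sketch. It is not how Proxies API or any other service works internally; the proxy addresses are placeholders, and a hosted service would normally do this for you behind a single endpoint.

```python
class ProxyRotator:
    """Hands out proxies in round-robin order and drops ones reported as blocked."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        if not self._proxies:
            raise ValueError("at least one proxy is required")
        self._index = 0

    def next_proxy(self):
        """Return the next proxy in the rotation."""
        if not self._proxies:
            raise RuntimeError("no healthy proxies left")
        proxy = self._proxies[self._index % len(self._proxies)]
        self._index += 1
        return proxy

    def mark_blocked(self, proxy):
        """Remove a proxy from the pool, for example after repeated 403 responses."""
        if proxy in self._proxies:
            self._proxies.remove(proxy)


# Placeholder addresses from a documentation-only IP range.
rotator = ProxyRotator([
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
])

for _ in range(4):
    print(rotator.next_proxy())
```

In a Scrapy spider, the chosen address would typically be attached to each request through request.meta["proxy"].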

5. Put in checks and balances

There are many failure points in a web crawling project that you have no control over. It’s best to put in a set of checks and balances by first identifying them, such as:

a. Loss of internet connectivity on either end.

b. Usage restrictions imposed.

c. IP blocks imposed.

d. The target website changes its HTML.

e. The target website is down.

f. The target website issues a CAPTCHA challenge.
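One way to act on a list like this is to classify every response before parsing it, so each failure gets its own log entry and its own reaction. The sketch below is only a starting point: the status-code rules, the "captcha" keyword check, and the required_marker parameter are assumptions you would tune for each target site.

```python
from enum import Enum


class Failure(Enum):
    NONE = "ok"
    CONNECTIVITY = "connectivity lost"
    RATE_LIMITED = "usage restriction"
    IP_BLOCKED = "ip blocked"
    SITE_DOWN = "site down"
    CAPTCHA = "captcha challenge"
    LAYOUT_CHANGED = "html changed"


def classify(status, body, error=None, required_marker='class="price"'):
    """Map one fetch outcome to a failure category using simple heuristics."""
    if error is not None:
        return Failure.CONNECTIVITY      # no response at all (a, and sometimes e)
    if "captcha" in body.lower():
        return Failure.CAPTCHA           # challenge page instead of content (f)
    if status == 429:
        return Failure.RATE_LIMITED      # too many requests (b)
    if status == 403:
        return Failure.IP_BLOCKED        # forbidden, often an IP block (c)
    if status >= 500:
        return Failure.SITE_DOWN         # server-side error (e)
    if required_marker not in body:
        return Failure.LAYOUT_CHANGED    # expected markup is missing (d)
    return Failure.NONE


# Hypothetical outcomes: (status, body, error)
samples = [
    (200, '<span class="price">19.99</span>', None),
    (200, "<div>Please solve this CAPTCHA</div>", None),
    (429, "Too Many Requests", None),
    (200, '<span class="cost">19.99</span>', None),
    (None, "", TimeoutError("timed out")),
]

for status, body, error in samples:
    print(status, classify(status, body, error).value)
```

Once failures are labeled, reacting is straightforward: retry later when the site is down, switch proxies on a block, and alert a human when the layout changes.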

The author is the founder of Proxies API, a proxy rotation API service.
