Feb 6th, 2021

5 Rules For Writing A Web Scraper That Doesn’t Break


1. Use a crawling and scraping framework like Scrapy

Don’t try to reinvent the wheel. Frameworks like Scrapy abstract away many of the complex parts of web crawling, such as concurrency, rate limiting, cookie handling, link extraction, file pipelines, and broken or inconsistent encodings, to make life easier.

Scrapy also makes life easier by providing built-in selector support for extracting content.
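To make that concrete, here is a minimal sketch of a complete Scrapy spider run against quotes.toscrape.com, Scrapy's public practice site; the spider name and output fields are illustrative:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """A complete spider: Scrapy supplies scheduling, concurrency,
    retries, encoding handling, and request de-duplication for free."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract each quote block with a CSS selector
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination; Scrapy throttles and de-duplicates requests
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, this runs with scrapy runspider quotes_spider.py -o quotes.json and writes its items to quotes.json via Scrapy's feed exports.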

2. Learn & Use XPath or CSS selectors

Instead of using RegEx or some other rudimentary custom method to get at the data you want to scrape, use CSS selectors, XPath, or a combination of both. Selectors make your code more stable and more resilient to cosmetic changes in a website’s markup.
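For example, with parsel, the selector library Scrapy uses under the hood (the HTML snippet here is made up for illustration):

```python
from parsel import Selector

html = """
<div id="products">
  <div class="product"><h2>Widget</h2><span class="price">$9.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">$19.99</span></div>
</div>
"""
sel = Selector(text=html)

# CSS: terse and readable for class- and id-based targeting
names = sel.css("div.product h2::text").getall()
# -> ['Widget', 'Gadget']

# XPath: more expressive, e.g. anchoring on text content
price = sel.xpath('//div[h2[text()="Widget"]]/span[@class="price"]/text()').get()
# -> '$9.99'
```

Either notation survives a site reordering its attributes or shuffling whitespace, which is exactly where RegEx tends to break.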

3. Scale using Scrapyd and Rotating Proxies

Scrapyd allows you to run multiple spiders at the same time and manage them easily. Combine it with a rotating proxy service like Proxies API and you can scale your project to dramatic speeds, breaking past the concurrency limits of sequential code without incurring usage restrictions or IP blocks.
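Scrapyd is driven over a plain HTTP JSON API. Here is a sketch of scheduling several jobs at once, assuming a Scrapyd daemon on its default port 6800 and a project already deployed with scrapyd-deploy; the project, spider, and argument names are placeholders:

```python
import requests

SCRAPYD = "http://localhost:6800"  # Scrapyd's default port

# Schedule several runs of the same spider; Scrapyd queues them and
# runs them in parallel, up to its configured process limit.
for category in ["books", "music", "films"]:  # placeholder spider argument
    resp = requests.post(f"{SCRAPYD}/schedule.json", data={
        "project": "myproject",  # placeholder: a deployed Scrapyd project
        "spider": "catalog",     # placeholder spider name
        "category": category,    # extra fields are passed as spider arguments
    })
    print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}

# Check what is pending, running, and finished
jobs = requests.get(f"{SCRAPYD}/listjobs.json", params={"project": "myproject"})
print(jobs.json())
```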

4. Take measures to counter usage restrictions and IP blocks

You might have finally written the perfect scraper that gets every piece of information and handles pagination, markup variations, JavaScript rendering, etc., but it can all come to naught if you get IP blocked. Rotating proxies like Proxies API are the way to overcome this. There is no way around it for serious projects of any decent size, frequency, and importance.
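Rotation services typically expose a fetch-through endpoint: you hand it the target URL and it returns the page from a fresh IP on every call, so per-IP rate limits stop accumulating against you. Here is a minimal sketch of that pattern; the endpoint, parameter names, and key below are placeholders, so check your provider’s documentation for the real ones:

```python
import requests

# Placeholder endpoint and key: substitute your provider's real values.
API_ENDPOINT = "http://api.proxiesapi.com/"
AUTH_KEY = "YOUR_AUTH_KEY"

def fetch_via_rotating_proxy(url):
    """Fetch a URL through the rotation service; each call can exit
    from a different IP, so per-IP blocks and limits don't build up."""
    resp = requests.get(API_ENDPOINT, params={"auth_key": AUTH_KEY, "url": url})
    resp.raise_for_status()
    return resp.text

html = fetch_via_rotating_proxy("https://example.com/page/1")
```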

5. Put in checks and balances

There are so many failure points in web crawling projects that you have no control over. It’s best to put in a bunch of checks and balances, starting by identifying the failure points, like the ones below (a sketch of such checks follows the list):

a. Loss of internet connectivity on either end.

b. Usage restrictions imposed by the target website.

c. IP blocks imposed by the target website.

d. The target website changes its HTML.

e. The target website is down.

f. The target website issues a CAPTCHA challenge.

g. The target website changes its pagination rules.

h. The target website starts requiring cookies.

i. The target website hides content behind JavaScript.
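As promised, here is a minimal sketch, using requests, of what such checks can look like: it maps common HTTP symptoms back to the failure points above (the letters in the comments refer to the list). The sentinel-string check is a crude, illustrative stand-in for a proper selector test:

```python
import requests
from requests.exceptions import ConnectionError, Timeout

def fetch_with_checks(url, sentinel=None):
    """Fetch a page and report the failure modes listed above
    instead of silently returning bad data. Returns (html, problems)."""
    problems = []
    try:
        resp = requests.get(url, timeout=15)
    except (ConnectionError, Timeout):
        return None, ["request never completed: connectivity loss or site down (a/e)"]

    if resp.status_code in (403, 429):
        problems.append("usage restriction or IP block suspected (b/c)")
    elif resp.status_code >= 500:
        problems.append("target website is down (e)")

    if "captcha" in resp.text.lower():
        problems.append("CAPTCHA challenge served (f)")

    # A sentinel is some string that should always appear on a healthy
    # page; its absence suggests an HTML change or JS-only content (d/i).
    if sentinel and sentinel not in resp.text:
        problems.append("expected markup missing (d/i)")

    return resp.text, problems

html, problems = fetch_with_checks("https://example.com/catalog", sentinel="product-grid")
if problems:
    print("check(s) failed:", problems)
```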

The author is the founder of Proxies API, a proxy rotation API service.
