Web Crawling - Where Do I Start?

May 7th, 2020

This is an excellent question, because well begun is half the battle.

The choice of framework depends on your immediate needs and on the kind of website you need to scrape. The limitations that site imposes will make a big difference as well.

If you have no particular constraints there, I am always tempted to say Scrapy.

The Scrapy framework provides abstractions for all sorts of common crawling issues: concurrent requests, running multiple spiders with Scrapyd, managing crawling rules, obeying site constraints like rate limiting, managing data with pipelines, and finally support for XPath and CSS selectors for scraping the data quickly.
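To make a few of those abstractions concrete, here is a rough sketch of what a Scrapy project's settings.py might look like. The setting names are standard Scrapy settings, but the values are only illustrative assumptions, and the bot name and pipeline path (myproject.pipelines.CleanItemPipeline) are hypothetical placeholders rather than anything from a real project.

```python
# settings.py (illustrative values only; tune them for the site you crawl)

# Hypothetical project name, assumed for this sketch.
BOT_NAME = "myproject"

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Concurrency: total requests in flight, and the cap per domain.
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Basic rate limiting: wait this many seconds between requests to a site.
DOWNLOAD_DELAY = 1.0

# AutoThrottle adjusts the delay based on how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0

# Item pipelines run in ascending order of their number (0 to 1000).
# The class path below is a hypothetical placeholder.
ITEM_PIPELINES = {
    "myproject.pipelines.CleanItemPipeline": 300,
}
```

Starting with a low per-domain concurrency and a non-zero delay is usually the polite default; you can loosen both once you know how the site behaves.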

But if you are dealing with a website like TripAdvisor and you want to scrape its reviews, or, ironically, even Quora, you will find that most of the content is loaded using AJAX calls, so Scrapy on its own won't cut it. You will need to use a headless browser to render the JavaScript and then scrape the content. Puppeteer is the best option for this. It allows you to fully control an instance of Chromium using Node.js.

Between these two options, you are pretty well set. But production-level web scraping is mostly not about coding. It is about reliability. If you run a web crawler at any scale, you want to run it frequently, and the data is mission-critical, you will find that most web servers are savvy enough to detect, warn, and block your crawler quite easily.

YELPPPP. I mean HELPPPP

To overcome this, you will need to use a proxy service and write your crawler so that it rotates between a few proxies every now and then. There are some places on the internet where you can get a list of active proxies for free, like https://www.proxynova.com/proxy-server-list/, or you can go for a professional rotating proxy API service like Proxies API. Full disclosure: I am the founder of Proxies API.
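As a minimal sketch of the rotation idea, assuming simple round-robin is good enough, the snippet below cycles through a small proxy list using only the Python standard library. The proxy addresses are placeholders from a documentation-only IP range, and ProxyRotator and fetch are hypothetical helper names made up for this example, not part of any library or of Proxies API.

```python
import itertools
import urllib.request

# Placeholder proxy addresses (assumed for illustration; swap in real ones).
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


class ProxyRotator:
    """Hands out proxies from a fixed list in round-robin order."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        """Return the next proxy URL, wrapping around at the end of the list."""
        return next(self._cycle)


def fetch(url, rotator, timeout=10):
    """Fetch a URL through whichever proxy is next in the rotation."""
    proxy = rotator.next_proxy()
    # Route both plain HTTP and HTTPS traffic through the chosen proxy.
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

You would create one rotator, for example rotator = ProxyRotator(PROXIES), and pass it to every fetch call. A real crawler would likely also drop proxies that keep failing and retry the request on the next one, which this sketch leaves out.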
