Jan 17th, 2021
1. It gives you an understanding of the evolution of the web

To a web crawler, the web is a scary place. It is like landing a spaceship on an asteroid. Because the original web browsers like Internet Explorer were very lenient with poorly written HTML, the web is full of markup that breaks naive pattern-matching techniques. You learn to appreciate libraries like Beautiful Soup that help tame rogue HTML.
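As a quick illustration, here is a hypothetical snippet of "tag soup" with unclosed and mis-nested tags. Beautiful Soup still builds a usable tree out of it where a regex-based approach would fall apart:

```python
from bs4 import BeautifulSoup

# Hypothetical tag soup: unclosed <p> tags and mis-nested <b>/<i> tags,
# the kind of HTML that lenient browsers taught the web to get away with.
messy = "<html><body><p>First<p>Second <b>bold <i>nested</b> stray</i>"

soup = BeautifulSoup(messy, "html.parser")

# Despite the broken markup, we still get a navigable tree:
# both <p> tags are found, and all the text is recoverable.
print(len(soup.find_all("p")))
print(soup.get_text(" ", strip=True))
```

Note that Beautiful Soup is a tree-builder on top of a parser backend; swapping `"html.parser"` for `lxml` or `html5lib` changes exactly how broken markup is repaired.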

The Web Scraping scene from Armageddon

2. It helps you understand web servers.

Web servers look for several 'tells' to detect that you are not human: missing headers, suspicious user agents, inhuman request rates. It becomes a game of cat and mouse.
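One of the most basic tells is the HTTP client's default User-Agent. A minimal sketch using Python's standard library is below; the header values are illustrative, and real detection also looks at cookies, TLS fingerprints, and request timing, so headers alone are rarely enough:

```python
import urllib.request

# The default urllib User-Agent ("Python-urllib/3.x") is an instant giveaway.
# These values mimic a desktop browser; they are examples, not a guarantee.
browser_headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
}

req = urllib.request.Request("https://example.com/", headers=browser_headers)

# urllib normalizes header names to capitalized form internally.
print(req.get_header("User-agent"))
```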

An angry server.

3. It might teach you a new language

If you want pure power, sooner or later you will realize the power of Scrapy, especially when combined with Scrapyd. A long, long time ago, I had to learn Python to get everything Scrapy offers, and those advantages easily offset the disadvantage of having to learn something new.

4. It tests your ability to think out of the box

Each website is different: some load content with AJAX, and pagination comes in all shapes and sizes. There is no one-size-fits-all solution, so you have to 'crack' each website separately, many times over.
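One recurring pattern, though, is following a "next page" link until the site stops providing one. Here is a minimal sketch with a stubbed-out `fetch` function and hypothetical URLs; in practice `fetch` would make an HTTP request and parse the next-link out of the HTML:

```python
# Hypothetical pages keyed by URL; a real fetch() would do an HTTP GET.
PAGES = {
    "/items?page=1": {"items": ["a", "b"], "next": "/items?page=2"},
    "/items?page=2": {"items": ["c"], "next": None},
}

def fetch(url):
    return PAGES[url]

def crawl_all(start_url):
    """Follow 'next' links until the site stops providing one."""
    items, url, seen = [], start_url, set()
    while url and url not in seen:  # 'seen' guards against pagination loops
        seen.add(url)
        page = fetch(url)
        items.extend(page["items"])
        url = page["next"]
    return items

print(crawl_all("/items?page=1"))  # ['a', 'b', 'c']
```

The loop-guard matters more than it looks: some sites link the last page back to the first, and without `seen` the crawler never terminates.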

5. It challenges you to work at scale

What works in development mode very often crash-lands when you run it at scale. Coding is only the first step: you then have to use concurrency, scale your spiders, handle large amounts of data, maximize the throughput of one or more servers, and use rotating proxies, like Proxies API, to overcome IP blocks.
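Concurrency plus proxy rotation can be sketched with Python's standard library. The proxy addresses and the `scrape` function below are placeholders (the real HTTP call is left as a comment); a rotating-proxy service hands you a fresh IP per request so you don't manage the pool yourself:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
from threading import Lock

# Hypothetical proxy pool, cycled round-robin across requests.
PROXIES = cycle(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
lock = Lock()

def scrape(url):
    with lock:                 # the cycle iterator is shared across threads
        proxy = next(PROXIES)
    # Real code would fetch here, routing the request through `proxy`.
    return (url, proxy)

urls = [f"https://example.com/page/{i}" for i in range(6)]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(scrape, urls))  # result order matches `urls`

print(results)
```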

6. It allows you to tame the chaos

That is what most of your code does: it tries to get data out of something that wasn't built for your bots, and most of the time it is not easy to domesticate.

7. It teaches you to work with frameworks

You will run Scrapy on Scrapyd, along with file pipelines if you have to download images, and use CSS/XPath selectors to make scraping more predictable. Rate limiting, session handling, concurrency: you will learn all of this by trial and error.
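The selector idea is the same whichever framework you use. Here is a sketch with Beautiful Soup's CSS selector support standing in for Scrapy's selectors, against a hypothetical product snippet: selectors target structure rather than byte offsets, so minor layout changes are less likely to break extraction:

```python
from bs4 import BeautifulSoup

# Hypothetical product markup; class names are illustrative.
html = """
<div class="product">
  <h2 class="title">Widget</h2>
  <span class="price">$9.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors describe the path to the data, not its position in the file.
title = soup.select_one("div.product h2.title").get_text(strip=True)
price = soup.select_one("div.product span.price").get_text(strip=True)

print(title, price)  # Widget $9.99
```

In Scrapy the equivalent would be `response.css("div.product h2.title::text").get()`, and XPath offers the same targeting with a different syntax.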

8. It teaches you the importance of creating checks and balances

Once you have deployed the code, a single rogue piece of JavaScript might bring down your project. The website changes its HTML, and the code no longer works. Your machine loses its network connection, and if there is no check for this, you end up with several thousand empty documents. You will learn to anticipate failures, put in checks and alerts so you know immediately when something in your workflow goes wrong, and debug it.
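A simple batch validator catches most of these silent failures before they pollute your dataset. The field names and threshold below are hypothetical; tune them to your own crawl's normal volume:

```python
def validate_batch(records, min_count=1, required_fields=("title", "price")):
    """Fail fast instead of silently writing thousands of empty documents.

    Raises RuntimeError if the batch is suspiciously small or any record
    is missing a required field; hook the exception into your alerting.
    """
    if len(records) < min_count:
        raise RuntimeError(
            f"Only {len(records)} records scraped; expected >= {min_count}"
        )
    for i, rec in enumerate(records):
        missing = [f for f in required_fields if not rec.get(f)]
        if missing:
            raise RuntimeError(f"Record {i} is missing fields: {missing}")
    return True

good = [{"title": "Widget", "price": "$9.99"}]
print(validate_batch(good))  # True
```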

The author is the founder of Proxies API, a proxy rotation API service.
