Here is a list of the places where your web crawler will probably fail. Build in checks for each and expect them to happen; have parts of your scripts watch for unexpected behavior and send you alerts.
- If your web crawler is stuck, you need to know
- If your web crawler is slowing down, you need to know
- If you are having internet issues, you need to know
- If the data you are getting is weird, you need to know
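One way to catch the first two cases is a simple health check that compares a "last progress" timestamp and recent page times against a baseline. This is a minimal sketch; the thresholds, function name, and parameters are all illustrative assumptions, not a fixed recipe.

```python
import time

# Hypothetical thresholds; tune them to your crawler's normal pace.
STALL_SECONDS = 300   # no progress at all for this long -> "stuck"
SLOW_FACTOR = 3       # pages taking 3x the baseline -> "slowing down"

def check_health(last_progress_ts, recent_page_seconds, baseline_page_seconds, now=None):
    """Return a list of alert strings; an empty list means the crawler looks healthy."""
    now = time.time() if now is None else now
    alerts = []
    if now - last_progress_ts > STALL_SECONDS:
        alerts.append("crawler appears stuck: no progress in %ds" % (now - last_progress_ts))
    if recent_page_seconds > SLOW_FACTOR * baseline_page_seconds:
        alerts.append("crawler slowing down: %.1fs per page vs %.1fs baseline"
                      % (recent_page_seconds, baseline_page_seconds))
    return alerts
```

You would call this periodically (for example from a separate watchdog thread or cron job) and route any returned alerts to email or a chat webhook.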
You can also use external monitoring tools to help keep your crawlers up.
Log every step your web crawler takes and how long each step took. Build in a check that alerts you when a step takes too long, and another that alerts you when data the crawler 'knows' should be fetched is missing this time.
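The logging and checks above could look something like this. The threshold, the expected-field names, and the helper names are assumptions for illustration, not part of any particular framework.

```python
import time
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crawler")

MAX_STEP_SECONDS = 30                  # assumed alert threshold per step
EXPECTED_FIELDS = {"title", "price"}   # fields we 'know' should come back

def timed_step(name, func, *args, **kwargs):
    """Run one crawl step, log how long it took, and warn if it took too long."""
    start = time.monotonic()
    result = func(*args, **kwargs)
    elapsed = time.monotonic() - start
    log.info("step=%s took=%.2fs", name, elapsed)
    if elapsed > MAX_STEP_SECONDS:
        log.warning("ALERT step=%s exceeded %ds (took %.2fs)", name, MAX_STEP_SECONDS, elapsed)
    return result

def check_fields(record):
    """Return (and warn about) expected fields that are missing or empty."""
    missing = {f for f in EXPECTED_FIELDS if not record.get(f)}
    if missing:
        log.warning("ALERT missing expected fields: %s", sorted(missing))
    return missing
```

Wrapping each fetch and parse call in `timed_step` gives you a per-step timing log for free, and `check_fields` catches the case where a page loads fine but the data you expected is not there.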
Here is a list of the places in your code that need special attention to prevent breakages:
- When the web pages don't load
- When your internet connection is down
- When the content at the URL has moved
- When you are shown a CAPTCHA challenge
- When the web page changes its HTML, so your scraping breaks
- When fields you scrape are sometimes empty and there is no handler for that
- When the web pages take a long time to load
- When the web site has blocked you completely
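Most of the cases above can be funneled through one classifier that looks at the status code, the body, and any exception from a fetch, so each failure mode gets its own handler instead of crashing the crawler. This is a sketch; the category names and the exact status-code mapping are assumptions you would adapt to the sites you crawl.

```python
def classify_failure(status=None, body="", exc=None):
    """Map one fetch result onto the failure categories listed above."""
    if exc is not None:
        if "Timeout" in type(exc).__name__:
            return "slow_or_down"   # page slow to load, or internet down
        return "network_error"      # connection refused, DNS failure, etc.
    if status in (301, 302, 308):
        return "moved"              # content at the URL has moved
    if status in (403, 429):
        return "blocked"            # site is rate-limiting or blocking you
    if status is not None and status >= 400:
        return "load_failed"        # page did not load
    if "captcha" in body.lower():
        return "captcha"            # you are being shown a challenge
    if not body.strip():
        return "empty"              # page or field came back empty
    return "ok"
```

The HTML-changed case is the one this cannot catch directly; for that, pair it with a field check (like the expected-fields alert described earlier) so a parse that silently returns nothing still raises an alarm.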