Here is a list of the places where your web crawler will probably fail. Build in checks for each and expect them to happen; have parts of your scripts watch for unexpected behavior and send you alerts.
- If your web crawler is stuck, you need to know
- If your web crawler is slowing down, you need to know
- If you are having internet issues, you need to know
- If the data you are getting is weird, you need to know
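One way to catch the first two cases is a simple health check that compares a "last progress" timestamp and recent page times against a baseline. This is a minimal sketch; the thresholds, function name, and parameters are all illustrative assumptions, not a fixed recipe.

```python
import time

# Hypothetical thresholds; tune them to your crawler's normal pace.
STALL_SECONDS = 300   # no progress at all for this long -> "stuck"
SLOW_FACTOR = 3       # pages taking 3x the baseline -> "slowing down"

def check_health(last_progress_ts, recent_page_seconds, baseline_page_seconds, now=None):
    """Return a list of alert strings; an empty list means the crawler looks healthy."""
    now = time.time() if now is None else now
    alerts = []
    if now - last_progress_ts > STALL_SECONDS:
        alerts.append("crawler appears stuck: no progress in %ds" % (now - last_progress_ts))
    if recent_page_seconds > SLOW_FACTOR * baseline_page_seconds:
        alerts.append("crawler slowing down: %.1fs per page vs %.1fs baseline"
                      % (recent_page_seconds, baseline_page_seconds))
    return alerts
```

You would call this periodically (for example from a separate watchdog thread or cron job) and route any returned alerts to email or a chat webhook.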
You can also use external monitoring tools to help keep your crawlers up.
Log every step your web crawler takes and how long each step took. Build in a check that alerts you when a step takes too long, and another that alerts you when data the crawler 'knows' should be fetched is missing this time.
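The logging and checks above could look something like this. The threshold, the expected-field names, and the helper names are assumptions for illustration, not part of any particular framework.

```python
import time
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crawler")

MAX_STEP_SECONDS = 30                  # assumed alert threshold per step
EXPECTED_FIELDS = {"title", "price"}   # fields we 'know' should come back

def timed_step(name, func, *args, **kwargs):
    """Run one crawl step, log how long it took, and warn if it took too long."""
    start = time.monotonic()
    result = func(*args, **kwargs)
    elapsed = time.monotonic() - start
    log.info("step=%s took=%.2fs", name, elapsed)
    if elapsed > MAX_STEP_SECONDS:
        log.warning("ALERT step=%s exceeded %ds (took %.2fs)", name, MAX_STEP_SECONDS, elapsed)
    return result

def check_fields(record):
    """Return (and warn about) expected fields that are missing or empty."""
    missing = {f for f in EXPECTED_FIELDS if not record.get(f)}
    if missing:
        log.warning("ALERT missing expected fields: %s", sorted(missing))
    return missing
```

Wrapping each fetch and parse call in `timed_step` gives you a per-step timing log for free, and `check_fields` catches the case where a page loads fine but the data you expected is not there.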
Here is a list of the places in your code that need special attention to prevent breakages:
- When the web pages don't load
- When your internet connection is down
- When the content at the URL has moved
- When you are shown a CAPTCHA challenge
- When the web page changes its HTML, so your scraping breaks
- When fields you scrape are sometimes empty and there is no handler for that
- When the web pages take a long time to load
- When the web site has blocked you completely
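Most of the cases above can be funneled through one classifier that looks at the status code, the body, and any exception from a fetch, so each failure mode gets its own handler instead of crashing the crawler. This is a sketch; the category names and the exact status-code mapping are assumptions you would adapt to the sites you crawl.

```python
def classify_failure(status=None, body="", exc=None):
    """Map one fetch result onto the failure categories listed above."""
    if exc is not None:
        if "Timeout" in type(exc).__name__:
            return "slow_or_down"   # page slow to load, or internet down
        return "network_error"      # connection refused, DNS failure, etc.
    if status in (301, 302, 308):
        return "moved"              # content at the URL has moved
    if status in (403, 429):
        return "blocked"            # site is rate-limiting or blocking you
    if status is not None and status >= 400:
        return "load_failed"        # page did not load
    if "captcha" in body.lower():
        return "captcha"            # you are being shown a challenge
    if not body.strip():
        return "empty"              # page or field came back empty
    return "ok"
```

The HTML-changed case is the one this cannot catch directly; for that, pair it with a field check (like the expected-fields alert described earlier) so a parse that silently returns nothing still raises an alarm.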