May 7th, 2020
8 Ways Web Servers Can Tell Your Web Crawler Apart From Humans

While it is fun to wrangle with machine learning solutions that can solve CAPTCHAs, if you do a lot of things well, especially proxy rotation from a large enough pool of proxies, you will rarely face this problem. This is because, unless a site is behind a login screen (or a paywall), it doesn't make sense to interrupt users with random CAPTCHAs, as that amounts to a bad user experience. So CAPTCHAs mostly appear only after the web server has some way of counting how many times a particular user has accessed the service.

In our own experience with hundreds of clients at Proxies API, we have found this to be true: if you convincingly impersonate multiple humans in the other areas that give crawlers away, you rarely need to solve CAPTCHAs at all.

Namely, these areas. They are largely how web servers can tell you are not human (the sketches after the list show how a crawler typically addresses most of them):

  • Your browser's signature (User-Agent string) stays the same across multiple web scraping requests.
  • Your IP address stays the same, and so it gets blocked.
  • You change your IP to a proxy but don't change your browser's identity (User-Agent string).
  • You don't have enough proxies, so gradually all of them get blocked.
  • You are making too many requests per second.
  • You are too regular in your requests.
  • You are not sending back cookie data and other small clues.
  • You are unable to solve CAPTCHAs.
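
Here is a minimal sketch, assuming the Python requests library, of how a crawler can address the first four tells plus the cookie one: each "identity" pairs a proxy with a User-Agent string and its own cookie jar, so that changing the IP also changes the browser signature. The proxy addresses and User-Agent strings below are placeholders, not real endpoints or recommendations.

```python
import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0",
]

PROXIES = [
    "http://203.0.113.10:8080",  # placeholder addresses (documentation range)
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

# Build one "identity" per proxy: its own Session (which stores and sends
# back cookies) and a User-Agent that stays consistent for that identity.
identities = []
for proxy in PROXIES:
    session = requests.Session()
    session.headers["User-Agent"] = random.choice(USER_AGENTS)
    session.proxies = {"http": proxy, "https": proxy}
    identities.append(session)

def fetch(url):
    """Fetch a URL through a randomly chosen identity."""
    session = random.choice(identities)
    return session.get(url, timeout=15)

response = fetch("https://example.com/")
print(response.status_code, len(response.text))
```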

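For the rate and regularity tells, the usual fix is randomized pacing between requests, so the crawler is neither too fast nor suspiciously regular. The delay bounds below are illustrative assumptions, not values recommended by any particular site.

```python
import random
import time

def polite_sleep(min_delay=2.0, max_delay=8.0):
    """Wait a random, human-looking interval before the next request."""
    time.sleep(random.uniform(min_delay, max_delay))

urls = ["https://example.com/page/1", "https://example.com/page/2"]
for url in urls:
    # response = fetch(url)  # issue the request with whatever client you use
    polite_sleep()
```
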
If you are looking for a professional and scalable way to permanently solve most of the 8 "tells" I have listed above, you can have a look at Proxies API. We have a running offer of 1000 API fetches, completely free :-)
