Dodging CAPTCHAs with Python for Web Scraping

Oct 4, 2023 · 3 min read

CAPTCHAs are one of the biggest annoyances when scraping the web. Those squiggly letter and image puzzles are designed to halt bots in their tracks. But with the right approach, you can sneak past CAPTCHAs undetected.

In this article, we'll use Python libraries to automatically solve CAPTCHAs so you can focus on extracting the data you want.

The Problem with CAPTCHAs

Websites use CAPTCHAs as a way to distinguish humans from bots. CAPTCHAs act as a roadblock that can grind your scraping efforts to a frustrating halt.

Some common CAPTCHA implementations include:

  • reCAPTCHA (text and image challenges)
  • hCaptcha
  • FunCAPTCHA

These challenges are effective at stopping most scrapers in their tracks. But there are still ways around them with Python.
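
Before a challenge can be handled, a scraper has to notice which one the page is showing. The sketch below scans HTML for reCAPTCHA and hCaptcha widgets and reads their site keys. It assumes the widgets are embedded as elements with the class `g-recaptcha` or `h-captcha` and a `data-sitekey` attribute; real pages may load them differently, and FunCAPTCHA is not covered. The names `CaptchaFinder` and `find_captchas` are made up for this illustration.

```python
from html.parser import HTMLParser

# Assumed class names used by each provider's embed markup.
WIDGET_CLASSES = {"g-recaptcha": "reCAPTCHA", "h-captcha": "hCaptcha"}


class CaptchaFinder(HTMLParser):
    """Collects (provider, sitekey) pairs for CAPTCHA widgets in a page."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        for css_class, provider in WIDGET_CLASSES.items():
            if css_class in classes:
                # The site key may be missing, in which case None is stored.
                self.found.append((provider, attributes.get("data-sitekey")))


def find_captchas(html):
    """Return a list of (provider, sitekey) tuples found in the HTML."""
    finder = CaptchaFinder()
    finder.feed(html)
    return finder.found


sample = '<form><div class="h-captcha" data-sitekey="demo-key"></div></form>'
print(find_captchas(sample))
```

An empty list means no widget matched these assumptions, not that the page is guaranteed to be CAPTCHA-free.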

Automatically Solving CAPTCHAs with 2Captcha

To bypass CAPTCHAs in our scraper, we can leverage API-based CAPTCHA solving services like 2Captcha.

2Captcha has a large network of human solvers that can quickly solve CAPTCHAs via its API. This allows us to integrate real-time CAPTCHA solving into our scripts.
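
Because each challenge is handed to a remote service, an individual request may fail or take too long. One way to cope is to wrap the solving call in a small retry helper. The sketch below is not part of the 2Captcha library: `solve_with_retries` and its parameters are hypothetical, and `solve` can be any zero-argument function that returns a solution or raises an exception.

```python
import time


def solve_with_retries(solve, attempts=3, delay=5.0, sleep=time.sleep):
    """Call solve() until it returns a result or the attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return solve()
        except Exception as error:  # a real script would catch narrower errors
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                sleep(delay * attempt)
    raise RuntimeError(f"CAPTCHA not solved after {attempts} attempts") from last_error
```

With a 2Captcha client named `solver`, this could be called as `solve_with_retries(lambda: solver.recaptcha(sitekey=sitekey, url=page_url))`. The `sleep` parameter exists only so the waiting can be swapped out in tests.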

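Here's an example using 2Captcha with Python:

    import undetected_chromedriver as uc
    from selenium.webdriver.common.by import By
    from twocaptcha import TwoCaptcha

    # Launch a patched Chrome instance and open the target page
    driver = uc.Chrome()
    driver.get("https://example.com")

    # Read the reCAPTCHA site key from the widget element
    # ('captchaElement' is a placeholder id; use the one on your target page)
    sitekey = driver.find_element(By.ID, 'captchaElement').get_attribute('data-sitekey')

    # Set up the 2Captcha client
    api_key = '2CAPTCHA_API_KEY'
    solver = TwoCaptcha(api_key)

    # Ask 2Captcha to solve the challenge for this page
    print("Solving captcha...")
    response = solver.recaptcha(sitekey=sitekey, url=driver.current_url)

    # Write the returned token into the hidden response field.
    # Passing it as an argument avoids building JavaScript by string concatenation.
    driver.execute_script(
        "document.getElementById('g-recaptcha-response').innerHTML = arguments[0];",
        response['code'],
    )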
    

We use undetected-chromedriver to avoid bot detection while navigating to the target page.

2Captcha handles solving the CAPTCHA behind the scenes and returns the CAPTCHA solution code. We inject this into the page to bypass the challenge.

This allows us to scrape uninterrupted without having to manually solve endless CAPTCHAs!

Conclusion

By incorporating 2Captcha or similar services, you can easily bypass even the toughest CAPTCHAs when scraping.

Just be sure to follow a website's robots.txt directives and terms of service. Automating CAPTCHA solving can be controversial if done excessively on certain sites.
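
Checking robots.txt can itself be automated with Python's standard library. The sketch below parses a robots.txt body and asks whether a given user agent may fetch a URL. The helper name `is_allowed`, the user agent `my-scraper`, and the sample rules are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules, standing in for a file downloaded from the target site.
rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(rules, "my-scraper", "https://example.com/products"))
print(is_allowed(rules, "my-scraper", "https://example.com/private/data"))
```

For a live site, `RobotFileParser` can also download the file itself via `set_url()` and `read()`. Note that robots.txt says nothing about a site's terms of service, which still need to be read separately.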

With the techniques covered here, you'll be prepared to scrape intelligently at scale and overcome one of the top bot detection methods on the web.

Rather than building and managing your own captcha solving infrastructure, services like Proxies API handle all of this complexity for you.

With Proxies API, you make a simple API request with the target URL. It will handle:

  • Rotating proxies and IP addresses
  • Rotating user agents
  • Solving captchas
  • Running JavaScript

It then returns the rendered HTML, so there is no need to orchestrate the numerous steps required for reliable captcha solving yourself.

For example:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://targetpage.com"

This takes care of all the headaches of automation. No proxies, browsers, or captcha solving services to manage.
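
The same request can be prepared from Python. The sketch below only builds the request URL, using the endpoint and the `key`, `render`, and `url` parameters from the curl example. It assumes the API accepts a percent-encoded target URL, which matters once the target has its own query string, and that leaving out `render` disables rendering. The function name `build_request_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint taken from the curl example above.
API_ENDPOINT = "http://api.proxiesapi.com/"


def build_request_url(api_key, target_url, render=True):
    """Return the API URL that asks the service to fetch target_url."""
    params = {"key": api_key}
    if render:
        # Assumption: omitting this parameter turns JavaScript rendering off.
        params["render"] = "true"
    # urlencode percent-encodes the target so its own '?' and '&' survive.
    params["url"] = target_url
    return API_ENDPOINT + "?" + urlencode(params)


print(build_request_url("API_KEY", "https://targetpage.com/search?q=shoes&page=2"))
```

The resulting string can be passed to any HTTP client, such as `urllib.request.urlopen` or `requests.get`.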

Proxies API offers 1000 free API calls to get started. Check it out if you need to integrate robust captcha solving and proxy rotation in your projects.
