Solving CAPTCHAs with OpenAI's Whisper Using Selenium

Oct 4, 2023 · 5 min read

Have you ever gotten frustrated trying to solve those pesky CAPTCHAs? CAPTCHAs are used to tell humans and bots apart on the internet, but the audio versions can be particularly tricky to decipher. However, new developments in AI speech recognition models like OpenAI's Whisper offer an intriguing way to automate solving CAPTCHAs programmatically.

In this article, we'll explore a method for using Whisper with Selenium to solve audio CAPTCHAs automatically. We'll go through the key steps and provide code samples so you can try it yourself.

Overview

First, let's briefly introduce the main technologies we'll be using:

What is OpenAI Whisper?

Whisper is an open-source speech recognition model released by OpenAI in September 2022. OpenAI reports that it approaches human-level robustness and accuracy on English speech, and its smaller models run on modest hardware. This makes it a good fit for automated transcription tasks like solving audio CAPTCHAs.

What is Selenium?

Selenium is an open source web automation tool. It allows you to control web browsers through scripts and automate interactions like clicking buttons, filling forms, and more. We'll use it to automate fetching and submitting the CAPTCHA audio.

Step-by-Step Process

Now, let's go through the step-by-step process for using Whisper and Selenium to solve CAPTCHAs:

1. Click the CAPTCHA Checkbox

Use Selenium to locate the CAPTCHA checkbox on the web page and programmatically click it. This initiates the CAPTCHA challenge.

from selenium.webdriver.common.by import By

# Locate captcha checkbox (find_element_by_id was removed in Selenium 4.3)
checkbox = driver.find_element(By.ID, 'captchaCheckbox')

# Click checkbox
checkbox.click()

2. Click the Audio Icon

Next, locate the audio icon button and click it to request the audio version of the CAPTCHA.

# Locate audio icon (By comes from selenium.webdriver.common.by)
audio_icon = driver.find_element(By.ID, 'captchaAudioIcon')

# Click audio icon
audio_icon.click()
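The audio icon may not exist in the page the instant the checkbox is clicked, since the challenge usually loads asynchronously. Below is a minimal sketch of a polling helper that retries a lookup until it succeeds or a timeout expires. The name `wait_until` and its parameters are hypothetical and not part of Selenium; the helper only illustrates the idea.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            # Hand back whatever the condition produced (for example, an element)
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

In practice, Selenium ships the same pattern as `WebDriverWait(driver, 10).until(...)`, typically combined with the `expected_conditions` module, so the helper above is only needed if you want to see how the waiting works.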

3. Download the CAPTCHA Audio File

Selenium will allow us to retrieve the src for the CAPTCHA audio file. We can then download the file to our local system.

import requests

# Get captcha audio source (By comes from selenium.webdriver.common.by)
audio_src = driver.find_element(By.ID, 'captchaAudioSource').get_attribute('src')

# Download captcha audio file and fail loudly on an HTTP error
response = requests.get(audio_src, timeout=30)
response.raise_for_status()
with open('captcha.mp3', 'wb') as f:
    f.write(response.content)
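A download can succeed at the HTTP level and still not be audio, for example when the server answers with an HTML error page. As a rough sanity check before transcribing, the sketch below inspects the first bytes of the payload. The function name `looks_like_audio` is hypothetical, and it only recognizes MP3 and WAV headers, on the assumption that the CAPTCHA serves one of those formats.

```python
def looks_like_audio(data: bytes) -> bool:
    """Return True if the bytes start like an MP3 or WAV file."""
    # MP3 file that begins with an ID3v2 metadata tag
    if data[:3] == b"ID3":
        return True
    # Raw MPEG audio frame: the first 11 bits are all set (frame sync)
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return True
    # WAV container: "RIFF" <4-byte size> "WAVE"
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return True
    return False
```

Calling `looks_like_audio(response.content)` before writing the file lets the script stop early instead of handing Whisper something it cannot decode.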

4. Pass Audio into Whisper

Now comes the key step! We'll load the audio into Whisper to transcribe it.

import whisper

model = whisper.load_model("base")

result = model.transcribe("captcha.mp3")

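Whisper returns a dictionary whose 'text' key holds the transcription, which should contain the CAPTCHA solution. Note that Whisper relies on ffmpeg to decode audio, so ffmpeg must be installed on your system.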
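Raw transcriptions often carry a leading space, capital letters, and punctuation that a CAPTCHA input will not expect. The sketch below shows one way to clean the text before submitting it. The function `normalize_transcript` and its `digits_only` flag are hypothetical, and the digit mapping assumes a CAPTCHA that reads out single digits; adjust it to whatever the challenge actually expects.

```python
import string

# Assumption: the CAPTCHA reads out single digits, so number words map to digits
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def normalize_transcript(text: str, digits_only: bool = False) -> str:
    """Lowercase the transcript, drop punctuation, and collapse whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    if digits_only:
        # Convert number words to digits and keep only the digit tokens
        mapped = [NUMBER_WORDS.get(word, word) for word in words]
        return "".join(token for token in mapped if token.isdigit())
    return " ".join(words)
```

For example, `normalize_transcript(result['text'])` gives a plain lowercase phrase, while `digits_only=True` yields a bare digit string.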

5. Submit CAPTCHA Solution

Finally, we simply take the transcribed text from Whisper and use Selenium to fill in and submit the CAPTCHA.

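# Enter captcha solution into input (By comes from selenium.webdriver.common.by)
captcha_input = driver.find_element(By.ID, 'captchaTextbox')
# Whisper's text usually starts with a space, so strip it first
captcha_input.send_keys(result['text'].strip())

# Submit captcha
submit_btn = driver.find_element(By.ID, 'captchaSubmitButton')
submit_btn.click()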

And that's it! By leveraging Whisper and Selenium, we've successfully automated solving audio CAPTCHAs.

Full Code

For reference, here is the full code:

from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
import whisper

driver = webdriver.Chrome()

# Open the page that hosts the CAPTCHA (placeholder URL, replace with your own)
driver.get('https://example.com/')

# Click checkbox
checkbox = driver.find_element(By.ID, 'captchaCheckbox')
checkbox.click()

# Click audio icon
audio_icon = driver.find_element(By.ID, 'captchaAudioIcon')
audio_icon.click()

# Get captcha audio source
audio_src = driver.find_element(By.ID, 'captchaAudioSource').get_attribute('src')

# Download captcha audio file and fail loudly on an HTTP error
response = requests.get(audio_src, timeout=30)
response.raise_for_status()
with open('captcha.mp3', 'wb') as f:
    f.write(response.content)

# Load Whisper model
model = whisper.load_model("base")

# Transcribe audio
result = model.transcribe("captcha.mp3")

# Enter captcha solution (strip the leading space Whisper usually adds)
captcha_input = driver.find_element(By.ID, 'captchaTextbox')
captcha_input.send_keys(result['text'].strip())

# Submit
submit_btn = driver.find_element(By.ID, 'captchaSubmitButton')
submit_btn.click()

This provides an automated end-to-end pipeline for defeating audio CAPTCHAs using Whisper's state-of-the-art speech recognition capabilities paired with Selenium for automation.

Conclusion

In this article, we walked through a method that uses OpenAI's Whisper speech model together with Selenium to solve audio CAPTCHAs automatically. Whisper can often transcribe the audio quickly, though results depend on the model size and on how much noise the CAPTCHA adds to the recording.

While captcha solving bots should be used ethically, this demonstration shows the potent capabilities of modern AI speech recognition and how it may impact captcha design in the future. Automating tedious manual processes like audio captcha transcribing is just one example use case for this emerging deep learning technology.

