Scraping Hidden Emails with Python Web Scraping

Feb 3, 2024 · 2 min read

Email addresses are often hidden on websites to avoid spam bots. But sometimes you need to contact someone and can't easily find their email. This is where Python web scraping can help uncover those hidden emails.

We'll use requests to download the page, the BeautifulSoup library to parse the HTML, and the re module to find email patterns.

Inspecting the Page

First, we'll use the browser's inspector to examine the page and find potential emails. Often they're obfuscated in the HTML or JavaScript.

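import re

import requests
from bs4 import BeautifulSoup

url = "http://example.com"
# Fetch the raw HTML; fail fast on network or HTTP errors
response = requests.get(url, timeout=10)
response.raise_for_status()
# Parse the HTML so tags and attributes can be searched
soup = BeautifulSoup(response.text, 'html.parser')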

Next, search the HTML for email-like patterns: a local part, an @ symbol, and a domain name.

Writing the Regex

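We can write a regex to match common email formats. The pattern below is a widely copied approximation of the RFC 5322 address syntax. It covers the local part (plain or quoted), the @ symbol, and either a domain name such as example.com or a bracketed IP address. Its character classes are lowercase only, so it needs case-insensitive matching to catch addresses written with capital letters.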

email_regex = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""

# The pattern is lowercase only, so match case-insensitively
emails = re.findall(email_regex, response.text, re.IGNORECASE)

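This scans the raw HTML source, including attributes and inline scripts rather than just the visible text, and returns every substring that matches the email pattern. The pattern has no capturing groups, so re.findall returns each full address as a string.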
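Hidden addresses are often written as "name [at] example [dot] com" or encoded as HTML entities, and a regex pass over the raw source will miss both. Below is a minimal sketch of a normalisation step that runs before matching. The helper names deobfuscate and find_emails, the substitution patterns, and the simplified SIMPLE_EMAIL pattern are all illustrative assumptions; real pages use many other obfuscation schemes that this does not cover.

```python
import html
import re

# Simplified pattern for illustration; much looser than an RFC-style regex.
SIMPLE_EMAIL = re.compile(
    r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+", re.IGNORECASE
)

# Hypothetical substitutions for common hand-written obfuscations,
# e.g. "[at]", "(at)", "{at}" and the same for "dot".
AT_PATTERN = re.compile(r"\s*[\[\(\{]\s*at\s*[\]\)\}]\s*", re.IGNORECASE)
DOT_PATTERN = re.compile(r"\s*[\[\(\{]\s*dot\s*[\]\)\}]\s*", re.IGNORECASE)


def deobfuscate(text):
    """Undo HTML entity encoding and bracketed 'at'/'dot' replacements."""
    text = html.unescape(text)         # "&#64;" becomes "@"
    text = AT_PATTERN.sub("@", text)   # " [at] " becomes "@"
    text = DOT_PATTERN.sub(".", text)  # " (dot) " becomes "."
    return text


def find_emails(text):
    """Return unique lowercase addresses in first-seen order."""
    seen = []
    for match in SIMPLE_EMAIL.findall(deobfuscate(text)):
        address = match.lower()
        if address not in seen:
            seen.append(address)
    return seen


sample = "<p>Contact: jane [at] example [dot] com or bob&#64;example&#46;org</p>"
print(find_emails(sample))
```

In practice you would pass response.text to find_emails instead of the sample string. Loose substitutions like these can also produce false positives in ordinary prose, so it is worth reviewing the results by hand.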

Handling JavaScript

On JavaScript-heavy sites, requests only downloads the initial HTML and never executes the scripts. Selenium can drive a real browser instead, which renders the full page with its JavaScript. Emails that are loaded dynamically then show up in the rendered page source.
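The Selenium approach might look roughly like the sketch below. It assumes Selenium 4 and a local Chrome installation. The function names fetch_rendered_html and extract_emails and the simplified EMAIL_PATTERN are illustrative choices rather than part of any library. Pages that load content after a delay may also need an explicit wait before the page source is read, which this sketch leaves out.

```python
import re

# Simplified pattern for illustration; swap in a stricter regex if needed.
EMAIL_PATTERN = re.compile(
    r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+", re.IGNORECASE
)


def fetch_rendered_html(url):
    """Load the page in headless Chrome and return the DOM as HTML."""
    # Imported here so extract_emails stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source  # HTML after the browser has run scripts
    finally:
        driver.quit()  # always release the browser process


def extract_emails(page_html):
    """Return the sorted, de-duplicated lowercase addresses in the HTML."""
    return sorted({m.lower() for m in EMAIL_PATTERN.findall(page_html)})
```

Calling extract_emails(fetch_rendered_html("http://example.com")) would then run the same kind of regex search over the rendered page rather than the raw response.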

Web scraping takes trial and error. Inspecting the pages, writing regular expressions, and handling JavaScript can uncover those hidden email contacts. With some Python and perseverance, you can find what you need.
