In this comprehensive guide, I'll share everything I've learned after years of wrestling with stubborn sites. You'll walk away with battle-tested code snippets, insider knowledge on bypassing tricky bot protections, and an intuitive understanding of how to handle async JS rendering.
The Curse of Client-Side Rendering
In the early days of the web, pages were simple affairs rendered entirely by servers. Want to grab some data? Send a request and parse the HTML. Easy peasy.
Some clues this is happening:

- "View source" shows a nearly empty HTML shell stuffed with script tags, while the rendered page is full of content.
- A plain requests call returns none of the data you see in the browser.
- The browser's network tab shows XHR/fetch requests returning JSON after the initial page load.
So what do we do? We need browsers!
Browser Automation with Selenium
Before we start scraping, we need to install the key Python libraries:
pip install selenium
Just add this code to load up an instance:
from selenium import webdriver
driver = webdriver.Chrome()
Now you can find elements and extract data like usual:
from selenium.webdriver.common.by import By

links = driver.find_elements(By.TAG_NAME, "a")
for link in links:
    print(link.get_attribute("href"))

(The old find_elements_by_tag_name helpers were removed in Selenium 4 - use find_elements with a By locator instead.)
The main downside is that Selenium is slooow. Browsers are resource-hungry beasts. Performance degrades rapidly when scraping at scale.
We'll cover some optimization techniques later. But first, let's look at a lighter-weight option.
Lightweight Rendering with Requests-HTML

First install it:
pip install requests-html
from requests_html import HTMLSession
session = HTMLSession()
r = session.get("http://example.com")
r.html.render()

After calling render(), r.html will contain any dynamically rendered content!
Under the hood, Requests-HTML uses Pyppeteer to run headless Chromium. There's no visible browser window to manage, which makes it feel lighter than full Selenium - though note that the first call to render() downloads Chromium, and rendering still carries real browser overhead.
Let's look at some examples:
Waiting for Pages to Load
Sometimes you need to wait for content to load before scraping. render() accepts sleep and timeout arguments for this, and it can also execute JavaScript in the page:

r.html.render(sleep=2)  # pause two seconds after the page loads
title = r.html.render(script="() => document.title")

Any value the script returns is accessible in Python.
Crawling Paginated Content
For looping through pages of content, we can do:
for page in r.html:
    print(page.url)

It automatically finds and follows "next" links, yielding each page in turn - no button clicking involved.
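When a site paginates with a predictable query parameter instead of "next" links, you can also generate the page URLs yourself. A minimal sketch, assuming a hypothetical ?page=N scheme (the base URL and parameter name here are placeholders, not from any particular site):

```python
def page_urls(base_url, pages):
    """Build the URL for each page of a site paginated via a ?page=N query parameter."""
    return [f"{base_url}?page={n}" for n in range(1, pages + 1)]

urls = page_urls("http://example.com/products", 3)
# Fetch each URL with session.get(url) and scrape as usual.
```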
Optimization and Scaling
Once you've built an initial scraper with Requests-HTML or Selenium, it's time to optimize performance. Here are some pro tips:

- Run browsers headless and disable image loading to cut page weight.
- Reuse a single browser or session instead of launching one per page.
- Parallelize fetches across threads or processes, with a cap on concurrency.
- Cache pages you've already fetched so reruns skip them.
- Throttle politely and retry failures with exponential backoff.
Mastering these techniques takes time but pays dividends when scraping at scale.
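One pattern worth having in your toolbox when scraping at scale is retrying failed requests with exponential backoff. A minimal helper sketch - the base delay, cap, and jitter factor are arbitrary choices, not from any particular library:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-indexed): delay doubles
    each attempt, capped at `cap`, plus up to 10% random jitter so many workers
    don't retry in lockstep."""
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, delay * 0.1)

# Usage between failed requests: time.sleep(backoff_delay(attempt))
```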
Bypassing Bot Protections
An entirely separate skill is avoiding bot protections. Some tips:

- Rotate user agents and other request headers so your traffic doesn't look uniform.
- Route requests through a pool of proxies to spread out your IP footprint.
- Add randomized delays between requests to mimic human pacing.
- Respect robots.txt and rate limits - hammering a site is the fastest way to get blocked.
This cat and mouse game never ends as sites deploy new protections. But with enough tricks up your sleeve, you can scrape most pages undetected.
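One simple trick along these lines is rotating user agents so consecutive requests don't all carry the same fingerprint. A minimal sketch - the UA strings below are illustrative examples, and in practice you'd keep a larger, up-to-date pool:

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def random_headers():
    """Headers with a randomly chosen user agent, for use with requests or HTMLSession."""
    return {"User-Agent": random.choice(USER_AGENTS)}

# Usage: session.get(url, headers=random_headers())
```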
Despite our crafty techniques, some pages just aren't meant to be scraped.
Steer clear if sites:

- Explicitly forbid scraping in their terms of service.
- Gate everything behind logins, paywalls, or aggressive CAPTCHAs.
- Already offer an official API that exposes the data you need.
It's better to look for alternatives than waste time fighting an uphill battle.
Knowing when to fold 'em is an important skill!
We covered everything from picking the right tools to optimization, scalability, and sneaky anti-bot tricks.
Scraping complex sites is challenging, but extremely rewarding when you pull out hidden data through sheer persistence.
How do I handle pages that require logging in?
For pages behind a login wall, use Selenium to automate entering credentials and clicking the submit button. Save logins in a config file - don't hardcode credentials!
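A minimal sketch of the config-file approach using Python's built-in configparser - the file name, section, and field names here are arbitrary examples:

```python
import configparser

def load_credentials(path="scraper.ini"):
    """Read a username/password pair from an INI file instead of hardcoding them."""
    config = configparser.ConfigParser()
    config.read(path)
    return config["login"]["username"], config["login"]["password"]

# scraper.ini would look like:
# [login]
# username = myuser
# password = mypass
#
# Then, with Selenium (the form field names depend on the site):
# username, password = load_credentials()
# driver.find_element(By.NAME, "username").send_keys(username)
```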
How can I speed up Selenium browsers?
Tips for faster Selenium scraping include:

- Run Chrome headless so no window has to be drawn.
- Disable image loading to cut page weight.
- Reuse one driver across pages instead of restarting it.
- Use explicit waits for specific elements rather than long fixed sleeps.
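A sketch of Chrome flags commonly used for these speedups (flag availability varies across Chrome versions; --headless=new is the newer headless mode, and this list is a starting point rather than a definitive set):

```python
# Flags commonly passed to Chrome to speed up scraping runs.
FAST_CHROME_FLAGS = [
    "--headless=new",                         # no visible window
    "--disable-gpu",                          # skip GPU initialization
    "--blink-settings=imagesEnabled=false",   # don't download images
    "--no-sandbox",                           # often needed inside containers
]

# With Selenium, apply them via ChromeOptions:
# from selenium import webdriver
# options = webdriver.ChromeOptions()
# for flag in FAST_CHROME_FLAGS:
#     options.add_argument(flag)
# driver = webdriver.Chrome(options=options)
```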
What's the difference between client-side and server-side rendering?

With server-side rendering, the HTML arrives from the server already containing the content, so a plain HTTP request is enough to scrape it. With client-side rendering, the server sends a mostly empty shell and JavaScript builds the page in the browser - which is why tools like Selenium or Requests-HTML are needed.
Is it legal to scrape websites without permission?

It depends on your jurisdiction, the site's terms of service, and the kind of data involved. Publicly available data is generally lower-risk than personal or copyrighted data, but this isn't legal advice - check the site's terms, and consult a lawyer if in doubt.