Persistent Headers for Slick Web Scraping with Python Requests Sessions

Oct 22, 2023 · 4 min read

As a seasoned web scraper, I've learned that HTTP headers are the duct tape holding together your fragile scraping scripts. They identify your client, control caching, and help avoid detection. Crafting the right headers makes scraping feel effortless, while the wrong ones lead to frustration and failure.

That's why I always use Requests sessions when scraping with Python. They let you set up default headers just once, then apply them persistently across all your requests. No more repeating the same header code endlessly!

Sessions are magic, but you need to understand how they work to get the most out of them. In this guide, I'll share my hard-earned knowledge for using sessions effectively to handle headers.

Creating Persistent Scraping Sessions

First, import Requests and instantiate a session:

import requests
session = requests.Session()

From here on, we make every call through session instead of requests. Headers passed directly to a call still apply to that one request only:

session.get(url, headers=headers)

But even better, we can set default headers on the session itself. These will automatically apply to every request through the session:

session.headers.update(headers)
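To see the defaults at work without touching the network, here is a minimal sketch. The header values and the example.com URLs are placeholders of my own, not anything a particular site requires. It relies on prepare_request, which builds a request the way the session would send it, so the merged headers can be inspected.

```python
import requests

session = requests.Session()

# Placeholder defaults; swap in whatever your target site expects.
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/1.0',
    'Accept-Language': 'en-US,en;q=0.9',
})

# Build (but do not send) two requests through the same session.
first = session.prepare_request(requests.Request('GET', 'https://example.com/page/1'))
second = session.prepare_request(requests.Request('GET', 'https://example.com/page/2'))

# Both carry the session defaults without any per-call header code.
print(first.headers['User-Agent'])
print(second.headers['Accept-Language'])
```

In a real scraper you would call session.get(url) directly; the prepare step is only there to make the headers visible.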

Default Headers - The Scraping Essentials

When scraping, I like to set a few headers on every session:

User Agent - I rotate between Chrome, Firefox, Safari and Edge user agents to appear human. Many sites block or throttle unfamiliar clients, and the Requests default (python-requests/x.y.z) is an easy tell.

Accept Language - Setting languages like en-US helps sites serve the content you expect.

Referer - Populating the referer header fools sites into serving assets as if you came from a normal page view.

Accept Encoding - Requests already advertises gzip and deflate support by default and decompresses responses transparently, saving bandwidth. I only override it when a site expects a more browser-like value.

Other Headers - Depending on the site, you may need headers like Origin or Content-Type too. Host is filled in automatically from the URL, so you rarely need to set it by hand.
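One way to bundle these essentials is a small factory function. This is only a sketch: make_session is a hypothetical helper name, and the two user agent strings are illustrative samples that you would replace with current ones copied from real browsers.

```python
import random
import requests

# Hypothetical pool; in practice, copy current strings from real browsers.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) '
    'Gecko/20100101 Firefox/118.0',
]

def make_session(referer=None):
    """Return a Session preloaded with a few browser-like default headers."""
    session = requests.Session()
    session.headers.update({
        'User-Agent': random.choice(USER_AGENTS),
        'Accept-Language': 'en-US,en;q=0.9',
    })
    if referer:
        session.headers['Referer'] = referer
    return session

session = make_session(referer='https://example.com/')
print(dict(session.headers))
```

Because update() merges into the existing defaults, the Accept-Encoding value that Requests sets on every new session is still present afterwards.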

Authentication - Staying Logged In

Many sites require login before accessing content. Sessions let you log in once, then keep accessing authorized pages. For sites protected by HTTP Basic authentication, set the credentials on the session:

session.auth = ('username', 'password')

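This adds a Basic Authorization header to every request automatically. Sites with form-based logins work differently: you POST the login form through the session, and the session keeps the resulting cookies for later requests.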
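To check what a credentials tuple on session.auth actually sends, the sketch below prepares a request without sending it and compares the header against a hand-built Basic value. The credentials and the example.com URL are placeholders.

```python
import base64
import requests

session = requests.Session()
session.auth = ('username', 'password')  # placeholder credentials

# Prepare a request without sending it; example.com is a placeholder URL.
prepared = session.prepare_request(
    requests.Request('GET', 'https://example.com/private')
)

# Basic auth is just "username:password", base64-encoded.
expected = 'Basic ' + base64.b64encode(b'username:password').decode('ascii')
print(prepared.headers['Authorization'] == expected)
```

Base64 is an encoding, not encryption, so this scheme is only reasonable over HTTPS.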

For APIs, you may have to pass OAuth tokens or custom authentication headers. Sessions simplify reusing these too:

token = authenticate_and_get_token()
session.headers['Authorization'] = f'Bearer {token}'

Dynamically Changing Headers

While default headers are great for boilerplate needs, we often have to tweak headers dynamically per request.

For example, following links in sequence means updating the Referer header on each request, and randomizing user agents may mean picking a new one per request.

No problem - headers passed directly to a request will override the session defaults:

user_agent = random_user_agent()
response = session.get(url, headers={'User-Agent': user_agent})

My one caution: a per-request override applies to that request only. The session defaults stay unchanged, so you have to pass the new values each time, or update session.headers if the change should stick.
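Here is a small offline sketch of that behavior, using made-up agent names and placeholder URLs. It also shows a handy Requests detail: passing None as a header value drops that session default for a single request.

```python
import requests

session = requests.Session()
session.headers['User-Agent'] = 'DefaultAgent/1.0'  # placeholder default
session.headers['Referer'] = 'https://example.com/'

# Override one header and drop another for a single request.
one_off = session.prepare_request(requests.Request(
    'GET', 'https://example.com/item',
    headers={'User-Agent': 'OneOffAgent/2.0', 'Referer': None},
))

# The next request falls back to the session defaults.
follow_up = session.prepare_request(
    requests.Request('GET', 'https://example.com/next')
)

print(one_off.headers['User-Agent'], 'Referer' in one_off.headers)
print(follow_up.headers['User-Agent'], follow_up.headers['Referer'])
```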

Header Order Matters

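Python dicts preserve insertion order (since Python 3.7), and Requests generally sends headers in the order they sit in session.headers. The HTTP spec attaches no meaning to the order of different header fields, but some anti-bot systems fingerprint it, so it can help to match the order a real browser uses.

For example, Requests' own defaults put Accept-Encoding ahead of Accept, which is not the order most browsers use. A strict site may treat that as one more signal that you are not a browser.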

One pattern I follow is starting sessions with default headers, then appending conditional request-specific ones after.
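If you need full control over the order, the Requests documentation suggests replacing session.headers with your own ordered mapping. The sketch below assumes placeholder header values and a placeholder URL, and prepares the request offline to show the resulting order.

```python
from collections import OrderedDict
import requests

session = requests.Session()

# Replace the defaults wholesale so the order is exactly what we list here.
session.headers = OrderedDict([
    ('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/1.0'),
    ('Accept', 'text/html,application/xhtml+xml'),
    ('Accept-Language', 'en-US,en;q=0.9'),
])

# Request-specific headers are appended after the session defaults.
prepared = session.prepare_request(requests.Request(
    'GET', 'https://example.com/',
    headers={'Referer': 'https://example.com/start'},
))
print(list(prepared.headers))
```

Note that replacing session.headers throws away the Requests defaults such as Accept-Encoding, so add back any you still want, in the position you want them.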

Debugging Headers

Sometimes scraping fails mysteriously due to headers you didn't expect or realize were missing.

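To debug, attach a response hook to the session and log the headers that were sent and received. Requests only fires a response hook, but the request that went out is available as response.request:

import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('scraper')

def log_headers(response, *args, **kwargs):
    # response.request is the PreparedRequest that was actually sent
    logger.debug('request headers: %s', dict(response.request.headers))
    logger.debug('response headers: %s', dict(response.headers))

session.hooks['response'].append(log_headers)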

You can also explicitly print headers of responses and requests - super useful for debugging!

Advanced Scraping Patterns

Beyond the basics, there are powerful patterns leveraging sessions and headers for robust scraping:

  • Randomizing Values - Varying headers like user agents, referers, and device types helps avoid blocks.
  • Chain Scraping - Pass a scraped link in the referer to scrape assets linked from pages you visit.
  • Session Hooks - Trigger logic based on received headers to handle scraper traps.
  • Stateful Scraping - Track session state like logins in headers to customize scraping logic per-user.
  • And much more! Sessions are the backbone enabling advanced workflows.
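As one illustration of the session hooks pattern, here is a sketch of a response hook that reacts to a received header. The choice of status 429 with Retry-After is my own example, note_rate_limit is a hypothetical name, and the response is built by hand so the snippet runs without a network connection.

```python
import requests

throttled = []  # (url, wait) pairs where the server asked us to slow down

def note_rate_limit(response, *args, **kwargs):
    """Response hook: record the wait time when a 429 carries Retry-After."""
    retry_after = response.headers.get('Retry-After')
    if response.status_code == 429 and retry_after is not None:
        throttled.append((response.url, retry_after))
    # Returning None leaves the response unchanged.

session = requests.Session()
session.hooks['response'].append(note_rate_limit)

# Exercise the hook offline with a hand-built response (no network needed).
fake = requests.Response()
fake.status_code = 429
fake.url = 'https://example.com/busy'
fake.headers['Retry-After'] = '30'
note_rate_limit(fake)
print(throttled)
```

Once registered, the hook runs for every response that comes back through the session, so the scraping loop can check throttled and pause before continuing.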

While mastering scrapers takes experience, sessions and headers give you fine-grained control of HTTP traffic. Learn them well and you'll be able to scrape less like a skiddie and more like a pro!
