The Redirect Ninja's Guide to Mastering Python Requests

Oct 31, 2023 · 6 min read

As an experienced web scraper, you've likely encountered your fair share of redirects. While scraping a site, you request a URL only to be redirected to a different page.

At first, these redirects seemed like a nuisance. Your script would crash or get stuck in an endless loop, unable to handle the unexpected redirection.

But over time, you learned to master redirects with Python's powerful Requests module. You even picked up a few insider tricks along the way.

In this guide, I'll share everything I've learned for foolproof redirect handling. We'll start from the basics then level up to advanced techniques.

Ready to become a redirect ninja? Let's dive in.

Follow that Redirect!

The first step is understanding how to simply follow a redirect.

By default, Requests follows redirects for every method except HEAD. So if you request http://example.com and get redirected to https://www.example.com, the response you get back already contains the data from https://www.example.com.

The allow_redirects parameter controls this behavior. It defaults to True for everything except requests.head(), so passing it explicitly mostly serves to make the intent obvious:

import requests

response = requests.get('http://example.com', allow_redirects=True)
print(response.url)
# Prints the final URL, e.g. https://www.example.com/ in the scenario above

With allow_redirects=True, Requests handles redirects automatically by requesting each new URL in turn.

This applies to POST, PUT, and other request types too, with one catch: on a 301, 302, or 303 response, Requests turns a POST into a GET and drops the request body, much as browsers do. Only 307 and 308 redirects resend the original method and body.
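To see these rules without depending on a real site, here is a minimal sketch that runs a throwaway local server. The DemoHandler class and the /r302, /r307, and /echo paths are made up for this illustration; only the Requests calls are the real API.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: /r302 and /r307 redirect to /echo."""

    def _handle(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)  # drain any request body
        code = {"/r302": 302, "/r307": 307}.get(self.path)
        if code:
            self.send_response(code)
            self.send_header("Location", "/echo")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        # /echo reports the method and body it actually received.
        payload = f"{self.command} {body.decode()}".strip().encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    do_GET = do_POST = _handle

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for the local demo

followed = session.get(f"{base}/r302")                    # followed by default
via_302 = session.post(f"{base}/r302", data={"k": "v"})   # POST becomes GET
via_307 = session.post(f"{base}/r307", data={"k": "v"})   # POST is preserved

print(followed.url, [r.status_code for r in followed.history])
print(via_302.text)
print(via_307.text)

server.shutdown()
server.server_close()
```

The 302 case arrives at /echo as a bare GET, while the 307 case arrives as a POST with its form data intact.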

Smarter Sessions

But what if our script needs to make many requests? Opening and closing connections for every call is inefficient.

This is where Sessions come in handy:

session = requests.Session()

session.get('http://example.com', allow_redirects=True)
# ...make more requests...

Sessions let us persist settings like cookies and header values across requests.

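We can also configure them for redirect handling. The main knob is max_redirects, which caps how many hops the session will follow (30 by default):

session = requests.Session()
session.max_redirects = 5

response = session.get('http://example.com')

Note that current versions of Requests have no session.config dictionary and no strict_redirects setting. On 301, 302, and 303 responses the session always converts a POST to a GET. If you need POST data to survive a redirect, either have the server answer with 307 or 308, or pass allow_redirects=False and resend the POST to the Location header yourself.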
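If you do need the POST body to survive a 302, one option is to switch off automatic redirects and resend the request yourself. The sketch below assumes that is acceptable for the server you are talking to. The post_keeping_method helper, the DemoHandler server, and the /submit and /final paths are all hypothetical names for this illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urljoin

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: POST /submit answers 302 -> /final, which echoes the body."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        if self.path == "/submit":
            self.send_response(302)
            self.send_header("Location", "/final")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        payload = b"POST " + body
        self.send_response(200)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def post_keeping_method(session, url, data, max_hops=5):
    """POST to url and re-POST the same data to each redirect target."""
    for _ in range(max_hops + 1):
        response = session.post(url, data=data, allow_redirects=False)
        if not response.is_redirect:
            return response
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(response.url, response.headers["Location"])
    raise requests.TooManyRedirects(f"gave up after {max_hops} redirects")


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for the local demo
result = post_keeping_method(session, f"{base}/submit", {"k": "v"})
print(result.url, result.text)

server.shutdown()
server.server_close()
```

Re-posting goes against what a 303 is meant to signal, so a helper like this is best reserved for endpoints you know expect the data again.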

Custom Redirect Handlers

For ultimate control, we can drop down to the standard library. HTTPRedirectHandler lives in urllib.request (not urllib3), and we install it by building an opener:

import urllib.request

redirect_handler = urllib.request.HTTPRedirectHandler()
opener = urllib.request.build_opener(redirect_handler)

Now we can subclass HTTPRedirectHandler and override the redirect_request method to customize handling for different redirect codes:

class MyRedirectHandler(urllib.request.HTTPRedirectHandler):

  def redirect_request(self, req, fp, code, msg, hdrs, newurl):
    # Custom logic here, e.g. inspect code or newurl.
    # Return a Request to follow the redirect, or None to decline it.
    return super().redirect_request(req, fp, code, msg, hdrs, newurl)

opener = urllib.request.build_opener(MyRedirectHandler())
response = opener.open('http://example.com')

For example, we could change the request method on 301 redirects or refuse redirects that leave the original host. Within Requests itself, the closest equivalent is to pass allow_redirects=False and follow the Location header in your own loop.
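As one concrete use of a custom handler, the sketch below records every hop before letting the standard library follow it. RecordingRedirectHandler, DemoHandler, and the /old and /new paths are hypothetical names. The redirect_request signature is the one urllib.request.HTTPRedirectHandler defines.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: /old answers 302 -> /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the demo output quiet


class RecordingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Remember each (status code, target URL) pair, then follow as usual."""

    def __init__(self):
        self.hops = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.hops.append((code, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

handler = RecordingRedirectHandler()
# An empty ProxyHandler keeps the local demo away from any environment proxy.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}), handler)
with opener.open(f"{base}/old") as resp:
    final_url = resp.geturl()
    body = resp.read()

print(handler.hops, final_url)

server.shutdown()
server.server_close()
```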

Inspecting Redirects

Since Requests follows redirects automatically, you'll often want to inspect which redirections occurred under the hood.

The response.history attribute contains a list of any intermediate responses from redirects:

response = requests.get('http://example.com', allow_redirects=True)

print(response.history)
# [<Response [301]>]

We can also print out each previous URL and status code like so:

for resp in response.history:
  print(f"{resp.status_code} - {resp.url}")

Finally, response.url contains the final URL after any redirects, which is useful for confirming you landed where you expected.
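Putting history and url together, a small helper can return the whole chain in one list. This is just a sketch: redirect_chain is a hypothetical name, and the two hand-built Response objects below stand in for a real request so the example runs offline.

```python
import requests


def redirect_chain(response):
    """Return (status_code, url) for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Hand-built stand-ins for what requests.get() would return after one 301.
first = requests.Response()
first.status_code = 301
first.url = "http://example.com/"

final = requests.Response()
final.status_code = 200
final.url = "https://www.example.com/"
final.history = [first]

chain = redirect_chain(final)
for status, url in chain:
    print(f"{status} - {url}")
```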

Beware the Infinite Loop

One infamous redirect gotcha is the infinite redirect loop, where a URL gets caught bouncing between pages.

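Requests already guards against this: it gives up after 30 redirects by default. To fail faster, lower max_redirects on a Session (it is a session attribute, not a keyword argument to requests.get):

session = requests.Session()
session.max_redirects = 10

response = session.get('http://example.com')

If the limit is exceeded, Requests raises a TooManyRedirects exception instead of redirecting endlessly. Phew!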
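Here is a rough sketch of catching that exception, using a local server that redirects to itself forever. LoopHandler and the /loop path are invented for the demo. Session.max_redirects and requests.TooManyRedirects are the real Requests names.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class LoopHandler(BaseHTTPRequestHandler):
    """Hypothetical server whose every response redirects back to /loop."""

    def do_GET(self):
        self.send_response(302)
        self.send_header("Location", "/loop")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), LoopHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for the local demo
session.max_redirects = 5  # give up quickly instead of after the default 30

try:
    session.get(f"{base}/loop")
    outcome = "finished"
except requests.TooManyRedirects:
    outcome = "too many redirects"

print(outcome)

server.shutdown()
server.server_close()
```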

For debugging, turning on DEBUG-level logging for urllib3, the library Requests uses underneath, can help identify problematic redirect chains because each request is logged as it is made.
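A minimal way to switch that logging on, assuming the standard logging module and the logger name "urllib3", might look like this. The format string is only one reasonable choice.

```python
import logging

# Send log records to stderr with a compact format.
logging.basicConfig(format="%(name)s %(levelname)s %(message)s")

# urllib3 does the HTTP work for Requests and logs its requests at DEBUG level.
urllib3_logger = logging.getLogger("urllib3")
urllib3_logger.setLevel(logging.DEBUG)
```

With this in place, each hop of a redirect chain should show up as its own log line, which makes loops easy to spot.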

Redirect Considerations

There are a few other redirect-related factors to keep in mind:

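  • Authentication: When a redirect leads to a different host, Requests strips the Authorization header, so a 401 Unauthorized at the end of a chain may simply mean you need to authenticate against the final host. Check response.history to see where the hop happened.
  • Fragments: Fragments (#section) are never sent to the server, and no request header such as Redirect-Fragment changes that. Recent versions of Requests reattach the original fragment to the redirected URL when the Location header has none.
  • HTTPS: Redirects from HTTP to HTTPS may surface SSL certificate errors if the target's certificate can't be verified.
  • POST Data: A POST becomes a GET on 301, 302, and 303 redirects. Use 307 or 308 on the server, or follow the redirect manually, to keep the data.
  • Proxy: Pass the proxies argument (or set session.proxies) so that every hop goes through the proxy. Headers like X-Forwarded-Host don't route traffic.

Mastering these nuances takes practice, but pays dividends for reliable scraping.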
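Several of these points come down to noticing when a chain changes host or scheme. The helper below is a sketch of that check. redirect_warnings is a hypothetical name, and the hand-built Response objects stand in for a real request so the example runs offline.

```python
from urllib.parse import urlsplit

import requests


def redirect_warnings(response):
    """Flag hops that change host or fall back from https to http."""
    urls = [r.url for r in response.history] + [response.url]
    warnings = []
    for before, after in zip(urls, urls[1:]):
        a, b = urlsplit(before), urlsplit(after)
        if a.hostname != b.hostname:
            # Requests drops the Authorization header on hops like this one.
            warnings.append(f"host changed: {a.hostname} -> {b.hostname}")
        if a.scheme == "https" and b.scheme == "http":
            warnings.append(f"downgraded to http: {after}")
    return warnings


# Hand-built stand-ins for a response that was redirected once.
first = requests.Response()
first.status_code = 301
first.url = "http://example.com/"

final = requests.Response()
final.status_code = 200
final.url = "https://www.example.com/"
final.history = [first]

found = redirect_warnings(final)
print(found)
```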

    Alternative: Urllib

    Before you get too comfortable with Requests, it's worth noting the built-in urllib modules can also handle redirects.

    The urllib.request module in Python 3 includes an HTTPRedirectHandler, installed by default, that follows redirects for you.

    However, Requests tends to offer a simpler and more Pythonic interface. Unless you need ultra-fine control, Requests is likely the better choice.
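For comparison, here is a sketch of the same two behaviors in urllib.request: following a redirect by default, and declining it with a handler that returns None. DemoHandler, NoRedirect, and the /old and /new paths are invented for the demo. Everything under urllib is standard library.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: /old answers 301 -> /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(301)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the demo output quiet


class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # decline the redirect; urllib then raises HTTPError


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# Empty ProxyHandlers keep the local demo away from any environment proxy.
following = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with following.open(f"{base}/old") as resp:
    final_url = resp.geturl()

blocking = urllib.request.build_opener(urllib.request.ProxyHandler({}), NoRedirect())
try:
    blocking.open(f"{base}/old")
    blocked_status = None
except urllib.error.HTTPError as err:
    blocked_status = err.code
    blocked_location = err.headers["Location"]
    err.close()

print(final_url, blocked_status)

server.shutdown()
server.server_close()
```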

    Common Redirect Questions

    Here are some common redirect-related questions for reference:

    Q: How do I stop/prevent redirects in Requests?

    A: Set allow_redirects=False in the request.
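With redirects switched off, the redirect response itself comes back and you can read where it points. A minimal sketch, again against a made-up local server (DemoHandler, /old, and /new are hypothetical):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: /old answers 302 -> /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for the local demo
response = session.get(f"{base}/old", allow_redirects=False)

# The 302 is returned as-is; nothing was followed.
print(response.status_code, response.is_redirect, response.headers["Location"])

server.shutdown()
server.server_close()
```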

    Q: Why am I getting "Too many redirects" errors?

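    A: The server sent more redirects than the session's max_redirects limit (30 by default), usually because of a redirect loop. Step through the chain with allow_redirects=False, and check whether the site expects cookies or headers you aren't sending.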

    Q: Should I use a 307 or 308 code for temporary redirects?

    A: Use 307. It is the temporary redirect, while 308 is its permanent counterpart. Both preserve the request method and body.

    Q: How do I redirect POST data or cookies in Flask/Django?

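    A: In Flask, use redirect(url, code=307) so the client repeats the POST with its body; a 302 typically turns the POST into a GET. In Django, return a response with status 307 and a Location header. Cookies set on the redirect response are stored by the client either way.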

    Q: How do I inspect previous URLs from redirects?

    A: Check response.history for intermediate responses and status codes.

    Key Takeaways

    To recap, the key skills for redirect mastery include:

  • Knowing that Requests follows redirects by default, and toggling that with allow_redirects.
  • Using sessions and custom handlers for advanced control.
  • Inspecting response.url and response.history.
  • Lowering session.max_redirects to fail fast on redirect loops.
  • Handling authentication, fragments, POST data, and other nuances.

    Master these techniques, and no redirect will faze you again!

    For next steps, practice redirect scenarios to get hands-on experience. And feel free to reach out with any other redirect questions.

    Happy redirect ninja training!

    FAQ

    Q: How do I permanently redirect in Python?

    A: Return a 301 Moved Permanently status code, for example redirect(url, code=301) in Flask.

    Q: Why am I getting SSL errors after redirect?

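    A: Make sure the final host's certificate is valid and your CA bundle is up to date. Passing verify=False silences the error but disables certificate checking entirely, so reserve it for local debugging.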

    Q: How can I redirect from HTTP to HTTPS in Flask?

    A: Detect the scheme and redirect if needed:

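    from flask import redirect, request

    # Assumes app is your existing Flask application object.
    @app.before_request
    def before_request():
      if request.url.startswith('http://'):
        # Replace only the leading scheme, not any later occurrence.
        url = request.url.replace('http://', 'https://', 1)
        return redirect(url, code=302)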
    

    Q: How do I redirect back to a URL with query parameters?

    A: Parse the URL with urllib.parse then reconstruct it:

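    from urllib.parse import parse_qsl, urlencode, urlparse

    from flask import redirect, request

    @app.route('/redirect')
    def redirect_back():
      parts = urlparse(request.url)
      # parse_qsl turns the query string into (key, value) pairs.
      query = urlencode(parse_qsl(parts.query))
      # '/destination' is a placeholder; reusing parts.path would redirect to this same view.
      url = f"/destination?{query}"
      return redirect(url)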
    

    This preserves the original query parameters.
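The same parse-and-rebuild step works outside Flask too. Here is a standard-library sketch that keeps the existing parameters and appends new ones. The add_query_params name and the example URL are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_query_params(url, extra):
    """Return url with the pairs from extra appended to its query string."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves parameters like "?debug=".
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs.extend(extra.items())
    return urlunsplit(parts._replace(query=urlencode(pairs)))


target = add_query_params("/search?q=python&page=2", {"ref": "login"})
print(target)
```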

    Q: Can I redirect from an API view in Django?

    A: Yes, use HttpResponseRedirect:

    from django.http import HttpResponseRedirect
    
    def my_view(request):
      url = '/new/url/'
      return HttpResponseRedirect(redirect_to=url)
    

    Just return the response object.
