Managing cURL HTTP Redirects

As an experienced web scraper, I've had to untangle my fair share of HTTP redirect headaches. Redirects might seem simple on the surface, but when a site starts bouncing your scraper through multiple hops or stripping credentials unexpectedly, things get nasty quick!

In this guide, I'll draw on painful debugging wars from the scraping trenches to unlock the true arts of managing redirects with cURL. Read on, grasshopper, and you too shall gain redirect mastery!

Why Redirects Matter for Scraping

First, let's quickly define the beast we're dealing with. An HTTP redirect is when a site responds to your scraper's request by saying "Hey, don't look for that page here! It's actually moved over there now, go ask that URL instead."

Redirects come in two main flavors:

Permanent redirects (301 status code) - The content now lives permanently at a shiny new URL. Please update your bookmark!

Temporary redirects (302 status code) - The content is temporarily located elsewhere, but may move back here later. No need to update your bookmark!

Now you might be wondering...why should my scraper care about any of this redirect nonsense when fetching pages? Can't it just silently follow along to the new URLs?

Well, here are three redirect-related scraping nightmares I've run into many times:

1. Redirect Loops - The site keeps bouncing your request between URLs, eventually hitting a max redirect limit and failing. Crafty!

2. Lost Credentials - You pass logged-in cookies or headers with the initial request, but they mysteriously vanish along the way when hitting other sites. Not good!

3. Changed Request Methods - You carefully craft a POST request, but the destination never receives your data. Turns out following a redirect can quietly switch POST to GET! Sneaky!

So in summary, redirects deserve respect, grasshopper! Ignore them at your peril...

Which brings us to our friend cURL, who we shall now train in the subtle arts of HTTP redirect mastery!

cURL's Default Redirect Behavior

Let's start by getting clear on what cURL does by default when it hits redirects, since it's pretty bare bones:

It does NOT follow redirects automatically (unlike browsers)
Once following is switched on, it DOES cap the chain at 50 redirects to avoid endless loops

So for example, if I request http://small-site.com and get back a 301 Permanent Redirect to http://big-fancy-site.com, cURL will just hand me back the raw 301 response body without visiting the destination URL. Not super helpful usually!

To actually start following redirects, we need to call on the mighty -L flag, like so:

curl -L http://small-site.com

Now cURL will dutifully follow along to http://big-fancy-site.com until it finally gets back a normal 200 OK response. Much better!
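Before flipping on -L, it can help to peek at where a redirect would send you. Here is a minimal sketch using cURL's -w write-out variables. The helper name peek_redirect is my own invention for illustration, and the URL is just the placeholder from above.

```shell
# Hypothetical helper: show the status code and redirect target
# of a single request WITHOUT following the redirect.
peek_redirect() {
  # -s silences the progress meter, -o /dev/null discards the body,
  # -w prints the status code and the URL cURL would have followed.
  curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' "$1"
}

# Usage (placeholder URL from the example above):
#   peek_redirect http://small-site.com
```

If the response is not a redirect, the second field simply comes back empty.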

But this is still only the beginning of our redirect journey...

Setting Custom Redirect Limits

That default maximum of 50 redirects seems reasonable enough at first glance. But what if some playful sysadmin decides that 52 redirects is the perfect number for messing with scrapers? By default, our requests will simply fail once that magical 51st redirect appears!

Thankfully, we can take back control by customizing the max limit like so:

# Set higher max redirect limit of 100 hops
curl -L --max-redirs 100 http://small-site.com

# Allow infinite redirects (may trigger loop trap!)
curl -L --max-redirs -1 http://small-site.com

With scrapers, I'll usually start reasonably high at 100 or so, keeping infinity as a risky last resort when all else fails. Because like an accidental Zen koan, sometimes the only way out of infinite redirects...is through infinite redirects!
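When the limit does get hit, cURL exits with code 47 ("too many redirects"), which a scraper can check for instead of guessing. The sketch below wraps that check in a helper. The name fetch_with_limit and its default of 10 hops are assumptions for this example, not anything cURL ships with.

```shell
# Hypothetical helper: follow redirects up to a limit and say so
# clearly when the limit is what killed the request.
fetch_with_limit() {
  local url="$1"
  local limit="${2:-10}"   # assumed default for this sketch (cURL's own is 50)
  local status=0
  curl -sS -L --max-redirs "$limit" -o /dev/null "$url" || status=$?
  # cURL exits with code 47 when it runs past the redirect limit.
  if [ "$status" -eq 47 ]; then
    echo "gave up after $limit redirects: $url" >&2
  fi
  return "$status"
}

# Usage (placeholder URL):
#   fetch_with_limit http://small-site.com 100
```

A scraper can treat exit code 47 as "probably a loop" and move on, rather than retrying with an ever higher limit.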

Retaining the Original HTTP Method

Now pay close attention grasshopper, as this one drives developers mad!

When following redirects with -L, cURL automatically changes POST requests to GET on 301, 302 and 303 redirect responses by default (307 and 308 redirects always keep the original method). This means silently dropping the body of something like:

curl -d "my data" -L http://site.com

So instead of resending that nice POST request to the redirected URL, cURL annoyingly switches to sending an empty GET instead! Not usually what you want. Notice there is no -X POST in that command: -d already implies POST, and -X forces its method string onto every request in the chain, which only muddies the waters when redirects are involved.

We can explicitly force cURL to retain the original method with:

# Retain POST on 301 redirects
curl -L --post301 -d "data" http://site.com

# Retain POST on 302 redirects
curl -L --post302 -d "data" http://site.com

There is a matching --post303 flag for 303 responses as well.

I wasted about 3 hours debugging why my carefully crafted POST submission kept mysteriously failing before discovering this gem. So consider yourself warned!
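To save you those 3 hours, here is a rough sketch that bundles the method-preserving flags into one helper. The name post_following is hypothetical, and whether you want --post303 in there is an assumption on my part, since many sites use 303 precisely to send you on to a result page with GET.

```shell
# Hypothetical helper: POST form data and keep the POST method
# across 301, 302 and 303 redirects.
post_following() {
  local url="$1"
  local data="$2"
  # -d already implies POST, so no -X is needed.
  curl -sS -L --post301 --post302 --post303 -d "$data" "$url"
}

# Usage (placeholder URL and data):
#   post_following http://site.com "my data"
```

Drop --post303 from the helper if the site answers a form submission with a 303 pointing at a confirmation page.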

Forwarding Credentials on Redirects

Last but not least, we come to redirects stripping out credentials, yet another subtle beast to contend with.

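By default, cURL only sends credentials (a -u username and password, or in newer versions a custom Authorization header) to the host you originally requested. As soon as a redirect lands on a different host, they are dropped, and cookies from the cookie engine only go to domains they match. So internally, it may merrily bounce along via 5+ redirects before you notice: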

"Wait, where did those original credentials go? Surely the user didn't intend to anonymously fetch this last page! Sadness!"

We can tell cURL to keep forwarding credentials with the --location-trusted flag:

curl -L --location-trusted -H "Authorization: Bearer mytoken" http://site.com

But here we must tread carefully! Explicitly trusting all redirects means even questionable ones could steal our precious credentials. So only enable this when working with trusted hosts, or after carefully sniffing the full redirect path.
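One way to sniff the redirect path first is to walk the chain without any credentials and print where each hop points. This is only a sketch: the helper name list_redirect_hops is made up for illustration, and a site may of course redirect differently once you are logged in.

```shell
# Hypothetical helper: print the Location header of every hop
# in a redirect chain, sending no credentials at all.
list_redirect_hops() {
  # -D - dumps response headers to stdout, -o /dev/null discards bodies.
  # tr strips the carriage returns that HTTP header lines end with.
  curl -sS -L -D - -o /dev/null "$1" |
    tr -d '\r' |
    grep -i '^location:'
}

# Usage (placeholder URL):
#   list_redirect_hops http://site.com
```

If every Location value stays on hosts you trust, --location-trusted is a much easier call. Relative values such as /login stay on the current host.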

And with that, we come to the end of our journey into cURL redirect mastery. We've conquered changed request methods, vanishing credentials, and even infinite loops!

Frequently Asked Questions

Q: How can I log full redirect traces for testing?

A: Add the -v flag to any cURL command to output verbose logging, including each redirect hop taken to reach the final URL. Very handy for debugging!
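Raw -v output is noisy, so it can be worth filtering it down to the lines that matter for redirects. The following is a minimal sketch with a hypothetical helper name, trace_redirects. It assumes the usual -v layout where request lines start with ">" and response lines with "<".

```shell
# Hypothetical helper: boil -v output down to the request line,
# response status and Location header of every hop.
trace_redirects() {
  # -v writes its trace to stderr, so fold it into stdout before filtering.
  # Extra arguments (such as -d "data") are passed straight through to curl.
  curl -sS -v -L -o /dev/null "$@" 2>&1 |
    tr -d '\r' |
    grep -E '^(> [[:upper:]]+ |< HTTP/|< [Ll]ocation:)'
}

# Usage (placeholder URL):
#   trace_redirects -d "data" http://site.com
```

Because the request line of each hop is included, this also reveals when a POST has quietly turned into a GET.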

Q: My redirects work fine in Postman but not cURL, what gives?

A: By default, Postman will automatically follow redirects, while cURL requires explicitly enabling them with -L or --location. So double check that flag exists if things mysteriously break!

Q: What status codes should indicate a permanent vs temporary redirect?

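A: The original HTTP/1.0 standard called out 301 for permanent and 302 for temporary redirects. HTTP/1.1 later added 303 (See Other, which tells the client to fetch the target with GET) and 307 (temporary, with the request method preserved). 308 arrived after that as the permanent counterpart to 307: it means the same as 301, but the client must not change the request method. So when a POST needs to survive the hop, 307 and 308 are the codes to look for!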
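For quick reference while debugging, the common redirect codes can be folded into a tiny lookup helper. This is just a sketch: describe_redirect is a name I made up, and the one-line summaries are my own shorthand for what each code means.

```shell
# Hypothetical helper: summarize what a redirect status code means
# and what it does to the request method.
describe_redirect() {
  case "$1" in
    301) echo "301 Moved Permanently: permanent, POST may become GET" ;;
    302) echo "302 Found: temporary, POST may become GET" ;;
    303) echo "303 See Other: fetch the target with GET" ;;
    307) echo "307 Temporary Redirect: temporary, method preserved" ;;
    308) echo "308 Permanent Redirect: permanent, method preserved" ;;
    *)   echo "$1: not a redirect code covered in this sketch" ;;
  esac
}

# Usage:
#   describe_redirect 308
```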
