Speeding up Python Requests using gzip and other techniques

Dec 6, 2023 · 8 min read

Requests. Everyone uses it. The ubiquity of Python's famous HTTP library is matched only by the ubiquity of complaints about how darn slow it can be.

I learned this lesson firsthand while building a high-volume web scraper to pull data from different SaaS APIs and crunch usage analytics. When I started stress testing the scraper, request latency became a major bottleneck. Pages that took milliseconds to load in a browser started taking seconds when scraped! My scraper's throughput dropped 10x, from thousands of requests per minute to hundreds. Not great.

What gives? The core issue is that unlike a browser casually loading the occasional page, scrapers pump out requests relentlessly. That puts strain on connections, SSL handshakes, and anything that adds overhead. It didn't take long for my scraper to hit limits and pile up queues of pending requests.

I knew I had to optimize. But where to start? There are so many tweaks and techniques for speeding up Requests. I tried them all out, with some trial and error, and managed to accelerate my scraper by over 5x. In this post, I'll share everything I learned to help you squeeze out every last drop of speed from your Python requests.

Connection Management Hacks for Lower Latency

The first area I optimized was connection management. I wanted my scraper to set up connections faster and reuse them as much as possible. Here are some techniques that worked wonders:

Connection Pooling

Opening a brand new connection for every request is slow - DNS lookup, TCP setup, and SSL negotiation each add milliseconds of latency. Connection pooling is a no-brainer solution: maintain a pool of ready connections and reuse them for new requests.

The Requests Session object can handle this automatically:

import requests

session = requests.Session()

response1 = session.get('http://example.com')
response2 = session.get('http://example.com') # reuses the pooled connection!

This shaved nearly 100ms off each request in my testing. Just be sure to configure the right pool size - too small and you bottleneck on available connections; too large and you waste resources on idle sockets.
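Pool size itself is tunable by mounting a custom HTTPAdapter on the Session. A minimal sketch - the numbers here are illustrative, not recommendations (Requests defaults both settings to 10):

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()

# pool_connections: how many per-host pools to keep around;
# pool_maxsize: max sockets kept alive per host
adapter = HTTPAdapter(pool_connections=20, pool_maxsize=50)
session.mount('http://', adapter)
session.mount('https://', adapter)
```

Mounting on the 'http://' and 'https://' prefixes routes every request through the tuned adapter.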

Keep-Alive Connections

Even better than paying to reopen connections is keeping them alive between requests. With Requests this comes for free: a Session speaks HTTP/1.1 keep-alive by default, so sockets stay open between requests until the server's idle timeout closes them. Beware snippets like the following that float around the web:

session.keep_alive = 5 # seconds

That attribute is not part of the public Requests API and has no effect - idle connection lifetime is governed by the server's Keep-Alive timeout and the connection pool.

Keep-alive worked great for the APIs I was hitting frequently, but use caution - leaving lots of connections open can backfire. Test for resource exhaustion and tune your pool settings to your traffic patterns.

Async Requests

Taking advantage of async requests was a game changer for maxing out throughput. By using aiohttp I could have 10, 100, even 1000 requests in flight simultaneously. No more waiting around!

import asyncio
import aiohttp

async def main():
  async with aiohttp.ClientSession() as session:
    responses = await asyncio.gather(
      session.get('http://example.com/1'),
      session.get('http://example.com/2'),
    )
  return responses

asyncio.run(main())

Of course, this introduces complexity in managing concurrency. I ran into data races and weird debugging issues until I really understood async well. But the 5-10x throughput gains are worth it!
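A pattern that tamed most of those bugs for me is capping the number of in-flight requests with a semaphore. The sketch below fakes the fetch with asyncio.sleep so it runs offline - swap the sleep for a real session.get in practice:

```python
import asyncio

async def fetch(url, semaphore):
    async with semaphore:  # at most N coroutines pass this point at once
        await asyncio.sleep(0.01)  # stand-in for a real session.get(url)
        return f'fetched {url}'

async def main():
    semaphore = asyncio.Semaphore(10)  # cap in-flight requests at 10
    urls = [f'http://example.com/{i}' for i in range(100)]
    return await asyncio.gather(*(fetch(u, semaphore) for u in urls))

results = asyncio.run(main())
```

The cap keeps you from exhausting sockets or hammering a server while still running far more work concurrently than a sequential loop.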

Compressing Requests for Faster Data Transfer

After optimizing connections, my next focus was reducing transfer size to speed up requests. I enabled gzip compression:

import requests

headers = {'Accept-Encoding': 'gzip'} # Requests actually sends this by default

response = requests.get('https://api.example.com', headers=headers)
if response.headers.get('Content-Encoding') == 'gzip':
  data = response.content # Requests has already decompressed the gzipped body

This typically cut API response sizes by 60-70%. The smaller transfers really improved round trip times.

I could go further with more advanced compression algorithms like Zstandard and Brotli at the cost of higher CPU usage. In my testing the extra compression savings weren't worth the decompression slowdown, but your mileage may vary. Experiment to see what works best for your specific traffic.
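If you want to see what gzip buys you before touching the network, the standard library can demonstrate the ratio locally - the payload below is made up, and real savings depend on how repetitive your responses are:

```python
import gzip
import json

# A repetitive JSON payload, typical of API list responses
payload = json.dumps(
    [{'id': i, 'status': 'active', 'plan': 'pro'} for i in range(500)]
).encode()

compressed = gzip.compress(payload)
ratio = len(compressed) / len(payload)
print(f'{len(payload)} bytes -> {len(compressed)} bytes ({ratio:.0%} of original)')
```

Highly repetitive JSON like this compresses dramatically; binary or already-compressed content (images, video) will barely shrink.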

Caching and Concurrency - Friends with Benefits

Leveraging caching and concurrency together can provide compounded performance improvements.

For caching, I added a simple decorator that stored responses in a local Redis cache:

# @cached here stands in for your caching library of choice -
# e.g. a Redis-backed TTL decorator; it is not a built-in
@cached(expire=60)
def get_api_data():
  return requests.get('http://example.com')

# Subsequent calls within the 60s window fetch from cache
data1 = get_api_data()
data2 = get_api_data() # cache hit!

This eliminated duplicate requests for unchanged data. But even better, I could combine caching with async concurrency:

import asyncio
import aiohttp
from expiringdict import ExpiringDict # third-party: pip install expiringdict

cache = ExpiringDict(max_len=100, max_age_seconds=60)

async def fetch(session, url):
  if url in cache:
    return cache[url] # cache hit

  async with session.get(url) as response:
    body = await response.text()
    cache[url] = body # populate on cache miss
    return body

async def fetch_all(session, urls):
  return await asyncio.gather(*(fetch(session, url) for url in urls))

# Fetch 100 URLs concurrently
urls = [f'http://example.com/{i}' for i in range(100)]

async def main():
  async with aiohttp.ClientSession() as session:
    return await fetch_all(session, urls)

data = asyncio.run(main())

Now my scraper could rapidly check the cache for results while kicking off async fetches for missing cache entries all at the same time. The combined effect was multiplicative - caching * concurrency = blazing fast.

Fine Tuning Configuration Options

To eke out every bit of performance, I even looked at fine tuning some Requests configuration options:

  • Timeouts - Lower timeouts reduced wait time for unresponsive servers
  • SSL Verification - Disabling SSL validation skipped certificate checks
  • Pool Sizes - Higher pools prevented connection starvation

Like any optimization, more is not always better - disabling SSL verification while scraping sensitive data, for example, could be dangerous. Only tweak configurations after thorough testing and risk evaluation. Used judiciously though, even small config changes stacked up to noticeable gains.
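To make the timeout knob concrete, here's a small sketch - the (connect, read) values and the `get` helper are my own illustration, not Requests defaults:

```python
import requests

# (connect timeout, read timeout) in seconds: fail fast on unresponsive hosts
TIMEOUT = (3.05, 10)

session = requests.Session()

def get(url, **kwargs):
    # Apply the default timeout unless the caller overrides it
    kwargs.setdefault('timeout', TIMEOUT)
    return session.get(url, **kwargs)

# Example (would hit the network):
# response = get('https://api.example.com/data')
```

Requests applies no timeout at all unless you pass one, so a wrapper like this keeps a hung server from stalling your scraper indefinitely.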

Going Asynchronous for Concurrent Requests

Earlier I mentioned using asynchronous requests to increase throughput. Let's dive deeper into this technique.

The aiohttp library makes it easy to fire off multiple asynchronous requests in parallel:

import aiohttp
import asyncio

async def fetch_async(url):
  async with aiohttp.ClientSession() as session:
    async with session.get(url) as response:
      return await response.text()

urls = ['http://example.com/1', 'http://example.com/2', 'http://example.com/3']

async def main():
  tasks = []
  for url in urls:
    task = asyncio.create_task(fetch_async(url))
    tasks.append(task)

  results = await asyncio.gather(*tasks)
  return results

asyncio.run(main())
    

The key is asyncio.gather - it concurrently waits for all the async tasks to complete.

This can speed up programs with I/O bottlenecks. But it requires understanding async coding patterns. Beware of race conditions! Test rigorously.
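One safeguard worth knowing: by default, a single failing task makes asyncio.gather raise and you lose the rest of the batch. Passing return_exceptions=True hands errors back alongside results instead. A self-contained sketch with simulated fetches:

```python
import asyncio

async def fetch(url):
    # Simulated fetch: one URL fails, the others succeed
    if 'bad' in url:
        raise ValueError(f'failed: {url}')
    await asyncio.sleep(0.01)
    return f'ok: {url}'

async def main():
    urls = ['http://example.com/1', 'http://example.com/bad', 'http://example.com/2']
    return await asyncio.gather(*(fetch(u) for u in urls), return_exceptions=True)

results = asyncio.run(main())
# Separate successes from failures after the fact
errors = [r for r in results if isinstance(r, Exception)]
```

This way one bad URL doesn't sink a whole crawl batch - you can log the errors and retry them later.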

Optimizing DNS Lookup Performance

DNS lookup latency can add up when making thousands of requests.

Try switching to a faster, async DNS resolver like aiodns:

import asyncio
import socket

import aiodns # third-party: pip install aiodns

async def resolve(domain):
  resolver = aiodns.DNSResolver()
  result = await resolver.gethostbyname(domain, socket.AF_INET)
  return result.addresses
    
    

Alternatively, most operating systems let you configure a faster global DNS server.

On Linux:

/etc/resolv.conf
nameserver 1.1.1.1 # Use Cloudflare's DNS resolver

Faster DNS lookups shave precious milliseconds off each domain resolution.
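You can also cache resolutions in-process so repeated lookups for the same domain skip DNS entirely. A minimal sketch with functools.lru_cache - the resolver below is a stub so the example runs offline; in real code you'd wrap something like socket.gethostbyname (and note lru_cache entries never expire, so long-running scrapers may want a TTL cache instead):

```python
from functools import lru_cache

calls = []  # track how often the underlying resolver actually runs

def slow_resolver(domain):
    # Stand-in for a real lookup such as socket.gethostbyname(domain)
    calls.append(domain)
    return '93.184.216.34'

@lru_cache(maxsize=1024)
def resolve(domain):
    return slow_resolver(domain)

resolve('example.com')
resolve('example.com')  # served from cache; slow_resolver not called again
```

For a scraper hitting the same handful of API domains thousands of times, this reduces DNS to a one-time cost per domain.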

Leveling Up Performance Beyond Python

The techniques covered so far were Python-specific optimizations with Requests. But I found additional major speedups by looking outside Python - at the load balancer and infrastructure layers.

A simple nginx caching proxy in front of my scraper accelerated requests by handling compression so my Python code didn't have to. A CDN like Cloudflare also cached responses and reduced round trips.

Combining application improvements with infrastructure can compound performance gains. With enough attention, I was able to accelerate my web scraper by over 10x from its initial slow state.

The Never Ending Battle for Speed

Optimizing Python requests and response handling is a deep topic. This post only scratches the surface of techniques like connection management, compression, concurrency, caching, and configuration that can speed up requests.

There is no silver bullet - only incremental gains from deliberate optimizations based on your specific bottlenecks. Testing and measurement is key to know where to focus efforts.

While performance tuning is a never ending battle, the payoff is worth it. The difference between a sluggish and a blazing fast scraper directly impacts how much data you can crunch and insights you can extract. Spending time to shave off every millisecond can be a competitive advantage.

Hopefully these tips give you a head start on speeding up your Python requests. Happy optimizing.

FAQs

Q: Is gzip better than zip for compression?

A: Gzip is optimized for compressing text and JSON data. It has faster compression and decompression speeds than zip, which makes it better suited for network transfer. Zip is a more generalized format focused on file archives.
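To see the point in code: both formats compress with DEFLATE under the hood, but zip wraps the payload in archive structure. This stdlib comparison uses a made-up payload:

```python
import gzip
import io
import zipfile

data = b'{"status": "ok"} ' * 1000

# gzip: a bare DEFLATE stream with a small header
gz = gzip.compress(data)

# zip: the same DEFLATE compression, plus per-file archive metadata
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('payload.json', data)
zip_bytes = buf.getvalue()
```

For a single HTTP response body there is nothing to archive, so gzip's lighter framing is the natural fit.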

Q: How do I enable gzip compression in Nginx?

A: You can enable gzip compression in Nginx by adding the gzip directive. For example:

gzip on;
gzip_types text/plain text/html text/css application/json;

This will compress responses for the specified content types.

Q: What Python library is best for async requests?

A: The aiohttp library is a great choice for async HTTP requests and integrates well with asyncio. It provides an API similar to requests but with async functionality.

Q: Is disabling SSL verification dangerous?

A: Disabling SSL certificate validation removes the protection that certificate checks provide against man-in-the-middle attacks. Only do it after a careful risk assessment, in development environments, or when scraping sites you fully trust.

Q: How can I implement response caching in Python?

A: Libraries like requests-cache and CacheControl make it easy to add caching wrappers around Requests. Redis and Memcached can be used as the cache storage backend.

Q: What is the benefit of a CDN for APIs?

A: A CDN can cache API responses at nodes closer to users, reducing latency. It also protects against traffic spikes and provides DDoS protection. For public APIs, a CDN is highly recommended.

