Using Proxies in LWP::UserAgent in Perl in 2024

Jan 9, 2024 · 5 min read

Proxies let you route and filter web requests by sitting between your code and the target sites. They are invaluable for web scraping because rotating through them helps prevent blocks. The LWP family of Perl libraries makes working with proxies easy, but there are still pitfalls that can trip you up. In this guide, I'll share techniques and lessons learned from years of using LWP::UserAgent with proxies for large-scale web scraping.

Why Proxies Matter for Web Scraping

Scraping even a few pages from a website using the same IP over and over can get you blocked. Sites try to detect bots and scraping activity through methods like:

  • Tracking the number of requests from an IP
  • Checking if the user agent string matches a real browser's
  • Analyzing request patterns

Using proxies rotates your IP on each request, making your scraper appear as different users and preventing blocks.

Other proxy benefits include hiding your origin IP and circumventing geographic restrictions.

    Configuring LWP::UserAgent to Use a Proxy

    LWP::UserAgent is the core component in Perl for making web requests. Here's how to point it at a proxy:

    1. Use Environment Variables

    Set the HTTP_PROXY or HTTPS_PROXY env vars to your proxy URL:

    use LWP::UserAgent;

    $ENV{HTTP_PROXY} = 'http://192.168.1.42:8080';
    my $ua = LWP::UserAgent->new;
    $ua->env_proxy;
    

    This automatically picks up the proxy from the environment.

    2. The proxy() Method

    You can directly pass the proxy to use through proxy():

    my $ua = LWP::UserAgent->new;
    $ua->proxy('http', 'http://proxy.example.com:8080');
    

    Also set protocols_allowed so that requests for any other scheme are refused rather than sent without the proxy:

    $ua->protocols_allowed(['http']);
    

    3. LWP::Protocol::connect Module

    For HTTPS requests through an HTTP proxy, LWP::Protocol::connect enables tunneling via CONNECT:

    use LWP::UserAgent;
    use LWP::Protocol::connect;
    
    my $ua = LWP::UserAgent->new;
    $ua->proxy('https', 'connect://x.x.x.x:8080');
    

    This allows seamless HTTPS proxying.
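
As a minimal sketch of the proxy() and protocols_allowed settings above, the snippet below points an agent at a proxy for http and then checks that another scheme would be refused instead of going out directly. The proxy address is a placeholder, and nothing is sent over the network.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Placeholder proxy address; substitute your own.
my $proxy_url = 'http://proxy.example.com:8080';

my $ua = LWP::UserAgent->new(timeout => 30);

# Route plain http requests through the proxy.
$ua->proxy('http', $proxy_url);

# Refuse every scheme that has no proxy configured.
$ua->protocols_allowed(['http']);

# is_protocol_supported() takes protocols_allowed into account,
# so it doubles as a cheap check that https would be refused.
my $https_allowed = $ua->is_protocol_supported('https') ? 'yes' : 'no';

print "http proxy:    ", $ua->proxy('http'), "\n";
print "https allowed: $https_allowed\n";
```

Calling proxy() with only a scheme returns the URL currently configured for it, which is handy when confirming what an agent will use.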

    Proxy Authentication

    Many proxies require authentication to access them.

    To pass credentials along, provide them in the proxy URL:

    $ua->proxy('http', 'http://user:password@proxy.com:8080');
    

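    Or use $ua->credentials(), passing the proxy's host:port, the realm it announces, and the login:

    $ua->credentials("proxy.com:8080", "realm-name", "username", "password");  # "realm-name" is a placeholder
    

    LWP then answers 407 Proxy Authentication Required responses from the proxy automatically, provided the realm matches the one in the proxy's Proxy-Authenticate header.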
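
Credentials that contain characters such as @ or : would make a proxy URL like the one above ambiguous. One way around that, sketched here, is to percent-encode each part before building the URL. proxy_url_with_auth is a hypothetical helper, the host and login are placeholders, and the sketch assumes the credentials are decoded again before they are sent to the proxy.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Build an http proxy URL with percent-encoded credentials.
# (Hypothetical helper name, for illustration only.)
sub proxy_url_with_auth {
    my ($host, $port, $user, $pass) = @_;
    return sprintf 'http://%s:%s@%s:%d',
        uri_escape($user), uri_escape($pass), $host, $port;
}

my $url = proxy_url_with_auth('proxy.example.com', 8080, 'user', 'p@ss:word');
print "$url\n";
```

The result can be passed straight to $ua->proxy('http', $url).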

    Making SSL / HTTPS Requests

    Proxies carry all traffic, including encrypted HTTPS connections. For HTTPS this happens through a process called SSL tunneling.

    The client first creates a normal HTTP connection to the proxy server. An HTTPS request is then tunneled through this connection via the HTTP CONNECT method.

    The proxy connects to the destination site and then passes bytes unmodified between the client and server. The TLS handshake takes place between the client and the destination through that tunnel, so the proxy never sees the decrypted traffic and proxied HTTPS requests retain full security.
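
To make the handshake concrete, here is roughly what the client writes to the proxy before any TLS bytes flow. LWP builds this for you; connect_request is a hypothetical helper for illustration only, and real clients may add headers such as Proxy-Authorization.

```perl
use strict;
use warnings;

# Build the plain-text request a client sends to open a tunnel.
# (Hypothetical helper; LWP does this internally.)
sub connect_request {
    my ($host, $port) = @_;
    my $target = "$host:$port";
    return join "\r\n",
        "CONNECT $target HTTP/1.1",
        "Host: $target",
        '', '';    # the blank line ends the header block
}

print connect_request('example.com', 443);
```

Once the proxy replies with a 2xx status, the client starts the TLS handshake over the same socket and the proxy simply relays the bytes.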

    Certificate Verification

    By default, LWP verifies the server's certificate and hostname when tunneling HTTPS.

    To disable this (unsafe), set the variable before the agent is created:

    $ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;
    

    LWP will then connect even when the certificate is not trusted, which lets tools like mitmproxy intercept the HTTPS traffic emerging from the proxy. This is very useful for debugging the SSL tunnel.
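
If you would rather not change the setting for the whole process, the same option can be set per agent through ssl_opts. The sketch below keeps one strict agent and one relaxed agent for debugging; the variable names are illustrative. Whether verify_hostname alone turns off every certificate check depends on the installed LWP::Protocol::https version, so treat this as a starting point.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Agent for normal scraping: hostname verification stays on.
my $strict_ua = LWP::UserAgent->new(ssl_opts => { verify_hostname => 1 });

# Agent for debugging behind an intercepting tool only (unsafe).
my $debug_ua = LWP::UserAgent->new(ssl_opts => { verify_hostname => 0 });

printf "strict: %d, debug: %d\n",
    $strict_ua->ssl_opts('verify_hostname'),
    $debug_ua->ssl_opts('verify_hostname');
```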

    Subclassing Proxies

    For advanced custom proxy logic, subclass LWP::UserAgent:

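    package MyProxyAgent;

    use strict;
    use warnings;
    use parent qw(LWP::UserAgent);

    # LWP calls this when a server (401) or proxy (407) asks for credentials.
    # It must return a (username, password) list.
    sub get_basic_credentials {
      my ($self, $realm, $uri, $is_proxy) = @_;

      # Hypothetical source: the lookup is left open here, so PROXY_USER
      # and PROXY_PASS are placeholder environment variable names.
      return ($ENV{PROXY_USER}, $ENV{PROXY_PASS}) if $is_proxy;

      return $self->SUPER::get_basic_credentials($realm, $uri, $is_proxy);
    }

    sub new {
      my $class = shift;
      my $self = $class->SUPER::new(@_);
      # The scheme is 'https', with no trailing colon.
      $self->proxy('https', 'http://myproxy.com');

      return $self;
    }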
    

    Any LWP::UserAgent method can be overridden.

    Common Issues and Debugging

    Here are some common errors and fixes when using LWP proxies:

    407 Proxy Authentication Required

    Your proxy needs authentication. Supply credentials.

    Error Connecting to Proxy

  • Check firewall settings allowing outbound connections
  • Verify proxy URL and port with netcat

    HTTPS Sites Not Working

  • If ENV var PERL_LWP_SSL_VERIFY_HOSTNAME is set, try unsetting it
  • Use LWP::Protocol::connect for HTTPS over CONNECT

    Proxy Not Used for Requests

  • Ensure protocols_allowed only has proxy-able schemes
  • Watch request traffic to confirm proxy is receiving it

    Getting IP Blocked Despite Proxy

  • Rotate and increase proxies to distribute load
  • Disable custom LWP certificate verification logic
  • Use a rotating proxy service (more on this next)

    To debug issues, it's crucial to monitor and log traffic at both the proxy itself and your Perl code, ensuring requests flow through your proxy as expected.
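
One way to watch the traffic from the Perl side is LWP's handler hooks. The sketch below records each outgoing request line and each response status in an array. attach_traffic_log is a hypothetical helper name, and logging to an array rather than a file is just a choice made for this example.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Attach handlers that record each request line and response status.
# (Hypothetical helper name, for illustration only.)
sub attach_traffic_log {
    my ($ua, $log) = @_;

    $ua->add_handler(request_send => sub {
        my ($request) = @_;
        push @$log, sprintf '> %s %s', $request->method, $request->uri;
        return;    # returning nothing lets LWP send the request normally
    });

    $ua->add_handler(response_done => sub {
        my ($response) = @_;
        push @$log, sprintf '< %s', $response->status_line;
        return;
    });

    return $ua;
}

my @log;
my $ua = attach_traffic_log(LWP::UserAgent->new, \@log);
```

After a call such as $ua->get($url), printing the entries of @log shows what the agent sent and what came back, which you can compare against the proxy's own access log.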

    Best Practices

    Follow these tips for smooth sailing with LWP proxies:

    Rotate User Agents

    Rotate user agent strings on each request to appear like different browsers:

    my @agents = ('Mozilla/5.0', 'Chrome/97.0');  # ... add more strings here
    my $ua_string = $agents[rand @agents];
    $ua->agent($ua_string);
    

    Increase Proxy Ports

    Distribute load across multiple proxy server ports on the same IPs.
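
A minimal sketch of that idea cycles through a list of proxy URLs in round-robin order and re-points the agent before each request. The host name, the port range, and both helper names are assumptions made for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Return a closure that cycles through the given proxy URLs in order.
sub make_proxy_rotator {
    my @proxies = @_;
    my $next    = 0;
    return sub {
        my $url = $proxies[ $next % @proxies ];
        $next++;
        return $url;
    };
}

# Point the agent at the next proxy in the rotation and return its URL.
sub rotate_proxy {
    my ($agent, $rotator) = @_;
    my $url = $rotator->();
    $agent->proxy('http', $url);
    return $url;
}

# Hypothetical pool: one proxy host listening on three ports.
my $next_proxy = make_proxy_rotator(
    map { "http://proxy.example.com:$_" } 8001 .. 8003
);

my $ua = LWP::UserAgent->new;
for my $n (1 .. 4) {
    print "request $n via ", rotate_proxy($ua, $next_proxy), "\n";
}
```

In a real scraper the loop body would also issue the request; the fourth iteration wraps around to the first port again.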

    Custom Proxy Subclass

    For advanced use cases, build a custom subclass handling proxies your way.

    Use LWP::Protocol::connect for HTTPS

    The connect module streamlines HTTPS proxies to just work using tunneling.

    Chain Multiple Proxies

    Route through multiple proxies in sequence to obscure origin.

    Wrapping Up

    LWP provides versatile options for routing your web scraper through proxies. Configuring it correctly shields your identifying details over long scraping jobs.

    However, managing large proxy pools and staying ahead of blocks takes experience. For most developers, a managed, auto-rotating proxy service handles these complexities turnkey.

    Our Proxies API service provisions millions of residential IPs to rotate each request. With built-in User-Agent rotation, CAPTCHA solving, and custom whitelabeling, it lets you focus efforts on parsing scraped data.

    We offer 1000 free API calls to get started. Check it out at https://proxiesapi.com.

    Frequently Asked Questions

    What is the default timeout for LWP UserAgent?

    The default timeout is 180 seconds. You can change it with:

    $ua->timeout(60); # 60 seconds
    

    Where to set API timeout?

    Set timeouts on the UserAgent instance:

    my $ua = LWP::UserAgent->new(timeout => 20);
    

    What is API timeout?

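    In LWP, the timeout is an inactivity limit: a request is aborted if nothing is received on the connection for that many seconds, so a slow but steady transfer can take longer overall.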

    How do I fix a 408 request timeout?

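    A 408 is sent by the server or proxy when it gave up waiting for the request. Make sure no firewall or slow proxy is stalling the connection, retry the request, and increase the timeout configured on $ua if responses are also slow to arrive.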
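
The retry step can be sketched as a small wrapper that calls a fetcher until it gets something other than a 408. fetch_with_retry is a hypothetical helper, and the canned responses below stand in for real requests so the example runs without a network.

```perl
use strict;
use warnings;
use HTTP::Response;

# Call $fetch (a code ref returning an HTTP::Response) until it returns
# something other than 408, or until $max_tries attempts have been made.
sub fetch_with_retry {
    my ($fetch, $max_tries, $delay) = @_;
    my $response;
    for my $attempt (1 .. $max_tries) {
        $response = $fetch->();
        last unless $response->code == 408;
        # Wait a little longer after each failed attempt.
        sleep($delay * $attempt) if $delay && $attempt < $max_tries;
    }
    return $response;
}

# Stand-in fetcher for illustration: two 408s, then a 200.
my @canned = (408, 408, 200);
my $calls  = 0;
my $response = fetch_with_retry(
    sub { $calls++; HTTP::Response->new(shift @canned) },
    5,    # maximum attempts
    0,    # base delay in seconds (0 keeps the demo instant)
);

print "status ", $response->code, " after $calls calls\n";
```

With a real agent, the fetcher would be something like sub { $ua->get($url) } and the delay a second or two.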

    How do I add timeout to API?

    Pass the timeout parameter when constructing your UserAgent:

    my $ua = LWP::UserAgent->new(timeout => 60);
    

    Now API calls made with this UA will time out after 60 seconds.
