Proxies let you route and filter web requests by sitting between your code and the target sites. They are invaluable for web scraping because rotating them helps prevent blocks. The LWP family of Perl libraries makes working with proxies easy, but there are still pitfalls that can trip you up. In this guide, I'll share techniques and lessons learned from years of using LWP::UserAgent with proxies for large-scale web scraping.
Why Proxies Matter for Web Scraping
Scraping even a few pages from a website with the same IP address over and over can get you blocked, because sites actively try to detect bots and scraping activity.
Rotating proxies changes your IP address on each request, so your scraper appears to be many different users and is harder to block.
Other benefits include hiding your origin IP and circumventing geographic restrictions.
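As a rough illustration of per-request rotation, the sketch below cycles through a small pool in round-robin order and repoints the agent before each request. The proxy addresses and the helper names (next_proxy, fetch_via_next_proxy) are placeholders invented for this example, not part of LWP.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical pool; substitute the proxies you actually have access to.
my @proxy_pool = (
    'http://10.0.0.1:8080',
    'http://10.0.0.2:8080',
    'http://10.0.0.3:8080',
);
my $next_index = 0;

# Return the next proxy in round-robin order, wrapping at the end of the pool.
sub next_proxy {
    my $proxy = $proxy_pool[ $next_index % @proxy_pool ];
    $next_index++;
    return $proxy;
}

my $ua = LWP::UserAgent->new(timeout => 30);

# Point the agent at the next proxy for both schemes, then fetch the URL.
sub fetch_via_next_proxy {
    my ($url) = @_;
    $ua->proxy([ 'http', 'https' ], next_proxy());
    return $ua->get($url);
}
```

Calling fetch_via_next_proxy once per page spreads requests evenly over the pool. Round-robin is only one possible policy; random selection or weighting by proxy health would slot into next_proxy the same way.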
Configuring LWP::UserAgent to Use a Proxy
LWP::UserAgent is the core component in Perl for making web requests. Here's how to point it at a proxy:
1. Use Environment Variables
Set the HTTP_PROXY environment variable, then call env_proxy:
use LWP::UserAgent;
$ENV{HTTP_PROXY} = 'http://192.168.1.42:8080';
my $ua = LWP::UserAgent->new;
$ua->env_proxy;
This automatically picks up the proxy from the environment.
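To confirm that the environment settings were really picked up, you can ask the agent what it has configured. The sketch below is a minimal check under assumed values: the addresses are placeholders, and the existing proxy variables are cleared first so that conflicting upper- and lower-case definitions do not get in the way.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Clear any proxy settings inherited by this process, then set our own.
delete @ENV{qw(http_proxy HTTP_PROXY https_proxy HTTPS_PROXY no_proxy NO_PROXY)};
$ENV{http_proxy} = 'http://192.168.1.42:8080';   # placeholder address
$ENV{no_proxy}   = 'localhost,127.0.0.1';        # hosts that should bypass the proxy

my $ua = LWP::UserAgent->new;
$ua->env_proxy;

# Called with only a scheme, proxy() returns the current setting for it.
my $http_proxy = $ua->proxy('http');
print 'http proxy: ', $http_proxy // 'none', "\n";
```

Passing env_proxy => 1 to the constructor has the same effect as calling env_proxy on the new object.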
2. The proxy() Method
You can pass the proxy directly through the proxy() method:
my $ua = LWP::UserAgent->new;
$ua->proxy('http', 'http://proxy.example.com:8080');
Also set protocols_allowed so the agent refuses any scheme you have not configured a proxy for:
$ua->protocols_allowed(['http']);
3. LWP::Protocol::connect Module
For HTTPS requests, the LWP::Protocol::connect module enables tunneling through the proxy with the CONNECT method:
use LWP::UserAgent;
use LWP::Protocol::connect;
my $ua = LWP::UserAgent->new;
$ua->proxy('https', 'connect://x.x.x.x:8080');
This allows seamless HTTPS proxying.
Proxy Authentication
Many proxies require authentication to access them.
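To pass credentials along, provide them in the proxy URL:
$ua->proxy('http', 'http://user:password@proxy.com:8080');
Or store them with the credentials() method, which takes the proxy's host:port, the authentication realm, a username, and a password:
$ua->credentials("proxy.com:8080", "", "username", "password");
The realm must match the one the proxy announces in its 407 Proxy Authentication Required response. Note that in current LWP releases the stock get_basic_credentials() returns nothing for proxy challenges, so stored credentials only reach a proxy if a subclass hands them back. The URL form above is the more dependable option.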
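Credentials embedded in a proxy URL break if the username or password contains characters such as @ or :. A minimal sketch of escaping them first follows. The helper name, host, and credentials are made up for the example, and it assumes LWP percent-decodes the userinfo before building the Proxy-Authorization header, which is worth verifying against your installed version.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Build a proxy URL whose username and password are percent-encoded.
# (Hypothetical helper; not part of LWP.)
sub proxy_url_with_credentials {
    my ($host_port, $user, $password) = @_;
    return sprintf 'http://%s:%s@%s',
        uri_escape($user), uri_escape($password), $host_port;
}

my $proxy_url = proxy_url_with_credentials(
    'proxy.example.com:8080', 'scraper', 'p@ss:word',
);
print "$proxy_url\n";
```

The resulting string can be passed straight to $ua->proxy('http', $proxy_url).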
Making SSL / HTTPS Requests
Proxies relay all traffic, including encrypted HTTPS connections. This happens through a process called SSL tunneling.
The client first opens a normal HTTP connection to the proxy server. The HTTPS request is then tunneled through this connection via the HTTP CONNECT method.
The proxy connects to the destination site and then passes bytes unmodified between the client and server. The TLS handshake takes place end to end inside the tunnel, so proxied HTTPS requests retain full security.
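To make the tunnel setup concrete, here is a small sketch of the plain-text request a client sends to the proxy before any encryption starts. LWP does this for you; connect_request is a hypothetical helper that exists only to show the wire format.

```perl
use strict;
use warnings;

# Build the request that asks a proxy to open a tunnel to $host:$port.
sub connect_request {
    my ($host, $port) = @_;
    return join "\r\n",
        "CONNECT $host:$port HTTP/1.1",
        "Host: $host:$port",
        '', '';   # blank line terminates the header block
}

print connect_request('example.com', 443);
```

If the proxy answers with a 2xx status, the client begins the TLS handshake over the same socket and the proxy simply relays the encrypted bytes from then on.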
Certificate Verification
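By default, LWP verifies the server's certificate and hostname when tunneling HTTPS.
To disable this (unsafe), set the following before creating the user agent:
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;
LWP will now skip the hostname check (and, depending on the LWP::Protocol::https version, certificate validation as a whole), which lets tools like mitmproxy intercept the HTTPS traffic emerging from the proxy. This is very useful for debugging the SSL tunnel, but should not be left on for real scraping jobs.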
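The environment variable affects every agent in the process. As a narrower alternative, the same setting can be scoped to a single agent through ssl_opts, or verification can stay on while you trust the debugging tool's CA instead. This is a sketch under assumptions: the certificate path is a placeholder, and SSL_ca_file is passed through to the underlying SSL library.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Option 1: skip hostname verification for this one agent only (still unsafe).
my $debug_ua = LWP::UserAgent->new(
    ssl_opts => { verify_hostname => 0 },
);

# Option 2: keep verification on and trust the intercepting tool's CA.
# The path is a placeholder; point it at the CA file your tool generated.
my $trusting_ua = LWP::UserAgent->new(
    ssl_opts => {
        verify_hostname => 1,
        SSL_ca_file     => '/path/to/mitmproxy-ca-cert.pem',
    },
);

printf "debug agent verifies hostnames: %s\n",
    $debug_ua->ssl_opts('verify_hostname') ? 'yes' : 'no';
```

Option 2 is usually the safer choice for debugging, since requests to any host not signed by that CA still fail loudly.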
Subclassing Proxies
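For advanced custom proxy logic, subclass LWP::UserAgent:
package MyProxyAgent;
use strict;
use warnings;
use parent 'LWP::UserAgent';
# LWP calls this on a 401 or 407 challenge; it must return a (username, password) list.
# $proxy is true when the challenge came from a proxy.
sub get_basic_credentials {
    my ($self, $realm, $uri, $proxy) = @_;
    my ($user, $pass);
    ...;    # look up $user and $pass dynamically (left unimplemented here)
    return ($user, $pass);
}
# Configure the proxy once, at construction time.
sub new {
    my $class = shift;
    my $self = $class->SUPER::new(@_);
    $self->proxy('https', 'http://myproxy.com');    # scheme name takes no trailing colon
    return $self;
}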
Any LWP::UserAgent method can be overridden.
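As one possible way to fill in get_basic_credentials, the sketch below answers proxy challenges from environment variables and leaves every other challenge to LWP's default lookup. The class name and the PROXY_USER and PROXY_PASS variable names are assumptions made for this example, not LWP conventions.

```perl
package EnvCredentialAgent;    # hypothetical class name
use strict;
use warnings;
use parent 'LWP::UserAgent';

# LWP calls this when it receives a 401 or 407 challenge.
# $for_proxy is true when the challenge came from a proxy.
sub get_basic_credentials {
    my ($self, $realm, $uri, $for_proxy) = @_;
    if ($for_proxy) {
        # Assumed variable names; use whatever secret store you prefer.
        return ($ENV{PROXY_USER}, $ENV{PROXY_PASS});
    }
    # Origin-server challenges fall back to LWP's normal lookup.
    return $self->SUPER::get_basic_credentials($realm, $uri, $for_proxy);
}

package main;

my $ua = EnvCredentialAgent->new;
$ua->proxy('http', 'http://proxy.example.com:8080');    # placeholder proxy
```

Keeping the secrets out of the proxy URL this way also keeps them out of any log line that prints the URL.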
Common Issues and Debugging
Here are some common errors and fixes when using LWP proxies:
407 Proxy Authentication Required
Your proxy needs authentication. Supply credentials.
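Error Connecting to Proxy
Check the proxy host, port, and scheme, and confirm the proxy is reachable from your machine.
HTTPS Sites Not Working
Route HTTPS through a CONNECT tunnel, for example with LWP::Protocol::connect, and make sure https is not excluded by protocols_allowed.
Proxy Not Used for Requests
Confirm that proxy() or env_proxy was actually called on the agent making the request, and for the scheme being requested.
Getting IP Blocked Despite Proxy
Rotate across more proxies and vary your user agent strings.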
To debug issues, it's crucial to monitor and log traffic at both the proxy itself and your Perl code, ensuring requests flow through your proxy as expected.
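On the Perl side, LWP's handler hooks are one way to log what the agent is doing. The following is a minimal sketch: the proxy address is a placeholder, and the log is kept in an array purely for illustration where a real scraper would more likely write to a file.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my @traffic_log;    # illustration only; swap for a file or logger

my $ua = LWP::UserAgent->new;
$ua->proxy('http', 'http://proxy.example.com:8080');    # placeholder proxy

# Runs just before each request is sent; returning nothing lets LWP carry on.
$ua->add_handler(request_send => sub {
    my ($request, $agent) = @_;
    my $via = $agent->proxy($request->uri->scheme) // 'direct';
    push @traffic_log,
        sprintf('> %s %s via %s', $request->method, $request->uri, $via);
    return;
});

# Runs once the response has been received.
$ua->add_handler(response_done => sub {
    my ($response) = @_;
    push @traffic_log, sprintf('< %s', $response->status_line);
    return;
});
```

After a call such as $ua->get(...), each request and its status line appear in @traffic_log, which you can compare against the proxy's own access log.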
Best Practices
Follow these tips for smooth sailing with LWP proxies:
Rotate User Agents
Rotate user agent strings on each request to appear like different browsers:
my @agents = ('Mozilla/5.0', 'Chrome/97.0');  # extend with more user agent strings
my $ua_string = $agents[rand @agents];
$ua->agent($ua_string);
Increase Proxy Ports
Distribute load across multiple proxy server ports on the same IPs.
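One simple way to set this up is to expand a list of hosts and ports into a flat pool of proxy URLs. The sketch below assumes every host listens on every listed port; the addresses, port range, and build_proxy_pool helper are illustrative only.

```perl
use strict;
use warnings;

# Expand hosts and ports into a flat list of proxy URLs.
sub build_proxy_pool {
    my ($hosts, $ports) = @_;
    my @pool;
    for my $host (@$hosts) {
        push @pool, map { "http://$host:$_" } @$ports;
    }
    return @pool;
}

my @pool = build_proxy_pool([ '10.0.0.1', '10.0.0.2' ], [ 8080 .. 8082 ]);
print "$_\n" for @pool;
```

The resulting list can feed whatever rotation scheme you already use.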
Custom Proxy Subclass
For advanced use cases, build a custom subclass handling proxies your way.
Use LWP::Protocol::connect for HTTPS
The connect module uses CONNECT tunneling so that HTTPS requests work through a proxy with minimal setup.
Chain Multiple Proxies
Route through multiple proxies in sequence to obscure origin.
Wrapping Up
LWP provides versatile options for routing your web scraper through proxies. Configuring it correctly shields your identifying details over long scraping jobs.
However, managing large proxy pools and staying ahead of blocks takes experience. For most developers, a managed, auto-rotating proxy service handles these complexities for you.
Our Proxies API service provisions millions of residential IPs to rotate each request. With built-in User-Agent rotation, CAPTCHA solving, and custom whitelabeling, it lets you focus efforts on parsing scraped data.
We offer 1000 free API calls to get started. Check it out at https://proxiesapi.com.
Frequently Asked Questions
What is the default timeout for LWP UserAgent?
The default timeout is 180 seconds. You can change it with:
$ua->timeout(60); # 60 seconds
Where to set API timeout?
Set timeouts on the LWP::UserAgent object, for example in the constructor:
my $ua = LWP::UserAgent->new(timeout => 20);
What is API timeout?
The API timeout is the longest a call may wait before it is aborted. In LWP, a request is aborted if no activity is seen on the connection for that many seconds.
How do I fix a 408 request timeout?
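Increase the timeout on your user agent and retry the request. A 408 means the server gave up waiting for the request to arrive, so also check for a slow proxy or connection.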
How do I add timeout to API?
Pass the timeout option when constructing the user agent:
my $ua = LWP::UserAgent->new(timeout => 60);
Now API calls made with this UA will time out after 60 seconds.
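Timeouts through proxies are often transient, so a small retry wrapper can complement the timeout setting. The sketch below treats a 408 from the server, or an LWP-generated error whose message mentions a timeout, as retryable. The helper names are hypothetical, and the message match is an assumption, since the exact wording of LWP's internal errors varies by version and platform.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

# Decide whether a response looks like a timeout worth retrying.
sub is_timeout {
    my ($response) = @_;
    return 1 if $response->code == 408;    # server gave up waiting
    # LWP tags responses it generated itself (e.g. on client-side timeouts).
    my $internal =
        ($response->header('Client-Warning') // '') eq 'Internal response';
    return ($internal && $response->status_line =~ /tim(?:e|ed) ?out/i) ? 1 : 0;
}

# GET a URL up to $max_attempts times, pausing longer after each timeout.
sub get_with_retry {
    my ($ua, $url, $max_attempts) = @_;
    my $response;
    for my $attempt (1 .. $max_attempts) {
        $response = $ua->get($url);
        last unless is_timeout($response);
        sleep $attempt if $attempt < $max_attempts;
    }
    return $response;
}
```

A call like get_with_retry($ua, $url, 3) returns the last response either way, so the caller can still inspect the final status.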