Using Proxies with Ruby's Open-URI for Web Scraping in 2024

Jan 9, 2024 · 4 min read

For Ruby scrapers, open-uri makes fetching and parsing pages a breeze. But without proxies, your scraper can still get blocked!

Let's look at how to configure proxies for use with open-uri.

Specifying Proxies in Open-URI

The URI.open method in open-uri accepts a :proxy option to route requests through a proxy (on Ruby 3.0+ the bare Kernel#open form no longer works for URLs):

require 'open-uri'

proxy_url = 'http://proxy.example.com:8000'

URI.open('https://page.to.scrape/', proxy: proxy_url) { |f|
  # scrape page
}

We simply pass the proxy URL. Behind the scenes, open-uri configures the underlying Net::HTTP connection to use the proxy.

For authenticated proxies, we must pass the credentials separately:

proxy_url = 'http://proxy.example.com:8000'
username = 'proxyuser'
password = 'proxypass'

URI.open('https://page.to.scrape/',
  proxy_http_basic_authentication: [proxy_url, username, password]
) { |f|
  # scrape page
}

This allows using proxies that require authentication.

To disable proxies entirely, we pass a falsey value:

URI.open('https://page.to.scrape/', proxy: false) { |f|
  # scrape page with no proxy
}

When no :proxy option is given, the proxy environment variables described below apply by default.

Leveraging Environment Variables

Open-uri respects the standard proxy environment variables out of the box:

  • http_proxy
  • https_proxy
  • ftp_proxy
  • no_proxy

For example:

export http_proxy="http://proxy.example.com:8000"
ruby -ropen-uri -e "URI.open('http://page.to.scrape') {...}"

This proxies all HTTP requests made with open-uri.

Note: The capitalized versions like HTTP_PROXY work too.

no_proxy excludes specific hosts or domains from being proxied.

So environment variables provide an easy mechanism for bulk proxy configuration.
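To check which proxy these variables would select for a given URL, Ruby's URI library offers find_proxy, which accepts an environment hash (defaulting to the real ENV). A quick sketch with made-up hosts:

```ruby
require 'uri'

# Hypothetical values standing in for real environment variables.
env = {
  'http_proxy' => 'http://proxy.example.com:8000',
  'no_proxy'   => 'internal.example.com'
}

URI('http://page.to.scrape/').find_proxy(env).to_s
# => "http://proxy.example.com:8000"

URI('http://internal.example.com/').find_proxy(env)
# => nil (excluded by no_proxy)
```

This is a convenient way to debug why requests are (or aren't) being proxied before blaming the proxy itself.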

Working With HTTP Proxies

Open-uri supports HTTP proxies natively; SOCKS proxies require a third-party gem such as socksify. Several additional options help when proxying.

For example, we can configure timeouts:

URI.open('https://page.to.scrape/',
  read_timeout: 10, # seconds
  open_timeout: 5   # seconds
)

This helps avoid requests stalling indefinitely when a proxy is slow or unresponsive.

For HTTPS requests:

URI.open('https://page.to.scrape/',
  ssl_ca_cert: '/path/to/ca.cert' # custom cert
)

Passing a custom CA cert may be required if the proxy uses a self-signed certificate for traffic inspection.

Redirects can also be configured:

URI.open('http://page.to.scrape', redirect: true)  # follow redirects (default)
URI.open('http://page.to.scrape', redirect: false) # disable redirects

This helps when the proxied request receives different redirects than a direct one would.
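With redirect: false, open-uri raises OpenURI::HTTPRedirect when the server redirects, and the exception carries the target location, so you can decide whether to follow it yourself. A small sketch (the helper name and URL are placeholders of our own):

```ruby
require 'open-uri'

# Hypothetical helper: fetch without following redirects, returning
# either the body or a note about where the server tried to send us.
def fetch_no_redirect(url)
  URI.open(url, redirect: false, &:read)
rescue OpenURI::HTTPRedirect => e
  "redirected to #{e.uri}"
end

# fetch_no_redirect('http://page.to.scrape')
```

Inspecting e.uri this way makes it easy to spot a proxy injecting its own redirects.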

Authentication and Authorization with Proxies

Web scraping with proxies also needs special care around authentication and authorization.

Open-uri provides an :http_basic_authentication option:

URI.open('https://page.to.scrape',
  http_basic_authentication: ['username', 'password']
)

This handles HTTP basic auth using the target site's credentials.

For proxy authentication, we covered the proxy_http_basic_authentication option earlier; it uses the supplied proxy username and password.

A common mistake is confusing site auth with proxy auth! Be sure to use the right credentials in the right place.
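When both the proxy and the target site require credentials, the two options can be combined in a single call. A sketch with placeholder credentials:

```ruby
require 'open-uri'

proxy = 'http://proxy.example.com:8000'

options = {
  proxy_http_basic_authentication: [proxy, 'proxyuser', 'proxypass'], # proxy credentials
  http_basic_authentication: ['siteuser', 'sitepass']                 # target-site credentials
}

# URI.open('https://page.to.scrape/', **options) { |f| ... }
```

Keeping the two credential pairs in clearly named places like this makes the site-auth/proxy-auth mix-up much harder to commit.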

Advanced Proxy Usage with Open-URI

Open-URI provides some lesser-known options that ease proxy usage in your scraper.

Monitor download progress with a callback:

progress_proc = -> (size) do
  puts "Downloaded #{size} bytes"
end

URI.open(url, progress_proc: progress_proc)

Or get the total size before the download starts:

length_proc = -> (content_length) do
  puts "Total size: #{content_length} bytes"
end

URI.open(url, content_length_proc: length_proc)

Note that content_length is nil when the server sends no Content-Length header.
Streaming response bodies is also possible with a bit of work. This enables processing page content as it downloads through the proxy instead of only after the download completes.
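Since open-uri buffers the whole response before handing it to you, true chunk-by-chunk streaming means dropping down to Net::HTTP, which also accepts proxy details directly. A minimal sketch (hostnames and the helper name are our own, not part of open-uri):

```ruby
require 'net/http'

# Hypothetical helper: stream a response body chunk by chunk.
# Passing explicit nil proxy args means "no proxy"; supply real
# values (e.g. 'proxy.example.com', 8000) to route via a proxy.
def stream_body(uri, proxy_host: nil, proxy_port: nil)
  http = Net::HTTP.new(uri.host, uri.port, proxy_host, proxy_port)
  http.use_ssl = (uri.scheme == 'https')
  http.start do |conn|
    conn.request(Net::HTTP::Get.new(uri)) do |response|
      response.read_body do |chunk|
        yield chunk # process each chunk as it arrives
      end
    end
  end
end

# stream_body(URI('https://page.to.scrape/'),
#             proxy_host: 'proxy.example.com', proxy_port: 8000) { |c| puts c.bytesize }
```

This keeps memory flat on large pages, since each chunk can be parsed or written out and discarded immediately.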

We can also build failure handling by wrapping proxy requests in a retry mechanism. This lessens the impact of flaky proxies dropping requests mid-scrape.

Common Errors and Troubleshooting Tips

Here are some frequent proxy errors along with troubleshooting suggestions:

407 Proxy Authentication Required - Supply the correct proxy credentials via proxy_http_basic_authentication.

Connection reset by peer - The proxy server cut the connection mid-request. Try a different proxy or check for issues with your network/firewall.

SSL certificate verify failed - Pass the proxy's CA cert file via ssl_ca_cert so self-signed certificates validate when using HTTPS.

Net::OpenTimeout / Net::ReadTimeout - The proxy is slow or unreachable. Raise the open_timeout/read_timeout values or switch to a healthier proxy.

Too many redirects - Adjust the redirect option if the proxy alters redirects compared to direct requests.

The community-maintained net-http-cheat-sheet has more handy debugging tips relevant to proxying.
