Using Rotating Proxies in rvest in 2024

Jan 9, 2024 · 10 min read

Configuring Proxies in rvest

The key library powering most R-based crawlers and scrapers is rvest. Luckily, setting up proxies is quite straightforward because rvest works alongside httr, a widely used HTTP client package for R.

Let's walk through a simple example:

# Load libraries
library(rvest)
library(httr)

# Authenticated proxy url
my_proxy <- "http://user:pass@123.45.6.7:8080"

# Set this proxy for all requests made through httr
httr::set_config(httr::use_proxy(my_proxy))

# Test it: fetch through httr, then parse the response with rvest
webpage <- read_html(GET("http://httpbin.org/ip"))

> print(webpage)
<html>
 <head></head>
<body>
  <pre>
  {
    "origin": "123.45.6.7"
  }
  </pre>
</body>
</html>

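Setting the proxy with set_config() applies it to every request made through httr, including GET() and rvest sessions created with session(). Calling read_html() directly on a URL string goes through xml2 rather than httr and does not see this setting, so fetch the page with GET() and pass the response to read_html().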

Some key pointers on the proxy url:

  • Prefix it with http:// or https://, depending on what the proxy supports
  • Include authentication credentials in the format user:password@ if required
  • Follow with the proxy's IP address and the port through which requests get routed
  • Confirm it works by reading a page that shows the accessing IP, as we just did with httpbin.org
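
To make the URL anatomy above concrete, here is a minimal sketch that assembles a proxy url from its parts. The helper name build_proxy_url and its arguments are illustrative and not part of rvest or httr. Credentials are percent-encoded with base R's URLencode() on the assumption that they may contain reserved characters such as # or @.

```r
# Hypothetical helper (not part of rvest or httr): assemble a proxy url
# of the form scheme://user:password@host:port from its components.
build_proxy_url <- function(host, port, user = NULL, password = NULL,
                            scheme = "http") {
  # Percent-encode reserved characters so they cannot break the url
  encode <- function(x) {
    if (is.null(x) || !nzchar(x)) "" else utils::URLencode(x, reserved = TRUE)
  }

  credentials <- ""
  if (!is.null(user) && nzchar(user)) {
    credentials <- paste0(encode(user), ":", encode(password), "@")
  }

  paste0(scheme, "://", credentials, host, ":", port)
}

plain_proxy <- build_proxy_url("123.45.6.7", 8080)
auth_proxy  <- build_proxy_url("123.45.6.7", 8080, "user", "pass#8")
print(plain_proxy)
print(auth_proxy)
```

The resulting string can be passed to httr::use_proxy() in the same way as the hand-written url earlier.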

    Next let's look at some alternate ways to configure proxies in rvest beyond just using a URL directly.

    Setting Environment Variables

    An approach I prefer for easier management is specifying proxies via environment variables. This keeps proxy credentials separate and avoids exposing IPs in code - especially useful when working collaboratively.

    Here is how to configure environment variables for an authenticated HTTP proxy server:

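    # Set proxy environment variables (credentials go inside the proxy url)
    Sys.setenv(http_proxy = "http://username:password@123.45.6.7:8080")
    Sys.setenv(https_proxy = "http://username:password@123.45.6.7:8080")

    # Confirm env variables
    Sys.getenv(c("http_proxy", "https_proxy"))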
    

    The HTTP layer underneath read_html() and httr (libcurl) reads these variables automatically, so nothing needs to be passed explicitly in your scraping code.

    You can also set a no_proxy environment variable to disable proxying for certain hosts and IP ranges if needed.
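
As a rough sketch of this approach, the snippet below reads credentials from environment variables instead of hard-coding them, then sets no_proxy. The variable names PROXY_USER and PROXY_PASS and the host list are assumptions chosen for illustration. In practice they would live in a file such as .Renviron rather than being set in the script.

```r
# Assumed, illustrative variable names; normally defined outside the
# script (for example in .Renviron), set here only so the sketch runs.
Sys.setenv(PROXY_USER = "username", PROXY_PASS = "password")

# Read the credentials back without hard-coding them in scraper code
proxy_user <- Sys.getenv("PROXY_USER")
proxy_pass <- Sys.getenv("PROXY_PASS")

# Point both plain and TLS traffic at the same proxy
proxy_url <- paste0("http://", proxy_user, ":", proxy_pass, "@123.45.6.7:8080")
Sys.setenv(http_proxy = proxy_url, https_proxy = proxy_url)

# Hosts that should bypass the proxy, as a comma-separated list
Sys.setenv(no_proxy = "localhost,127.0.0.1")

print(Sys.getenv(c("http_proxy", "https_proxy", "no_proxy")))
```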

    Using Separate Proxy Lists

    Another flexible approach is maintaining a separate list of proxies and cycling through them to distribute requests.

    Let's see this in action:

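    library(dplyr)

    # List of proxies (an empty username means no authentication)
    proxies <- data.frame(
      ip = c("123.45.6.7", "98.76.54.3"),
      port = c(8080, 8080),
      username = c("", "user2"),
      password = c("", "pass#8")
    )

    # Function to retrieve proxy config
    get_proxy_config <- function() {

      # Select random proxy
      proxy <- sample_n(proxies, 1)

      # Build the credentials part, percent-encoding characters such as "#"
      credentials <- ""
      if (proxy$username != "") {
        credentials <- paste0(
          URLencode(proxy$username, reserved = TRUE), ":",
          URLencode(proxy$password, reserved = TRUE), "@")
      }

      # Build proxy url
      url <- paste0("http://", credentials, proxy$ip, ":", proxy$port)

      # Set proxy for httr
      httr::set_config(httr::use_proxy(url))

    }

    # Usage: fetch through httr so the configured proxy is applied
    get_proxy_config()
    webpage <- read_html(httr::GET("http://httpbin.org/ip"))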
    

    The key advantages are:

  • All proxies defined in one place, making updates easy
  • Mix of authenticated and non-authenticated proxies
  • Credentials kept out of the scraping logic
  • A randomly chosen proxy on every function call, spreading requests across IPs

    Now that you understand how proxies can be configured for rvest in different ways, let's move on to an even more vital technique: rotating proxies dynamically for best results.

    Why You Need to Rotate Proxies for Web Scraping

    If proxies are essential to distribute scraping traffic across multiple IPs, wouldn't sticking to just a handful be enough?

    Unfortunately, in my early experiments I found even proxy servers get blocked eventually if you hit sites hard enough!

    The solution? Rotate amongst hundreds or even thousands of proxies automatically as you gather data. Let's dissect why this is critical:

    1. Prevent Proxy Blocks

    With a pool of only 4-5 proxies, repeated hits from the same small IP list let sites profile and block them. Rotating across a much larger pool makes that profiling far harder.

    I've rotated amongst over 50K residential IPs simultaneously for months of continuous usage without tripping defenses on some large sites!

    2. Improve Success Rates

    Not all proxies work consistently, and many cheap ones fail often for various reasons. By dynamically picking only WORKING proxies for each request, success rates improve dramatically.

    3. Adjust Location Targeting

    Need to extract content from the Japan catalog of an ecommerce store? Or compare pricing across the EU?

    Rotating geo-targeted proxies lets you intelligently switch location context between requests drawing from a world-wide residential IP pool.

    Clearly, for any serious scraping activity, automatically rotating amongst a large, reliable proxy pool is almost mandatory nowadays.

    Manually checking and handling dead proxies can become nightmarish pretty quickly. Let's look at smart ways to programmatically rotate proxies using R.

    Implementing Intelligent Proxy Rotation in rvest

    Rotating proxies between requests may seem simple enough: just pick randomly from a populated list. But in high-volume scraping, dealing with failures, blocks and location priorities can get tricky fast.

    Let's construct a robust algorithm supporting dynamic rotation:

    Step 1 - Prepare Proxy List

    # Load proxy list from file (one "ip:port" entry per line)
    proxies <- read.csv("proxies.txt", header = FALSE, sep = ":",
                        col.names = c("ip", "port"),
                        stringsAsFactors = FALSE)
    

    I maintain a frequently updated proxy txt file collating free and paid proxies from multiple sources into a standard IP:Port format.

    Even if initially all are marked as working, many invariably fail when actually used. The next steps filter these out.
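
For reference, here is a small sketch of parsing that IP:Port format with base R. The function name parse_proxy_lines and the sample entries are made up for illustration. Malformed lines are dropped rather than guessed at.

```r
# Hypothetical helper: turn "ip:port" strings into a data frame,
# silently dropping lines that do not match the expected format.
parse_proxy_lines <- function(lines) {
  lines <- trimws(lines)
  # Four dot-separated numbers, a colon, then a numeric port
  pattern <- "^([0-9]{1,3}(\\.[0-9]{1,3}){3}):([0-9]{1,5})$"
  valid <- lines[grepl(pattern, lines)]
  parts <- strsplit(valid, ":", fixed = TRUE)
  data.frame(
    ip = vapply(parts, `[`, character(1), 1),
    port = as.integer(vapply(parts, `[`, character(1), 2)),
    stringsAsFactors = FALSE
  )
}

# Sample input; in practice this would come from readLines("proxies.txt")
sample_lines <- c("123.45.6.7:8080", "98.76.54.3:3128", "not a proxy", "")
parsed <- parse_proxy_lines(sample_lines)
print(parsed)
```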

    Step 2 - Validate List

    library(httr)
    library(dplyr)

    # Mark every proxy as untested before validation
    proxies$status <- rep("untested", nrow(proxies))

    # Helper to test one proxy: "working" on a 200 response, else "failed"
    check_proxy <- function(ip, port) {
      tryCatch({

        # Send the test request through this proxy only
        resp <- GET("http://httpbin.org/ip",
                    use_proxy(ip, port),
                    timeout(10))

        # Test if gets 200 status
        if (status_code(resp) == 200) "working" else "failed"

      }, error = function(err) {
        # Tag as failed on any errors (timeouts, refused connections)
        "failed"
      })
    }

    # Test each proxy and record the result in the data frame
    for (i in seq_len(nrow(proxies))) {
      proxies$status[i] <- check_proxy(proxies$ip[i], proxies$port[i])
    }

    # Keep the full list, including failed proxies, for later re-checks
    all_proxies <- proxies

    # Filter working proxies
    proxies <- proxies %>%
      filter(status == "working")
    

    This loops through each proxy, attempts a test request, and records the result in a status column. Finally, we filter down to the working proxies for further use.
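
A 200 status alone does not prove that traffic actually left through the proxy. One possible extra check, sketched below under the assumption that the test endpoint returns JSON of the form {"origin": "<ip>"} as httpbin.org/ip does, compares the reported origin with the proxy's IP. The function name origin_matches_proxy is hypothetical, and the jsonlite package is assumed to be installed.

```r
library(jsonlite)

# Hypothetical helper: TRUE when the IP reported by the test endpoint
# matches the proxy we meant to use.
origin_matches_proxy <- function(body_text, proxy_ip) {
  origin <- tryCatch(fromJSON(body_text)$origin, error = function(e) NULL)
  if (is.null(origin)) return(FALSE)
  # The origin field may list several comma-separated addresses
  reported <- trimws(strsplit(origin, ",", fixed = TRUE)[[1]])
  proxy_ip %in% reported
}

# In a real check, body_text would come from
# httr::content(resp, as = "text") after the test request.
ok  <- origin_matches_proxy('{"origin": "123.45.6.7"}', "123.45.6.7")
bad <- origin_matches_proxy('{"origin": "203.0.113.9"}', "123.45.6.7")
print(c(ok, bad))
```

This comparison only makes sense when the proxy's exit IP is the same address you connect to. Gateway-style rotating proxies usually exit from a different IP, in which case comparing against your own public IP is the more useful test.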

    Step 3 - Pick Proxy Randomly

    Now that we have a sanity checked list of active proxies, we can integrate it into our main scraper code:

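    # Select a random proxy and configure httr to use it
    get_random_proxy <- function() {

      proxy <- dplyr::sample_n(proxies, 1)
      proxy_url <- paste0("http://", proxy$ip, ":", proxy$port) # Construct proxy url

      httr::set_config(httr::use_proxy(proxy_url)) # Configure

    }

    # Usage: fetch through httr so the configured proxy is applied
    get_random_proxy()
    webpage <- read_html(httr::GET("http://httpbin.org/ip"))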
    

    Call get_random_proxy() before each request and every request goes out through a freshly picked proxy IP, giving you effortless rotation!
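
One way to combine selection and retrying is sketched below: each attempt picks a proxy that has not been tried yet and moves on to another if the request fails. The names fetch_with_rotation and fetch_fn are illustrative. fetch_fn stands in for whatever performs the request and is passed in so the rotation logic can be exercised without network access.

```r
# Hypothetical helper: try up to max_tries different proxies for one url.
# fetch_fn(url, proxy_url) must return a result or signal an error.
fetch_with_rotation <- function(url, proxies, fetch_fn, max_tries = 3) {
  # Shuffle row indices so each attempt uses a different random proxy
  order_to_try <- head(sample(nrow(proxies)), max_tries)

  for (i in order_to_try) {
    proxy_url <- paste0("http://", proxies$ip[i], ":", proxies$port[i])
    result <- tryCatch(fetch_fn(url, proxy_url), error = function(e) NULL)
    if (!is.null(result)) {
      return(list(result = result, proxy = proxy_url))
    }
  }
  stop("All attempted proxies failed for ", url)
}

# Stand-in fetcher for demonstration: only the second proxy "works"
fake_fetch <- function(url, proxy_url) {
  if (proxy_url != "http://98.76.54.3:8080") stop("connection refused")
  paste("fetched", url, "via", proxy_url)
}

demo_proxies <- data.frame(ip = c("123.45.6.7", "98.76.54.3"),
                           port = c(8080, 8080))
out <- fetch_with_rotation("http://httpbin.org/ip", demo_proxies, fake_fetch)
print(out$proxy)
```

In a real scraper, fetch_fn could be a small wrapper such as function(url, proxy_url) httr::GET(url, httr::use_proxy(proxy_url), httr::timeout(10)), which applies the proxy per request instead of changing the global httr config.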

    Step 4 - Recycle Failed Proxies

    To make full use of resources, we should re-check failed proxies after some time as they could come back online.

    Adding a scheduled job that re-validates failed proxies and promotes recovered ones back to the working pool completes a robust IP rotating system for rvest.

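    # Re-check failed proxies and promote the ones that have recovered.
    # Expects the full proxy list, including entries with status "failed".
    retry_failed_proxies <- function(proxies) {

      failed <- which(proxies$status == "failed")

      for (i in failed) {

        # Re-run the validation request
        recovered <- tryCatch({
          resp <- httr::GET("http://httpbin.org/ip",
                            httr::use_proxy(proxies$ip[i], proxies$port[i]),
                            httr::timeout(10))
          httr::status_code(resp) == 200
        }, error = function(err) FALSE)

        # Add back working proxies
        if (recovered) proxies$status[i] <- "working"

      }

      proxies
    }

    # Schedule job: run this from an external scheduler such as cron,
    # e.g. with the cron expression "0 */4 * * *" (every 4 hrs).
    # all_proxies is the unfiltered list, with statuses, kept from validation.
    Sys.setenv(TZ = "UTC")
    all_proxies <- retry_failed_proxies(all_proxies)
    proxies <- dplyr::filter(all_proxies, status == "working")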
    

    This revolutionized stability for my long running commercial web scrapers!
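
To avoid hammering proxies that have only just failed, the re-check can be limited to entries that have been out of the pool for a while. The sketch below assumes a failed_at column holding the time of the last failure, which is not part of the code above, and the helper name due_for_retry is made up for illustration.

```r
# Hypothetical helper: indices of failed proxies whose last failure is
# at least cooldown_hours old and are therefore worth re-checking.
due_for_retry <- function(proxies, now = Sys.time(), cooldown_hours = 4) {
  age_hours <- as.numeric(difftime(now, proxies$failed_at, units = "hours"))
  which(proxies$status == "failed" & age_hours >= cooldown_hours)
}

now <- as.POSIXct("2024-01-09 12:00:00", tz = "UTC")
demo_proxies <- data.frame(
  ip = c("123.45.6.7", "98.76.54.3", "11.22.33.44"),
  status = c("failed", "failed", "working"),
  failed_at = as.POSIXct(c("2024-01-09 06:00:00",   # failed 6 hours ago
                           "2024-01-09 11:00:00",   # failed 1 hour ago
                           NA), tz = "UTC"),        # never failed
  stringsAsFactors = FALSE
)

retry_rows <- due_for_retry(demo_proxies, now = now)
print(demo_proxies$ip[retry_rows])
```

Only the rows returned here would then be passed through the validation request again.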

    Now that you understand how to configure and rotate proxies for optimal performance, let's go through some pro-tips and best practices worth implementing.

    Pro Proxy Tips for Expert-Level Web Scraping in R

    Over the years, I've learned many small proxy nuances through trial & error which have levelled up my scraping capabilities significantly.

    Here are some pro suggestions worth incorporating:

    Filter Proxy Locations

    Certain sites serve customized homepage content and product catalogs based on visitor geo-location. It's invaluable then to filter proxy lists by country for consistency:

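    # Helper function to select a proxy for a given country
    # (assumes the proxies data frame has a country column of country codes)
    get_country_proxy <- function(country_code) {

      # Filter proxy dataframe by country code
      country_proxies <- dplyr::filter(proxies, country == country_code)

      # Other steps same as random selection

    }

    # Usage: fetch through httr so the configured proxy is applied
    get_country_proxy("US")
    webpage <- read_html(httr::GET("http://www.xyzshop.com/"))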
    

    This returns the US version of the page consistently, even as the individual proxy changes between requests.
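
A fuller version of the country helper might look like the sketch below. It assumes the proxy list carries a country column with two-letter codes, which the earlier examples do not include, and it stops with an error rather than silently falling back to another region. The name pick_country_proxy is illustrative.

```r
# Hypothetical helper: pick one random proxy url for a given country.
# Assumes proxies has ip, port and country columns.
pick_country_proxy <- function(proxies, country_code) {
  candidates <- proxies[proxies$country == country_code, , drop = FALSE]
  if (nrow(candidates) == 0) {
    stop("No proxies available for country: ", country_code)
  }
  chosen <- candidates[sample(nrow(candidates), 1), ]
  paste0("http://", chosen$ip, ":", chosen$port)
}

demo_proxies <- data.frame(
  ip = c("123.45.6.7", "98.76.54.3", "11.22.33.44"),
  port = c(8080, 3128, 8080),
  country = c("US", "JP", "US"),
  stringsAsFactors = FALSE
)

us_proxy <- pick_country_proxy(demo_proxies, "US")
print(us_proxy)
```

The returned url can be handed to httr::use_proxy() just like the randomly selected proxy earlier.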

    Authenticate through Proxy

    Some sites require a login before you can scrape data. Route the login request through an authenticated proxy and reuse the resulting session cookies:

    # Construct authenticated proxy url (replace IP and port with real values)
    proxy <- "http://user:pass@IP:port"

    # Login through the proxy, adding user-agent headers for stealth
    login <- POST(url = "http://www.website.com/login",
                  body = list(email = "user@email.com", password = "****"),
                  user_agent("Mozilla/5.0"),
                  add_headers(Accept = "text/html"),
                  use_proxy(proxy)
    )

    # Store + reuse cookies for scraping
    ck <- cookies(login)
    session_cookies <- set_cookies(.cookies = setNames(ck$value, ck$name))

    # Scraping steps... (url is the page to scrape after logging in)
    GET(url,
        user_agent("Mozilla/5.0"),
        add_headers(Accept = "text/html"),
        use_proxy(proxy),
        session_cookies)
    

    This logs in just once, saving the authenticated session for subsequent scraping requests.

    Integrate Selenium for JS Sites

    An extreme challenge I faced was scraping complex JavaScript-rendered sites like Facebook and LinkedIn to extract user profile info.

    The RSelenium package, which provides R bindings for Selenium WebDriver, came to the rescue:

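    library(RSelenium)

    # Proxy address in host:port form (placeholder values)
    proxy <- "123.45.6.7:8080"

    # Launch headless Chrome routed through the proxy.
    # Newer Selenium/chromedriver versions may expect the key
    # "goog:chromeOptions" instead of chromeOptions.
    caps <- list(chromeOptions = list(
      args = list("--headless", paste0("--proxy-server=http://", proxy))
    ))
    remDr <- remoteDriver(port = 4445L, browserName = "chrome",
                          extraCapabilities = caps)
    remDr$open()

    # Navigate to site
    remDr$navigate("https://www.linkedin.com/feed/")

    # Extract info
    profiles <- remDr$findElements(using = "css selector", "div.profile")
    names <- sapply(profiles, function(x) x$getElementText()[[1]])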
    

    The key learning here was how to configure Selenium to route traffic through proxies. With that in place, handling JS-heavy sites became much easier!

    I hope these tips help you become an expert proxy handler in R! Let's conclude with some final words of wisdom.

    Key Takeaways - The Ideal Proxy Setup for Web Scraping

    After having configured proxies across dozens of scrapers over the years, here is what I think comprises an ideal setup:

  • Maintain a frequently updated pool of reliable residential IPs, preferably spanning different geographies
  • Have a central proxy rotation logic which checks, filters and recycles failed ones periodically
  • Enrich the list with authenticated proxies to handle sites needing logins
  • Intelligently filter proxies to scrape region-specific content
  • Support Selenium-based scraping for dynamic JS sites

    Getting all of the above right can get extremely complex quickly, cutting into the time available for actual data analytics.

    My advice, therefore, is to outsource proxy management to a capable third-party service like Proxies API, which handles the heavy lifting through simple APIs. Proxies API is my own SaaS service.

    It takes care of sourcing, validating and rotating millions of residential IPs automatically in the backend across footprints globally while exposing a straightforward interface to manage proxies across all my scrapers in Python, R, PHP etc.

    Definitely give Proxies API a spin with our 1000 request trial offer to simplify your scraping infrastructure.
