Scraping Without Headaches: Using Scala and scalaj.http with Proxy Servers

Jan 9, 2024 · 6 min read

Overview of Scalaj.http

For the uninitiated, scalaj.http is a convenient Scala wrapper over Java's HttpURLConnection to make HTTP requests.

It has a simple API that lets you do stuff like:

import scalaj.http._

val response = Http("http://www.example.com").asString
print(response.body)

This simplicity makes it quite popular, and if proxies are configured correctly, you can scale your scraping efforts without headaches.

Let's see the common proxy server options available.

Know Your Proxies

There are largely two types of proxies in use:

  1. HTTP Proxy: These forward plain HTTP requests and tunnel HTTPS connections using the CONNECT method. Sites see the proxy's IP/location, not yours. Plain HTTP traffic is readable by the proxy, while HTTPS stays encrypted between you and the target site.
  2. SOCKS Proxy: These route any TCP traffic, including HTTP and HTTPS. SOCKS adds no encryption of its own, so traffic is only as protected as the protocol it carries: HTTPS stays encrypted, plain HTTP does not.
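
To make the distinction concrete: on the JVM both kinds are represented by java.net.Proxy, and scalaj.http's .proxy(...) also accepts such an object. A minimal sketch using only the standard library, where the hostnames and ports are placeholders rather than real endpoints:

```scala
import java.net.{InetSocketAddress, Proxy}

object ProxyKinds {
  // createUnresolved skips the DNS lookup at construction time
  private def address(host: String, port: Int): InetSocketAddress =
    InetSocketAddress.createUnresolved(host, port)

  // Placeholder endpoints; substitute the values from your provider
  val httpProxy: Proxy  = new Proxy(Proxy.Type.HTTP, address("proxy.example.com", 8080))
  val socksProxy: Proxy = new Proxy(Proxy.Type.SOCKS, address("proxy.example.com", 1080))
}
```

The only difference between the two values is the Proxy.Type, which tells the JVM which protocol to speak to the proxy.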

Within these, you also have options like:

  • Shared Proxies: Cheap but risks blocks if too many people use them.
  • Dedicated Proxies: More expensive but you get the proxy's full bandwidth.
  • Rotating Proxies: A pool of proxies where each request uses a different proxy for maximum anonymity.

Now let's see how to configure them in scalaj.http.

    Basic Proxy Setup in Scalaj.http

    The simplest way is to call .proxy() on the request and pass the proxy host and port:

    import scalaj.http._
    
    val proxyHost = "1234.myproxy.com"
    val proxyPort = 8080
    
    val response = Http("http://www.example.com")
        .proxy(proxyHost, proxyPort)
        .asString
    

    Here my HTTP requests route through the proxy 1234.myproxy.com on port 8080. The target site sees the proxy's IP.

    We can also specify the proxy type explicitly. This defaults to HTTP:

    import java.net.Proxy
    import scalaj.http._
    
    val proxyHost = "1234.myproxy.com"
    val proxyPort = 8080
    val proxyType = Proxy.Type.SOCKS // Can also be Proxy.Type.HTTP
    
    // Route requests over a SOCKS proxy
    val response = Http("http://www.example.com")
        .proxy(proxyHost, proxyPort, proxyType)
        .asString
    

    And that's it for basic configuration!

    But in the real world, you often have to deal with stuff like authenticated proxies and HTTPS sites.

    Dealing With Authentication

    Many proxy providers require authentication to prevent abuse.

    This involves dealing with the Proxy-Authorization header, which tripped me up due to poor documentation.

    Here is how to authenticate your Scala proxy requests:

    import scalaj.http._
    
    val proxyHost = "buyproxy.com"
    val proxyPort = 3128
    
    // Proxy credentials (placeholders)
    val proxyUser = "my_username"
    val proxyPassword = "1234"
    
    // Route through the proxy and send a Basic Proxy-Authorization header
    val response = Http("http://www.example.com")
        .proxy(proxyHost, proxyPort)
        .proxyAuth(proxyUser, proxyPassword)
        .asString
    

    We set the proxy with .proxy() as before, then chain .proxyAuth() with the username and password. The key thing is that the credentials travel in the Proxy-Authorization header rather than as arguments to .proxy(). If your scalaj.http version lacks proxyAuth, you can set that header yourself with .header().

    This took me hours to figure out through trial and error! But now I can use any authenticated proxy easily.
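
For reference, the Basic scheme used in the Proxy-Authorization header is just the word "Basic" followed by the Base64 encoding of "username:password". A minimal sketch with the standard library, where the object and method names are my own and not part of scalaj.http:

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object ProxyAuthHeader {
  // Basic scheme: "Basic " + base64("username:password")
  def basic(username: String, password: String): String = {
    val raw = s"$username:$password".getBytes(StandardCharsets.UTF_8)
    "Basic " + Base64.getEncoder.encodeToString(raw)
  }
}
```

Remember that Base64 is an encoding, not encryption, so anyone who can read the header can recover the credentials.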

    HTTPS Calls Over Proxy

    Things get slightly tricky when using HTTPS sites compared to plain HTTP.

    Many proxy providers explicitly only support HTTPS traffic tunneling and not normal HTTP requests.

    This means your Scala client can fail with proxy errors (I saw "407 Proxy Authentication Required") on HTTP sites but work fine for HTTPS sites.

    After banging my head debugging non-working proxies, I found the reason: HTTPS requests go through the proxy via the HTTP CONNECT method, which opens a raw tunnel to the target site, whereas plain HTTP requests have to be forwarded by the proxy itself.

    So with HTTPS sites specifically:

    import scalaj.http._
    
    val proxyHost = "buyproxy.com"
    val proxyPort = 3128
    
    // Route HTTPS traffic through proxy
    Http("https://www.some-https-site.com")
        .proxy(proxyHost, proxyPort)
        .asString
    

    This lets my HTTPS requests tunnel safely through the proxy.
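
To show what the tunnel setup looks like on the wire, here is a sketch of the CONNECT request a client sends to the proxy before starting the TLS handshake. The JVM builds this for you; the helper below is purely illustrative and its names are my own:

```scala
object ConnectRequest {
  // Plain-text request sent to the proxy; the blank line terminates the headers
  def build(targetHost: String, targetPort: Int = 443): String =
    Seq(
      s"CONNECT $targetHost:$targetPort HTTP/1.1",
      s"Host: $targetHost:$targetPort",
      "",
      ""
    ).mkString("\r\n")
}
```

Once the proxy answers with a 2xx status, it relays bytes in both directions and the TLS session runs between your client and the target site.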

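    If you get 407 errors on HTTPS sites, first double-check your proxy credentials, since 407 means the proxy did not receive credentials it accepts. If tunneling fails for other reasons, reconsider your proxy choice. Not all providers support tunneling.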

    Going Pro with Custom Transports

    The .proxy() method is convenient but limited.

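    For advanced use cases you need control over the transport itself. scalaj.http has no transport plugin API, so the example here uses Akka HTTP, where you can plug in a custom ClientTransport.

    For example, this shows implementing a custom transport to rotate across multiple proxies randomly:

    import java.net.InetSocketAddress
    import scala.concurrent.Future
    import scala.util.Random
    
    import akka.actor.ActorSystem
    import akka.http.scaladsl.{ClientTransport, Http}
    import akka.http.scaladsl.Http.OutgoingConnection
    import akka.http.scaladsl.model.HttpRequest
    import akka.http.scaladsl.settings.{ClientConnectionSettings, ConnectionPoolSettings}
    import akka.stream.scaladsl.Flow
    import akka.util.ByteString
    
    // Container for proxy host/ports
    case class ProxyEndpoint(host: String, port: Int)
    
    // Custom transport class
    class RandomProxyTransport extends ClientTransport {
    
      // Available proxies
      private val proxies = Vector(
        ProxyEndpoint("proxy1.com", 8000),
        ProxyEndpoint("proxy2.com", 8000),
        ProxyEndpoint("proxy3.com", 8000)
      )
    
      override def connectTo(
        host: String,
        port: Int,
        settings: ClientConnectionSettings
      )(implicit system: ActorSystem): Flow[ByteString, ByteString, Future[OutgoingConnection]] = {
    
          // Pick a random proxy
          val proxy = proxies(Random.nextInt(proxies.size))
    
          // Tunnel to the real target through the randomly chosen proxy
          ClientTransport
            .httpsProxy(InetSocketAddress.createUnresolved(proxy.host, proxy.port))
            .connectTo(host, port, settings)
      }
    }
    
    // Usage:
    
    implicit val system: ActorSystem = ActorSystem("scraper")
    
    val transport = new RandomProxyTransport()
    val settings = ConnectionPoolSettings(system)
      .withConnectionSettings(ClientConnectionSettings(system).withTransport(transport))
    
    val request = HttpRequest(uri = "https://www.example.com")
    Http().singleRequest(request, settings = settings)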
    

    This allows me to add logic for stuff like:

  • Proxy authentication
  • Custom proxy retry logic
  • Load balancing across proxy pools

    The sky's the limit!
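
As an illustration of the retry idea in the list above, here is a minimal sketch that tries each proxy in turn and returns the first result that does not throw. ProxyEndpoint and firstSuccess are hypothetical names of my own, and the helper is independent of any HTTP library:

```scala
import scala.util.{Failure, Success, Try}

object ProxyRetry {
  final case class ProxyEndpoint(host: String, port: Int)

  // Tries each proxy in order and returns the first successful result,
  // or the last failure if every proxy fails.
  def firstSuccess[A](proxies: Seq[ProxyEndpoint])(attempt: ProxyEndpoint => A): Try[A] = {
    val noProxies: Try[A] = Failure(new IllegalArgumentException("no proxies configured"))
    proxies.foldLeft(noProxies) {
      case (ok @ Success(_), _) => ok                   // already succeeded, skip the rest
      case (_, proxy)           => Try(attempt(proxy))  // previous attempt failed, try this proxy
    }
  }
}
```

With scalaj.http the attempt could be something like p => Http(url).proxy(p.host, p.port).asString. Keep in mind that an HTTP error status such as 407 comes back as a normal response rather than an exception, so check response.code inside the attempt if you want those retried too.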

    How Akka HTTP Compares

    The other popular Scala http library is Akka HTTP.

    Its proxy support overlaps with Scalaj's in some ways and goes further in others:

  • Support for HTTP and HTTPS proxy tunneling
  • Custom transport plugins possible (via ClientTransport), which Scalaj does not offer

    However, in my experience, Akka HTTP has a steeper learning curve compared to the simplicity of Scalaj.

    Another difference is proxy authentication: Akka HTTP accepts credentials directly when you create the proxy transport, while with Scalaj you add the Proxy-Authorization header to the request.

    So if your use case is simple scraping, I found Scalaj faster to get off the ground with. But Akka HTTP offers richer features for complex use cases.

    Putting It All Together

    Let's take stock of what we've learnt through an example use case.

    Say I want to scrape user profile information from the site https://some-site.com.

    My IP kept getting blocked so I got a rotating residential proxy service superproxy.io with US and Europe proxies.

    It requires authentication so I have the credentials.

    Here's how I can leverage all techniques learnt:

    import scalaj.http._
    
    // Credentials
    val username = "my_username"
    val password = "my_secret"
    
    // Pick a random proxy (getRandProxy is a helper you define over your proxy pool)
    val proxyHost = getRandProxy()
    val proxyPort = 8080
    
    // Route requests over rotating authenticated proxy
    val response = Http("https://some-site.com/users/john_doe")
        .proxy(proxyHost, proxyPort)
        .proxyAuth(username, password)
        .asString
    
    // Extract data..
    val name = extractName(response.body)
    println(name)
    

    This lets me keep scraping day and night without IP blocks or usage limits!
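
The getRandProxy() helper in the example above is not defined in this article. A minimal sketch of one way to write it, assuming a fixed in-memory list of proxy hostnames (the hostnames below are placeholders, not real endpoints of any provider):

```scala
import scala.util.Random

object ProxyPool {
  // Hypothetical hostnames; replace with the endpoints your provider gives you
  val hosts: Vector[String] = Vector(
    "us.proxy.example.com",
    "eu.proxy.example.com"
  )

  // Picks one host uniformly at random on each call
  def getRandProxy(random: Random = new Random()): String =
    hosts(random.nextInt(hosts.size))
}
```

After import ProxyPool._, the call getRandProxy() in the example resolves to this helper. Passing a seeded Random makes the selection reproducible in tests.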

    Some key things:

  • Geo rotation to avoid region blocks
  • Username/password auth handling
  • Custom transports possible for advanced rotation logic

    Phew, that was quite the epic journey!

    We went from the basics of using Scala and scalaj.http to leveraging proxies for effective large-scale web scraping without headaches.

    Hopefully the practical tips shared here based on painful experience will help you avoid common scraper issues like captchas and blocks.

    Of course, self-hosting and maintaining proxies introduces operational complexities. Authentication, tunneling, region rotation, and proxy refreshing to avoid blocks all involve quite some engineering.

    My Proxies API service takes care of these complexities through a simple API.

    With auto region rotating residential proxies, captcha solving and built-in retry logic, it lets me focus on writing scraping logic instead of proxy management.

    It also seamlessly handles Javascript rendering, cookies, headers etc. Do check it out if you want to scrape at scale without headaches!
