Scraping Without Headaches: Using Scala and scalaj.http with Proxy Servers

Overview of Scalaj.http

For the uninitiated, scalaj.http is a convenient Scala wrapper over Java's HttpURLConnection to make HTTP requests.

It has a simple API that lets you do stuff like:

import scalaj.http._

val response = Http("http://www.example.com").asString
print(response.body)

This simplicity makes it quite popular, and if proxies are configured correctly, you can scale your scraping efforts without headaches.

Let's see the common proxy server options available.

Know Your Proxies

There are largely two types of proxies in use:

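HTTP Proxy: These understand HTTP. Plain HTTP requests are forwarded by the proxy, which can read them, while HTTPS requests are tunneled through it with the CONNECT method and stay encrypted between you and the target site. Either way, sites see the proxy's IP/location, not yours.
SOCKS Proxy: These work at a lower level and relay any TCP traffic, including HTTP and HTTPS. SOCKS adds no encryption of its own, so traffic is only encrypted if the protocol inside it (such as HTTPS) is. They can also be slower, depending on the provider.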

Within these, you also have options like:

Shared Proxies: Cheap, but you risk blocks if too many people use them.

Dedicated Proxies: More expensive but you get the proxy's full bandwidth.

Rotating Proxies: A pool of proxies where each request uses a different proxy for maximum anonymity.
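To make the rotating idea concrete, here is a minimal round-robin sketch in plain Scala. The ProxyEndpoint and ProxyPool names and the proxy hostnames are made up for illustration and are not part of scalaj.http; a commercial rotating service usually does this for you behind a single gateway address.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical container for one proxy endpoint (not part of scalaj.http)
final case class ProxyEndpoint(host: String, port: Int)

// Hands out proxies in round-robin order; safe to share between threads
final class ProxyPool(endpoints: Vector[ProxyEndpoint]) {
  require(endpoints.nonEmpty, "the pool needs at least one proxy")

  private val counter = new AtomicInteger(0)

  // Each call returns the next endpoint, wrapping around at the end
  def next(): ProxyEndpoint = {
    val index = Math.floorMod(counter.getAndIncrement(), endpoints.size)
    endpoints(index)
  }
}

// Placeholder hostnames, replace with your provider's endpoints
val pool = new ProxyPool(Vector(
  ProxyEndpoint("proxy1.example.com", 8080),
  ProxyEndpoint("proxy2.example.com", 8080),
  ProxyEndpoint("proxy3.example.com", 8080)
))

// Four calls cycle through the three proxies and wrap back to the first
val firstFour = Vector.fill(4)(pool.next().host)
println(firstFour.mkString(", "))
```

Each endpoint returned by pool.next() can then be handed to a request as .proxy(endpoint.host, endpoint.port).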

Now let's see how to configure them in scalaj.http.

Basic Proxy Setup in Scalaj.http

The simplest way is to call .proxy() on the request and pass the proxy host and port:

import scalaj.http._

val proxyHost = "1234.myproxy.com"
val proxyPort = 8080

val response = Http("http://www.example.com")
    .proxy(proxyHost, proxyPort)
    .asString

Here my HTTP requests route through the proxy 1234.myproxy.com on port 8080. The target site sees the proxy's IP.

We can also specify the proxy type explicitly with a third argument. When it is omitted, the type defaults to HTTP:

import java.net.Proxy
import scalaj.http._

val proxyHost = "1234.myproxy.com"
val proxyPort = 8080
val proxyType = Proxy.Type.SOCKS // Can also be Proxy.Type.HTTP

// Route requests over a SOCKS proxy
val response = Http("http://www.example.com")
    .proxy(proxyHost, proxyPort, proxyType)
    .asString

And that's it for basic configuration!

But in the real world, you often have to deal with stuff like authenticated proxies and HTTPS sites.

Dealing With Authentication

Many proxy providers require authentication to prevent abuse.

This involves dealing with the Proxy-Authorization header, which tripped me up due to poor documentation.

Here is how to authenticate your Scala proxy requests:

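import java.nio.charset.StandardCharsets
import java.util.Base64
import scalaj.http._

val proxyHost = "buyproxy.com"
val proxyPort = 3128

// Build the Basic credentials: base64("username:password")
val proxyAuth = Base64.getEncoder.encodeToString(
  "my_username:1234".getBytes(StandardCharsets.UTF_8))

// Send the credentials to the proxy in the Proxy-Authorization header
val response = Http("http://www.example.com")
    .proxy(proxyHost, proxyPort)
    .header("Proxy-Authorization", s"Basic $proxyAuth")
    .asString

We set the proxy with the usual .proxy(host, port) call and attach the credentials ourselves with .header(). The key thing is that they go in the Proxy-Authorization header, not Authorization, because they are meant for the proxy rather than the target site. This covers plain HTTP targets like the one above; HTTPS targets behave differently, which the next section gets into.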

This took me hours to figure out through trial and error! But now I can use any authenticated proxy easily.

HTTPS Calls Over Proxy

Things get slightly tricky when using HTTPS sites compared to plain HTTP.

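Some proxy providers only allow HTTPS tunneling and reject plain HTTP requests, so check what your provider supports.

A "407 Proxy Authentication Required" error is a different problem: it comes from the proxy, not the target site, and means the proxy did not receive credentials it accepts.

After banging my head debugging non-working proxies, I found the reason is that HTTPS uses the CONNECT tunneling method under the hood. The client first asks the proxy to open a tunnel to the target host, and the encrypted request then travels inside that tunnel, so the proxy has to accept the CONNECT step before anything else happens.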

So with HTTPS sites specifically:

import scalaj.http._

val proxyHost = "buyproxy.com"
val proxyPort = 3128

// Route HTTPS traffic through proxy
Http("https://www.some-https-site.com")
    .proxy(proxyHost, proxyPort)
    .asString

This lets my HTTPS requests tunnel safely through the proxy.

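If you still get 407 authentication errors on HTTPS sites, first check that your credentials actually reach the proxy during the CONNECT step. On Java 8u111 and later, Basic authentication for HTTPS tunneling is disabled by default through the jdk.http.auth.tunneling.disabledSchemes system property. If the proxy refuses CONNECT requests altogether, reconsider your proxy choice. Not all providers support tunneling.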
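Because scalaj.http sits on top of HttpURLConnection, one way to supply credentials for the CONNECT step is the JDK's own java.net.Authenticator. The sketch below uses only JDK classes. The installProxyCredentials helper name and the credentials are placeholders of mine, and as far as I can tell the system property should be set before the first HTTP request is made (or passed as a -D flag) to take effect.

```scala
import java.net.{Authenticator, PasswordAuthentication}

// Hypothetical helper: registers proxy credentials with the JDK
def installProxyCredentials(user: String, password: String): Unit = {
  // Since Java 8u111, Basic auth is disabled for HTTPS tunneling by default;
  // an empty value re-enables it. Set this before the first HTTP request.
  System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "")

  Authenticator.setDefault(new Authenticator {
    override def getPasswordAuthentication: PasswordAuthentication =
      // Only answer challenges that come from a proxy, never from a website
      if (getRequestorType == Authenticator.RequestorType.PROXY)
        new PasswordAuthentication(user, password.toCharArray)
      else null
  })
}

installProxyCredentials("my_username", "1234")
```

After this call, a request set up with .proxy(proxyHost, proxyPort) should be able to answer the proxy's 407 challenge on its own. The authenticator is JVM-wide, so this approach suits one set of proxy credentials per process.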

Going Pro with Custom Transports

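The .proxy() method is convenient but limited.

scalaj.http itself has no pluggable transport layer, so for advanced use cases I reach for Akka HTTP, which lets you plug in custom transports with more control.

For example, this shows an Akka HTTP custom transport that picks a random proxy for each new connection:

import java.net.InetSocketAddress
import scala.concurrent.Future
import scala.util.Random
import akka.actor.ActorSystem
import akka.http.scaladsl.{ClientTransport, Http}
import akka.http.scaladsl.Http.OutgoingConnection
import akka.http.scaladsl.model.HttpRequest
import akka.http.scaladsl.settings.{ClientConnectionSettings, ConnectionPoolSettings}
import akka.stream.scaladsl.Flow
import akka.util.ByteString

// Container for proxy host/ports
case class Proxy(host: String, port: Int)

// Custom transport class
class RandomProxyTransport extends ClientTransport {

  // Available proxies
  private val proxies = Vector(
    Proxy("proxy1.com", 8000),
    Proxy("proxy2.com", 8000),
    Proxy("proxy3.com", 8000)
  )

  override def connectTo(
    host: String,
    port: Int,
    settings: ClientConnectionSettings
  )(implicit system: ActorSystem): Flow[ByteString, ByteString, Future[OutgoingConnection]] = {

    // Pick a random proxy
    val proxy = proxies(Random.nextInt(proxies.size))

    // Tunnel to the target host through the chosen proxy (HTTP CONNECT)
    ClientTransport
      .httpsProxy(InetSocketAddress.createUnresolved(proxy.host, proxy.port))
      .connectTo(host, port, settings)
  }
}

// Usage:

implicit val system: ActorSystem = ActorSystem("scraper")

val transport = new RandomProxyTransport()
val settings = ConnectionPoolSettings(system)
  .withConnectionSettings(ClientConnectionSettings(system).withTransport(transport))

// Pooled connections are reused, so the proxy changes per connection, not per request
Http().singleRequest(HttpRequest(uri = "https://www.example.com"), settings = settings)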

This allows me to add logic for stuff like:

Proxy authentication

Custom proxy retry logic

Load balancing across proxy pools

The sky's the limit!
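As one example of the retry idea from the list above, here is a small library-agnostic sketch. The fetchVia function stands in for whatever actually performs the request through a given proxy; it, the ProxyEndpoint type, and the firstSuccess helper are my own illustrative names, not part of any library.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical container for one proxy endpoint
final case class ProxyEndpoint(host: String, port: Int)

// Tries each proxy in order and returns the first successful result.
// If every proxy fails, the last failure is returned.
def firstSuccess[A](proxies: List[ProxyEndpoint])(fetchVia: ProxyEndpoint => A): Try[A] = {
  @annotation.tailrec
  def loop(remaining: List[ProxyEndpoint], lastError: Throwable): Try[A] =
    remaining match {
      case Nil => Failure(lastError)
      case proxy :: rest =>
        Try(fetchVia(proxy)) match {
          case ok @ Success(_) => ok
          case Failure(error)  => loop(rest, error)
        }
    }

  loop(proxies, new NoSuchElementException("no proxies configured"))
}

// Stand-in for a real request: pretend the first proxy is blocked
def fakeFetch(proxy: ProxyEndpoint): String =
  if (proxy.host == "proxy1.example.com") throw new java.io.IOException("blocked")
  else s"fetched via ${proxy.host}"

val result = firstSuccess(List(
  ProxyEndpoint("proxy1.example.com", 8000),
  ProxyEndpoint("proxy2.example.com", 8000)
))(fakeFetch)

println(result) // Success(fetched via proxy2.example.com)
```

Keeping the retry loop separate from the HTTP call means the same helper works whether the request is made with scalaj.http or Akka HTTP.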

How Akka HTTP Compares

The other popular Scala HTTP library is Akka HTTP.

It overlaps with Scalaj on the basics of proxy configuration, with one notable difference:

Both support sending requests through a proxy, including HTTPS tunneling

Custom transport plugins are an Akka HTTP feature; scalaj.http has no equivalent

However, in my experience, Akka HTTP has a steeper learning curve compared to the simplicity of Scalaj.

Another key difference is that Akka HTTP supports proxy authentication out-of-the-box, while Scalaj required custom handling.

So if your use case is simple scraping, I found Scalaj faster to get off the ground with. But Akka HTTP offers richer features for complex use cases.
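For comparison, this is roughly what the out-of-the-box proxy authentication looks like in Akka HTTP. It is a sketch assuming Akka HTTP 10.1 or later on the classpath; the proxy address, credentials, and target URL are placeholders.

```scala
import java.net.InetSocketAddress
import akka.actor.ActorSystem
import akka.http.scaladsl.{ClientTransport, Http}
import akka.http.scaladsl.model.HttpRequest
import akka.http.scaladsl.model.headers.BasicHttpCredentials
import akka.http.scaladsl.settings.{ClientConnectionSettings, ConnectionPoolSettings}

implicit val system: ActorSystem = ActorSystem("proxy-demo")

// Placeholder proxy address and credentials
val proxyAddress = InetSocketAddress.createUnresolved("buyproxy.com", 3128)
val credentials = BasicHttpCredentials("my_username", "1234")

// Built-in transport that tunnels through the proxy and authenticates to it
val transport = ClientTransport.httpsProxy(proxyAddress, credentials)

val settings = ConnectionPoolSettings(system)
  .withConnectionSettings(ClientConnectionSettings(system).withTransport(transport))

// Returns a Future[HttpResponse]
val responseFuture = Http().singleRequest(
  HttpRequest(uri = "https://www.example.com"),
  settings = settings
)
```

The credentials are passed once when the transport is created, so no header handling is needed in the request code itself.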

Putting It All Together

Let's take stock of what we've learnt through an example use case.

Say I want to scrape user profile information from the site https://some-site.com.

My IP kept getting blocked so I got a rotating residential proxy service superproxy.io with US and Europe proxies.

It requires authentication so I have the credentials.

Here's how I can leverage all techniques learnt:

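import java.net.{Authenticator, PasswordAuthentication}
import scalaj.http._

// Credentials
val username = "my_username"
val password = "my_secret"

// The target is HTTPS, so allow Basic auth on the CONNECT tunnel
// (disabled by default since Java 8u111; set before the first request)
System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "")

// Answer the proxy's 407 challenge with our credentials
Authenticator.setDefault(new Authenticator {
  override def getPasswordAuthentication: PasswordAuthentication =
    if (getRequestorType == Authenticator.RequestorType.PROXY)
      new PasswordAuthentication(username, password.toCharArray)
    else null
})

// Pick a random proxy (getRandProxy() is a placeholder for your own selection logic)
val proxyHost = getRandProxy()
val proxyPort = 8080

// Route requests over rotating authenticated proxy
val response = Http("https://some-site.com/users/john_doe")
    .proxy(proxyHost, proxyPort)
    .asString

// Extract data (extractName() is a placeholder for your own parsing code)
val name = extractName(response.body)
println(name)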

This lets me keep scraping day and night with far fewer IP blocks!

Some key things:

Geo rotation to avoid region blocks

Username/password auth handling

Custom transports (in Akka HTTP) possible for advanced rotation logic

Phew, that was quite the epic journey!

We went from the basics of using Scala and scalaj.http to leveraging proxies for effective large-scale web scraping without headaches.

Hopefully the practical tips shared here based on painful experience will help you avoid common scraper issues like captchas and blocks.

Of course, self-hosting and maintaining proxies introduces operational complexities. The authentication, tunneling, region rotation, and proxy refreshing needed to avoid blocks involve a fair amount of engineering.

My Proxies API service takes care of these complexities through a simple API.

With auto region rotating residential proxies, captcha solving and built-in retry logic, it lets me focus on writing scraping logic instead of proxy management.

It also seamlessly handles JavaScript rendering, cookies, headers etc. Do check it out if you want to scrape at scale without headaches!
