Introduction to Scraping Reddit Posts in Scala

Jan 9, 2024 · 6 min read

In this beginner-friendly guide, we will walk through a complete Scala program to scrape content from Reddit. We will extract key information like post titles, permalinks, authors and scores from Reddit posts on a webpage.

here is the page we are talking about

This will demonstrate a practical real-world example of web scraping using Scala's helpful libraries to download webpages and parse their content.

Understanding the Imports

Let's briefly go over what each import is doing:

import play.api.libs.ws._
import play.api.libs.ws.ahc._

These imports provide the web scraping capability through Play Framework's WS library (AhcWSClient).

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Await
import scala.concurrent.duration._

These imports allow asynchronous HTTP requests and waiting for their responses.

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

Jsoup is a Java library used to parse and extract data from HTML and XML documents. We will use it to analyze the content we download.

Walkthrough of Main Components

Let's now understand this code section-by-section:

val redditUrl = "<https://www.reddit.com>"

This defines the Reddit URL we want to scrape. We are getting the front page.

val headers = Seq(
    "User-Agent" -> "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
)

This header makes the request appear to be from a real Chrome browser. Sites like Reddit block automated scrapers so this fools their defenses.

val wsClient = AhcWSClient()

This creates an Async HTTP client using the AhcWSClient() factory method.

val responseFuture = wsClient.url(redditUrl).withHttpHeaders(headers:_*).get()
val response = Await.result(responseFuture, 10.seconds)

Here we make the GET request to the Reddit URL with our spoofed browser headers. We await the response for 10 seconds.

if (response.status == 200) {
   // process content
} else {
   println(s"Failed, status code ${response.status}")
}

We check if the status code returned is 200 OK. If so, we have successfully downloaded the page and can process the content.

val htmlContent = response.body[String]

This extracts the HTML content from the response body as a String.

scala.tools.nsc.io.File(filename).writeAll(htmlContent)

We write the HTML string into a file for later analysis.

val document: Document = Jsoup.parse(htmlContent)

Jsoup parses the HTML string into a Document object we can now query.

Extracting Data with Selectors

The most complex part of this program is extracting the exact data we want from the HTML content.

Inspecting the elements

Upon inspecting the HTML in Chrome, you will see that each of the posts have a particular element shreddit-post and class descriptors specific to them…

Jsoup provides CSS-style selectors to target elements by IDs, classes, attributes etc. This takes practice but is very powerful.

Let's break down this selector:

val blocks = document.select(".block.relative.cursor-pointer.bg-neutral-background.focus-within:bg-neutral-background-hover.hover:bg-neutral-background-hover.xs:rounded-\\[16px\\].p-md.my-2xs.nd:visible")

The .select() method searches the document for elements matching a CSS selector.

Our selector finds all elements with the CSS classes block AND relative AND cursor-pointer and so on, chained with periods.

This uniquely targets the Reddit post block elements even though the class names are generic.

Let's walk through extracting each field:

Permalink

val permalink = block.attr("permalink")

The attr() method gets the value of an attribute, here permalink.

Content URL

val contentHref = block.attr("content-href")

Similar to above, this selector extracts the post's content URL from the content-href attribute.

Comment Count

val commentCount = block.attr("comment-count")

The number of comments is stored conveniently in the comment-count attribute.

Post Title

val postTitle = block.select("div[slot=title]").text().trim

Here we search within each block for a div with a slot attribute equal to title. We extract the title text and trim whitespace.

Author

val author = block.attr("author")

Each block contains the post author's name in an author attribute.

Score

val score = block.attr("score")

The post score comes from the score attribute on each post block.

And that's the process! As you can see, CSS selectors are immensely powerful for locating and extracting anything in an HTML document.

Putting It All Together

Now that we've broken down the components, here is the full code to scrape Reddit posts:

import play.api.libs.ws._
import play.api.libs.ws.ahc._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Await
import scala.concurrent.duration._
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

object RedditScraper extends App {

  // Define the Reddit URL you want to download
  val redditUrl = "https://www.reddit.com"

  // Define a User-Agent header
  val headers = Seq(
    "User-Agent" -> "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
  )

  // Create an AsyncHttpClient instance
  val wsClient = AhcWSClient()

  // Send a GET request to the URL with the User-Agent header
  val responseFuture = wsClient.url(redditUrl).withHttpHeaders(headers: _*).get()

  val response = Await.result(responseFuture, 10.seconds)

  // Check if the request was successful (status code 200)
  if (response.status == 200) {
    // Get the HTML content of the page
    val htmlContent = response.body[String]

    // Specify the filename to save the HTML content
    val filename = "reddit_page.html"

    // Save the HTML content to a file
    scala.tools.nsc.io.File(filename).writeAll(htmlContent)

    println(s"Reddit page saved to $filename")

    // Parse the HTML content
    val document: Document = Jsoup.parse(htmlContent)

    // Find all blocks with the specified tag and class
    val blocks = document.select(".block.relative.cursor-pointer.bg-neutral-background.focus-within:bg-neutral-background-hover.hover:bg-neutral-background-hover.xs:rounded-[16px].p-md.my-2xs.nd:visible")

    // Iterate through the blocks and extract information from each one
    blocks.forEach { block =>
      val permalink = block.attr("permalink")
      val contentHref = block.attr("content-href")
      val commentCount = block.attr("comment-count")
      val postTitle = block.select("div[slot=title]").text().trim
      val author = block.attr("author")
      val score = block.attr("score")

      // Print the extracted information for each block
      println(s"Permalink: $permalink")
      println(s"Content Href: $contentHref")
      println(s"Comment Count: $commentCount")
      println(s"Post Title: $postTitle")
      println(s"Author: $author")
      println(s"Score: $score")
      println("\n")
    }
  } else {
    println(s"Failed to download Reddit page (status code ${response.status})")
  }

  // Close the AsyncHttpClient
  wsClient.close()
}

Browse by tags:

Browse by language:

Tired of getting blocked while scraping the web?

ProxiesAPI handles headless browsers and rotates proxies for you.
Get access to 1,000 free API credits, no credit card required!