Scraping Hacker News with Scala

Jan 21, 2024 · 9 min read

Web scraping is a technique for extracting information from websites. In this beginner Scala tutorial, we'll walk through code that scrapes article data from the Hacker News homepage using the Jsoup Java library.

Here is the page we'll be scraping: the Hacker News homepage at https://news.ycombinator.com/.

Installation

First, add Jsoup to your project. If you build with Maven, add the following dependency:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
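If you build with sbt instead of Maven (common for Scala projects), the equivalent line in build.sbt would be the one below; the coordinates mirror the Maven snippet above.

```scala
// build.sbt -- same artifact and version as the Maven dependency above
libraryDependencies += "org.jsoup" % "jsoup" % "1.13.1"
```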

Overview

Here is the full code we'll be going through:

import org.jsoup.Jsoup
// Needed to iterate Jsoup's Java collections with a Scala for loop (Scala 2.13+)
import scala.jdk.CollectionConverters._

object HackerNewsScraper {

def main(args: Array[String]): Unit = {

    // Define the URL of the Hacker News homepage
    val url = "https://news.ycombinator.com/"

    // Send a GET request to the URL and parse the HTML content
    val doc = Jsoup.connect(url).get()

    // Find all rows in the table
    val rows = doc.select("tr")

    // Initialize variables to keep track of the current article and row type
    var currentArticle: org.jsoup.nodes.Element = null
    var currentRowType: String = null

    // Iterate through the rows to scrape articles
    for (row <- rows.asScala) {

        if (row.hasClass("athing")) {
            // This is an article row
            currentArticle = row
            currentRowType = "article"

        } else if (currentRowType == "article") {
            // This is the details row

            if (currentArticle != null) {
                // Extract information from the current article and details row

                val titleElem = currentArticle.select("span.titleline")

                if (titleElem != null && titleElem.size() > 0) {

                    val articleTitle = titleElem.select("a").text() // Get the text of the anchor element

                    val articleUrl = titleElem.select("a").attr("href") // Get the href attribute of the anchor element

                    val subtext = row.select("td.subtext")

                    val points = subtext.select("span.score").text()

                    val author = subtext.select("a.hnuser").text()

                    val timestamp = subtext.select("span.age").attr("title")

                    val commentsElem = subtext.select("a:contains(comments)")

                    val comments = if (commentsElem != null && commentsElem.size() > 0) commentsElem.text() else "0"

                    // Print the extracted information
                    println("Title: " + articleTitle)
                    println("URL: " + articleUrl)
                    println("Points: " + points)
                    println("Author: " + author)
                    println("Timestamp: " + timestamp)
                    println("Comments: " + comments)
                    println("-" * 50) // Separating articles
                }
            }

            // Reset the current article and row type
            currentArticle = null
            currentRowType = null

        } else if (row.attr("style") == "height:5px") {
            // This is the spacer row, skip it
            // do nothing
        }

    }

}

}

In the rest of this article, we'll break down what each section is doing.

Importing Jsoup

We first import Jsoup, which is the Java library that allows us to make HTTP requests and parse HTML:

import org.jsoup.Jsoup

Defining the Entry Point

Next we define a Scala object with a main method to serve as the entry point when running the code:

object HackerNewsScraper {

    def main(args: Array[String]): Unit = {

        // scraping code goes here

    }

}

Getting the Hacker News Page HTML

Inside the main method, we start by defining the URL of the Hacker News homepage:

val url = "https://news.ycombinator.com/"

We then use Jsoup to send a GET request to this URL and parse/load the returned HTML content:

val doc = Jsoup.connect(url).get()

The doc variable now contains a Jsoup Document representing the parsed HTML document of the Hacker News homepage.

Selecting Rows

Inspecting the page with your browser's developer tools, you can see that each article row is a tr tag with the class athing.

With the HTML document loaded, we next find all table rows on the page using a CSS selector:

val rows = doc.select("tr")

This gives us a list of tr elements to iterate through later.
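select accepts CSS selectors and returns a Jsoup Elements collection. To see the behavior without touching the network, you can run selectors against an inline HTML fragment; the markup below is a simplified stand-in for the real Hacker News table, not the actual page:

```scala
import org.jsoup.Jsoup

object SelectDemo {
  def main(args: Array[String]): Unit = {
    // Simplified stand-in for the Hacker News table markup
    val html =
      """<table>
        |  <tr class="athing"><td><span class="titleline">
        |    <a href="item?id=1">Example title</a></span></td></tr>
        |  <tr><td class="subtext"><span class="score">42 points</span></td></tr>
        |</table>""".stripMargin

    val doc = Jsoup.parse(html)
    println(doc.select("tr").size())                    // both rows match
    println(doc.select("span.titleline a").text())      // text inside the title link
    println(doc.select("td.subtext span.score").text()) // text of the score span
  }
}
```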

Tracking State

We initialize two variables to keep track of state as we parse each row:

var currentArticle: org.jsoup.nodes.Element = null
var currentRowType: String = null
  • currentArticle holds the current article row element
  • currentRowType tracks if we are on an "article" or "details" row

Iterating Rows

We loop through each row to identify article content:

for (row <- rows.asScala) {

    // parsing logic

}

Note that rows is a Java collection, so we convert it with asScala to iterate it in a Scala for loop.

Inside this loop is the main parsing logic.

Identifying Article Rows

We first check if a row has the "athing" class to determine if it is an article:

if (row.hasClass("athing")) {

    currentArticle = row
    currentRowType = "article"

}

If so, we save the row to currentArticle and set currentRowType accordingly.
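The athing check and the row-type flag form a small two-state machine: see an article row, remember it; see the next row, treat it as its details. Stripped of Jsoup, the same pattern can be sketched in plain Scala, with hypothetical string labels standing in for real tr elements:

```scala
object PairingDemo {
  def main(args: Array[String]): Unit = {
    // Simulated row stream: "A:" marks an article row, "D:" its details row,
    // "S" a spacer row -- stand-ins for the real <tr> elements.
    val rows = List("A:First post", "D:100 points", "A:Second post", "D:50 points", "S")

    var currentArticle: Option[String] = None
    val paired = scala.collection.mutable.ListBuffer[(String, String)]()

    for (row <- rows) {
      if (row.startsWith("A:")) {
        // Article row: remember it and wait for the details row
        currentArticle = Some(row.drop(2))
      } else if (row.startsWith("D:")) {
        // Details row: pair it with the article we just saw, then reset
        currentArticle.foreach(title => paired += ((title, row.drop(2))))
        currentArticle = None
      }
      // Spacer rows fall through untouched
    }

    paired.foreach { case (t, d) => println(s"$t -> $d") }
  }
}
```

Using Option here instead of null is the more idiomatic Scala choice; the same refactor would also work in the scraper itself.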

Identifying Details Rows

Next, we check if the previous row type was an "article", indicating this is a details row:

} else if (currentRowType == "article") {

    // extract details for current article

}

Extracting Article Data

Inside the details row, we can now extract information from both the article and details rows, since we still have the former available in currentArticle:

// Extract information from current article and details row

val titleElem = currentArticle.select("span.titleline")

if (titleElem != null && titleElem.size() > 0) {

    val articleTitle = titleElem.select("a").text()

    val articleUrl = titleElem.select("a").attr("href")

    // extract other fields like points, author etc.

    println(articleTitle)
    println(articleUrl)

}

Let's focus on how the title field is extracted.

First, we select the element from currentArticle which contains the title anchor tag:

val titleElem = currentArticle.select("span.titleline")

Notice here we are using a CSS selector again, targeting the element by class name.

If titleElem matched at least one element, we extract the text content of the anchor tag with:

val articleTitle = titleElem.select("a").text()

This drills from the span.titleline element down into its anchor tag and returns the text inside the link.

The URL extraction works similarly, getting the href attribute value instead of the text content.
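One caveat: on Hacker News, external stories carry absolute URLs, while self posts have relative hrefs like item?id=…. Jsoup can hand back the absolute form directly via attr("abs:href"); equivalently, you can resolve against the base URL with the standard library (the item id below is made up for illustration):

```scala
import java.net.URI

object ResolveDemo {
  def main(args: Array[String]): Unit = {
    val base = new URI("https://news.ycombinator.com/")

    // An external story link is already absolute; resolving leaves it unchanged.
    println(base.resolve("https://example.com/story"))

    // A self post's relative href resolves against the homepage.
    println(base.resolve("item?id=12345"))
  }
}
```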

The same process is followed to extract points, author, timestamp, and comment count, each into its own variable, by selecting different elements from either the article row or the details row by class or tag name.

Finally, we print the extracted information for each article.
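Note that these fields come back as raw strings such as "123 points" or "45 comments" (or "discuss" when a story has no comments yet; those exact formats are an assumption based on HN's current markup). If you want numbers, a small helper can pull out the leading integer:

```scala
object FieldParseDemo {
  // Pull the first run of digits out of a field like "123 points";
  // returns 0 when no digits are present (e.g. the "discuss" link).
  def leadingInt(s: String): Int =
    "\\d+".r.findFirstIn(s).map(_.toInt).getOrElse(0)

  def main(args: Array[String]): Unit = {
    println(leadingInt("123 points"))   // 123
    println(leadingInt("45 comments"))  // 45
    println(leadingInt("discuss"))      // 0
  }
}
```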

Resetting State

After extracting article details, we set currentArticle and currentRowType back to null to prepare for the next article row.

Skipping Spacer Rows

We also include logic to detect and skip spacer rows: the final branch checks for a style attribute of height:5px, which marks the empty spacer rows between entries, and does nothing with them.

And that covers the full logic to scrape article information from Hacker News using Jsoup selectors!

Full Code

Here is the complete runnable code again for reference:

import org.jsoup.Jsoup
// Scala 2.13+; on 2.12 use scala.collection.JavaConverters instead
import scala.jdk.CollectionConverters._

object HackerNewsScraper {
  def main(args: Array[String]): Unit = {
    // Define the URL of the Hacker News homepage
    val url = "https://news.ycombinator.com/"

    // Send a GET request to the URL and parse the HTML content
    val doc = Jsoup.connect(url).get()

    // Find all rows in the table
    val rows = doc.select("tr")

    // Initialize variables to keep track of the current article and row type
    var currentArticle: org.jsoup.nodes.Element = null
    var currentRowType: String = null

    // Iterate through the rows to scrape articles (asScala lets us use a Scala for loop)
    for (row <- rows.asScala) {
      if (row.hasClass("athing")) {
        // This is an article row
        currentArticle = row
        currentRowType = "article"
      } else if (currentRowType == "article") {
        // This is the details row
        if (currentArticle != null) {
          // Extract information from the current article and details row
          val titleElem = currentArticle.select("span.titleline")
          if (titleElem != null && titleElem.size() > 0) {
            val articleTitle = titleElem.select("a").text() // Get the text of the anchor element
            val articleUrl = titleElem.select("a").attr("href") // Get the href attribute of the anchor element

            val subtext = row.select("td.subtext")
            val points = subtext.select("span.score").text()
            val author = subtext.select("a.hnuser").text()
            val timestamp = subtext.select("span.age").attr("title")
            val commentsElem = subtext.select("a:contains(comments)")
            val comments = if (commentsElem != null && commentsElem.size() > 0) commentsElem.text() else "0"

            // Print the extracted information
            println("Title: " + articleTitle)
            println("URL: " + articleUrl)
            println("Points: " + points)
            println("Author: " + author)
            println("Timestamp: " + timestamp)
            println("Comments: " + comments)
            println("-" * 50) // Separating articles
          }
        }
        // Reset the current article and row type
        currentArticle = null
        currentRowType = null
      } else if (row.attr("style") == "height:5px") {
        // This is the spacer row, skip it
        // do nothing
      }
    }
  }
}

This is great as a learning exercise, but scraping from a single IP is easy to block. If you need thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must; otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with this simple API, which can be called from any programming language. We also render JavaScript behind the scenes, so you can simply call a URL with render support and parse the returned data in whatever language or framework you prefer.

We have a running offer of 1000 API calls completely free. Register and get your free API Key.

