Scraping Hacker News with Kotlin

Jan 21, 2024 · 7 min read

Hacker News is a popular social news website in the technology and startup community. In this beginner tutorial, we will scrape key details from Hacker News article listings using Kotlin and print article titles, URLs, points, authors, timestamps and comment counts.


Imports

Let's look at the imports needed:

import okhttp3.OkHttpClient
import okhttp3.Request
import org.jsoup.Jsoup
  • OkHttpClient - Makes HTTP requests to fetch content
  • Request - Represents an HTTP request
  • Jsoup - Java library to parse and extract data from HTML

These classes come from two external libraries, OkHttp and Jsoup. Together they allow us to retrieve the Hacker News page and parse its content.
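Both libraries have to be on the classpath before the imports resolve. As a rough sketch, assuming a Gradle project that uses the Kotlin DSL, the dependency block could look like the following. The version numbers are illustrative assumptions, not requirements of this tutorial.

```kotlin
// build.gradle.kts (assumed build setup; versions are illustrative)
dependencies {
    implementation("com.squareup.okhttp3:okhttp:4.12.0") // HTTP client
    implementation("org.jsoup:jsoup:1.17.2")             // HTML parser
}
```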

    Fetching the Page

    First we define the Hacker News homepage URL:

    val url = "https://news.ycombinator.com/"
    

    Next we create an OkHttpClient instance to make requests:

    val client = OkHttpClient()
    

    And build a simple GET request for the URL:

    val request = Request.Builder()
        .url(url)
        .build()
    

    Finally we execute the request and handle the response:

    client.newCall(request).execute().use { response ->
    
        // Parse response here
    
    }
    

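    execute() sends the GET request synchronously and returns the response, which we read inside the lambda. Wrapping the call in use { } closes the response when the lambda finishes.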
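The request above relies on OkHttp's defaults. The sketch below shows one possible way to set explicit timeouts, send a descriptive User-Agent header, and return the body only when the status is successful. The helper names buildClient, buildRequest and fetchHtml, the timeout values, and the User-Agent string are all assumptions made for illustration.

```kotlin
import okhttp3.OkHttpClient
import okhttp3.Request
import java.util.concurrent.TimeUnit

// Hypothetical helper: a client with explicit timeouts (values are illustrative).
fun buildClient(): OkHttpClient =
    OkHttpClient.Builder()
        .connectTimeout(10, TimeUnit.SECONDS)
        .readTimeout(15, TimeUnit.SECONDS)
        .build()

// Hypothetical helper: a GET request with a placeholder User-Agent string.
fun buildRequest(url: String): Request =
    Request.Builder()
        .url(url)
        .header("User-Agent", "hn-kotlin-tutorial/0.1")
        .build()

// Returns the body text for a 2xx response, or null for any other status.
// Network failures still surface as an IOException thrown by execute().
fun fetchHtml(client: OkHttpClient, url: String): String? =
    client.newCall(buildRequest(url)).execute().use { response ->
        if (response.isSuccessful) response.body?.string() else null
    }

fun main() {
    val html = fetchHtml(buildClient(), "https://news.ycombinator.com/")
    println(if (html != null) "Fetched ${html.length} characters" else "Request failed")
}
```

Returning null for a non-2xx status keeps the parsing code from running on an error page.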

    Parsing the Page with Jsoup

    Inside the response handler, we first parse the HTML:

    val html = response.body!!.string()
    val document = Jsoup.parse(html)
    

    This loads up a Jsoup Document object we can now query to extract data.

    Scraping Rows with Selectors

    Inspecting the page

    If you inspect the page with your browser's developer tools, you will notice that each story is housed inside a tr tag with the class athing.

    Jsoup uses CSS-style selectors to find elements. Let's get all rows from the table:

    val rows = document.select("tr")
    

    Matches:

    <tr>...</tr>
    <tr class="athing">...</tr>
    ...
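As an alternative to walking every tr, a selector can target the article rows directly and then step to the row that follows each one. This is a minimal sketch, assuming each story's title sits in its tr.athing row and its score sits in the row immediately after. The sample markup and the helper name titlesWithScores are made up for illustration and are not the live page.

```kotlin
import org.jsoup.Jsoup

// Trimmed-down sample markup modelled on the listing structure (an assumption).
val sampleHtml = """
<table>
  <tr class="athing" id="1"><td><span class="titleline"><a href="https://example.com/a">First story</a></span></td></tr>
  <tr><td class="subtext"><span class="score">10 points</span></td></tr>
  <tr class="spacer" style="height:5px"></tr>
  <tr class="athing" id="2"><td><span class="titleline"><a href="item?id=2">Second story</a></span></td></tr>
  <tr><td class="subtext"><span class="score">5 points</span></td></tr>
</table>
""".trimIndent()

// Pairs each article title with the score text found in the following row.
fun titlesWithScores(html: String): List<Pair<String, String>> {
    val document = Jsoup.parse(html)
    return document.select("tr.athing").map { articleRow ->
        val title = articleRow.selectFirst("span.titleline a")?.text().orEmpty()
        val detailsRow = articleRow.nextElementSibling()
        val score = detailsRow?.selectFirst("span.score")?.text() ?: "0"
        title to score
    }
}

fun main() {
    titlesWithScores(sampleHtml).forEach { (title, score) -> println("$title | $score") }
}
```

This variant needs no state variables, at the cost of relying on the two rows being direct siblings.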
    

    We iterate over the rows, keeping track of the most recent article row and the type of the row we just saw:

    var currentArticle: Element? = null
    var currentRowType = ""

    for (row in rows) {

       // Check type of row
       // Extract data
       // Update current* variables

    }


    Getting Article Rows

    We first check if a row has the "athing" class - this denotes an article listing:

    if (row.hasClass("athing")) {

        currentArticle = row
        currentRowType = "article"

    }


    We keep a reference to the article row itself, because it holds the title and URL. Element is Jsoup's org.jsoup.nodes.Element class, so add import org.jsoup.nodes.Element to the imports.

    Scraping Article Details

    The article row contains the title and URL. The row that follows it contains the remaining details: points, author, timestamp and comment count. We handle that second row here:

    } else if (currentRowType == "article") {

        // Extract article details here

        // Reset current* variables
        currentArticle = null
        currentRowType = ""

    }


    Inside here we use other selectors to extract specific parts:

    Title and URL

    The title link lives in the saved article row, not in the details row, so we query currentArticle. The direct-child selector picks only the title link and skips any links nested deeper inside the span, such as a source-domain link after the title:

    val titleLink = currentArticle?.selectFirst("span.titleline > a")

    val articleTitle = titleLink?.text() ?: ""
    val articleUrl = titleLink?.attr("href") ?: ""
    

    Matches:

    <span class="titleline">
      <a href="item?id=37497275">Ask HN: Am I the only one still using Vim?</a>
    </span>
    

    We specifically get the link text and href attribute.
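Links such as item?id=37497275 are relative to the site root. If absolute URLs are needed, one option is to give Jsoup a base URI when parsing and read the link with absUrl. This is a small sketch. The helper name absoluteTitleUrls and the second sample link are assumptions for illustration.

```kotlin
import org.jsoup.Jsoup

// Resolves every title link against baseUri and returns the absolute URLs.
fun absoluteTitleUrls(html: String, baseUri: String): List<String> {
    // The second argument tells Jsoup which URL the markup came from.
    val document = Jsoup.parse(html, baseUri)
    return document.select("span.titleline > a").map { it.absUrl("href") }
}

fun main() {
    val html = """
        <span class="titleline"><a href="item?id=37497275">Ask HN: Am I the only one still using Vim?</a></span>
        <span class="titleline"><a href="https://example.com/post">External story</a></span>
    """.trimIndent()
    absoluteTitleUrls(html, "https://news.ycombinator.com/").forEach(::println)
}
```

Links that are already absolute pass through unchanged, so the same call works for both kinds of story.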

    Other Details

    Similarly, we fetch points, author, timestamp, comments:

    val subtext = row.selectFirst("td.subtext")
    
    val points = subtext?.selectFirst("span.score")?.text() ?: "0"
    val author = subtext?.selectFirst("a.hnuser")?.text() ?: ""
    val timestamp = subtext?.selectFirst("span.age")?.attr("title") ?: ""
    
    val commentsElem = subtext?.selectFirst("a:contains(comment)") // matches "1 comment" as well as "N comments"
    val comments = commentsElem?.text() ?: "0"
    

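    Each value is looked up inside the td.subtext element of the details row. The safe-call operator (?.) and the Elvis operator (?:) supply a default whenever an element is missing. For example, a story without comments shows a "discuss" link instead, so its comment count falls back to "0".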
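The points and comments values are still text such as "123 points". If numbers are more useful, a small helper can pull out the leading digits. This is a sketch using only the Kotlin standard library. The name firstInt is an assumption for illustration.

```kotlin
// Returns the first run of digits in the text as an Int, or the fallback when none is found.
fun firstInt(text: String, fallback: Int = 0): Int =
    Regex("""\d+""").find(text)?.value?.toIntOrNull() ?: fallback

fun main() {
    println(firstInt("123 points"))       // 123
    println(firstInt("45\u00A0comments")) // 45 (the separator here is a non-breaking space)
    println(firstInt("discuss"))          // 0
}
```

Matching on digits rather than splitting on a space keeps the helper working whichever kind of whitespace separates the number from the label.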

    Printing Extracted Article Details

    Finally, we print all extracted details:

    println("Title: $articleTitle")
    println("URL: $articleUrl")
    println("Points: $points")
    println("Author: $author")
    println("Timestamp: $timestamp")
    println("Comments: $comments")
    
    println("-".repeat(50)) // Separator
    

    This outputs each article's details!
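Printing is fine for a first run, but holding each listing in a value makes it easier to sort, filter or export later. One possible shape is sketched below. The names Article and formatArticle are hypothetical, and the fields simply mirror the variables printed above.

```kotlin
// Hypothetical container for one listing; every field stays a String, as in the tutorial.
data class Article(
    val title: String,
    val url: String,
    val points: String,
    val author: String,
    val timestamp: String,
    val comments: String,
)

// Renders one article in the same layout as the println calls above.
fun formatArticle(article: Article): String = buildString {
    appendLine("Title: ${article.title}")
    appendLine("URL: ${article.url}")
    appendLine("Points: ${article.points}")
    appendLine("Author: ${article.author}")
    appendLine("Timestamp: ${article.timestamp}")
    appendLine("Comments: ${article.comments}")
    append("-".repeat(50)) // Separator
}

fun main() {
    val sample = Article(
        title = "Ask HN: Am I the only one still using Vim?",
        url = "item?id=37497275",
        points = "0",
        author = "",
        timestamp = "",
        comments = "0",
    )
    println(formatArticle(sample))
}
```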

    The full code to scrape Hacker News in Kotlin is below. Hopefully this gives a good look at using real-world libraries like OkHttp and Jsoup, along with CSS selectors, to extract content from websites.

    import okhttp3.OkHttpClient
    import okhttp3.Request
    import org.jsoup.Jsoup
    import org.jsoup.nodes.Element

    fun main() {
        // Define the URL of the Hacker News homepage
        val url = "https://news.ycombinator.com/"

        // Create an OkHttpClient instance
        val client = OkHttpClient()

        // Create a GET request
        val request = Request.Builder()
            .url(url)
            .build()

        // Send the GET request and handle the response
        client.newCall(request).execute().use { response ->
            if (response.isSuccessful) {
                // Parse the HTML content of the page using Jsoup
                val html = response.body!!.string()
                val document = Jsoup.parse(html)

                // Find all rows in the table
                val rows = document.select("tr")

                // Track the most recent article row and the type of the previous row
                var currentArticle: Element? = null
                var currentRowType = ""

                // Iterate through the rows to scrape articles
                for (row in rows) {
                    if (row.hasClass("athing")) {
                        // This is an article row; it holds the title and URL
                        currentArticle = row
                        currentRowType = "article"
                    } else if (currentRowType == "article") {
                        // This is the details row that follows an article row
                        val titleLink = currentArticle?.selectFirst("span.titleline > a")
                        if (titleLink != null) {
                            // Title and URL come from the saved article row
                            val articleTitle = titleLink.text()
                            val articleUrl = titleLink.attr("href")

                            // The remaining details come from the current (details) row
                            val subtext = row.selectFirst("td.subtext")
                            val points = subtext?.selectFirst("span.score")?.text() ?: "0"
                            val author = subtext?.selectFirst("a.hnuser")?.text() ?: ""
                            val timestamp = subtext?.selectFirst("span.age")?.attr("title") ?: ""
                            val commentsElem = subtext?.selectFirst("a:contains(comment)")
                            val comments = commentsElem?.text() ?: "0"

                            // Print the extracted information
                            println("Title: $articleTitle")
                            println("URL: $articleUrl")
                            println("Points: $points")
                            println("Author: $author")
                            println("Timestamp: $timestamp")
                            println("Comments: $comments")
                            println("-".repeat(50))  // Separating articles
                        }

                        // Reset the current article and row type
                        currentArticle = null
                        currentRowType = ""
                    }
                    // Any other row, such as the 5px spacer row, needs no handling
                }
            } else {
                println("Failed to retrieve the page. Status code: ${response.code}")
            }
        }
    }

    This is great as a learning exercise, but a scraper like this sends every request from a single IP address, so it is prone to getting blocked. If you need to handle thousands of fetches every day, a professional rotating proxy service that rotates IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • Millions of high speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language, like Node or PHP, or with any framework, like Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
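The same call can be made from Kotlin. The sketch below only builds the request URL with OkHttp's HttpUrl.Builder, which percent-encodes the target URL when it is added as a query parameter. The parameter names key, render and url are taken from the curl command. The helper name proxiedUrl is an assumption, and API_KEY remains a placeholder.

```kotlin
import okhttp3.HttpUrl

// Builds the endpoint URL; addQueryParameter encodes the target URL for us.
fun proxiedUrl(apiKey: String, targetUrl: String, render: Boolean = true): HttpUrl =
    HttpUrl.Builder()
        .scheme("http")
        .host("api.proxiesapi.com")
        .addQueryParameter("key", apiKey)
        .addQueryParameter("render", render.toString())
        .addQueryParameter("url", targetUrl)
        .build()

fun main() {
    // API_KEY is a placeholder, exactly as in the curl example.
    println(proxiedUrl("API_KEY", "https://news.ycombinator.com/"))
}
```

The resulting HttpUrl can be passed to Request.Builder().url(...) in place of the direct Hacker News URL.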
    
    

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
