Scraping Hacker News with Kotlin

Hacker News is a popular social news website in the technology and startup community. In this beginner tutorial, we will scrape key details from Hacker News article listings using Kotlin and print article titles, URLs, points, authors, timestamps and comment counts.

Imports

Let's look at the imports needed:

import okhttp3.OkHttpClient
import okhttp3.Request
import org.jsoup.Jsoup

OkHttpClient - Makes HTTP requests to fetch content

Request - Represents an HTTP request

Jsoup - Java library to parse and extract data from HTML

These dependencies allow us to retrieve the Hacker News page and parse its content.
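
To compile against these classes, both libraries need to be on the classpath. As a rough sketch, assuming a Gradle build that uses the Kotlin DSL, the dependency block could look like this. The version numbers are illustrative assumptions rather than something this tutorial pins, so check for current releases.

```kotlin
// build.gradle.kts (fragment) - version numbers are illustrative assumptions
dependencies {
    implementation("com.squareup.okhttp3:okhttp:4.12.0") // HTTP client
    implementation("org.jsoup:jsoup:1.17.2")             // HTML parser
}
```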

Fetching the Page

First we define the Hacker News homepage URL:

val url = "https://news.ycombinator.com/"

Next we create an OkHttpClient instance to make requests:

val client = OkHttpClient()

And build a simple GET request for the URL:

val request = Request.Builder()
    .url(url)
    .build()

Finally we execute the request and handle the response:

client.newCall(request).execute().use { response ->

    // Parse response here

}

This sends the GET request synchronously. The response is available inside the lambda, and use closes it once the block exits.
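
In practice it can help to identify the scraper and to bound how long a call may take. The sketch below is one way to do that, not part of the original walkthrough: buildRequest is a hypothetical helper, and the User-Agent string and the 15 second limit are assumed values.

```kotlin
import okhttp3.OkHttpClient
import okhttp3.Request
import java.util.concurrent.TimeUnit

// Hypothetical helper: builds a GET request that identifies the scraper.
fun buildRequest(url: String): Request =
    Request.Builder()
        .url(url)
        .header("User-Agent", "hn-kotlin-tutorial/0.1") // assumed value
        .build()

fun main() {
    // callTimeout puts an upper bound on the whole call (connect, send, read)
    val client = OkHttpClient.Builder()
        .callTimeout(15, TimeUnit.SECONDS) // assumed limit
        .build()

    client.newCall(buildRequest("https://news.ycombinator.com/")).execute().use { response ->
        // `use` closes the response (and its body) when this block exits
        println("HTTP ${response.code}, success = ${response.isSuccessful}")
    }
}
```

Checking isSuccessful before reading the body avoids parsing an error page as if it were the listing.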

Parsing the Page with Jsoup

Inside the response handler, we first parse the HTML:

val html = response.body!!.string()
val document = Jsoup.parse(html)

This loads up a Jsoup Document object we can now query to extract data.

Scraping Rows with Selectors

Inspecting the page shows that each item is housed inside a tr tag with the class athing.

Jsoup uses CSS-style selectors to find elements. Let's get all rows from the table:

val rows = document.select("tr")

Matches:

<tr>...</tr>
<tr class="athing">...</tr>
...
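
To see the difference between a broad selector and a narrower one, the sketch below runs both against a small made-up table that mimics the row pattern. The markup and the countRows helper are assumptions for illustration, not the live page.

```kotlin
import org.jsoup.Jsoup

// Made-up markup that mimics the listing table's row pattern.
val sampleHtml = """
    <table>
      <tr class="athing"><td><span class="titleline"><a href="https://example.com/">First</a></span></td></tr>
      <tr><td class="subtext"><span class="score">10 points</span></td></tr>
      <tr class="spacer" style="height:5px"></tr>
      <tr class="athing"><td><span class="titleline"><a href="item?id=1">Second</a></span></td></tr>
      <tr><td class="subtext"><span class="score">3 points</span></td></tr>
    </table>
""".trimIndent()

// Hypothetical helper: counts all rows and the rows carrying the athing class.
fun countRows(html: String): Pair<Int, Int> {
    val document = Jsoup.parse(html)
    val allRows = document.select("tr").size           // every table row
    val articleRows = document.select("tr.athing").size // article rows only
    return allRows to articleRows
}

fun main() {
    val (all, articles) = countRows(sampleHtml)
    println("all rows: $all, article rows: $articles")
}
```

Selecting every tr keeps the details rows in the loop, which is why the walkthrough needs to tell the row types apart.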

We iterate over the rows, keeping track of the current article row and the row type:

var currentArticle: Element? = null // org.jsoup.nodes.Element, imported separately
var currentRowType = ""

for (row in rows) {

   // Check type of row
   // Extract data
   // Update current* variables

}

Getting Article Rows

We first check if a row has the "athing" class - this denotes an article listing:

if (row.hasClass("athing")) {

    currentArticle = row
    currentRowType = "article"

}

We keep a reference to the article row because it holds the title and URL, which we read once we reach the row that follows.

Scraping Article Details

The row after an article row contains the remaining details: points, author, timestamp and comment count. We handle this case:

} else if (currentRowType == "article") {

    // Extract article details here

    // Reset current* variables
    currentArticle = null
    currentRowType = ""

}

Inside here we use other selectors to extract specific parts:

Title and URL

The title link lives in the saved article row, not in the details row, so we query currentArticle:

val titleElem = currentArticle?.selectFirst("span.titleline")

if (titleElem != null) {

    // Take the first link only; a later link may point to the source site
    val titleLink = titleElem.selectFirst("a")
    val articleTitle = titleLink?.text() ?: ""
    val articleUrl = titleLink?.attr("href") ?: ""

}

Matches:

<span class="titleline">
  <a href="item?id=37497275">Ask HN: Am I the only one still using Vim?</a>
</span>

We specifically get the link text and href attribute.
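
The href in the sample above is relative (item?id=...). If absolute links are wanted, one option is to give Jsoup a base URI when parsing and read the link with absUrl. This is a minimal sketch; absoluteTitleUrl is a hypothetical helper, and the markup is the sample fragment from above rather than a full page.

```kotlin
import org.jsoup.Jsoup

// Hypothetical helper: returns the title link's absolute URL, or "" if there is none.
fun absoluteTitleUrl(html: String, baseUrl: String): String {
    // The second argument is the base URI Jsoup uses to resolve relative links
    val document = Jsoup.parse(html, baseUrl)
    val link = document.selectFirst("span.titleline a")
    return link?.absUrl("href") ?: ""
}

fun main() {
    val html = """<span class="titleline"><a href="item?id=37497275">Ask HN: Am I the only one still using Vim?</a></span>"""
    println(absoluteTitleUrl(html, "https://news.ycombinator.com/"))
}
```

Links that are already absolute pass through absUrl unchanged, so the same call works for both kinds.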

Other Details

Similarly, we fetch points, author, timestamp, comments:

val subtext = row.selectFirst("td.subtext")

val points = subtext?.selectFirst("span.score")?.text() ?: "0"
val author = subtext?.selectFirst("a.hnuser")?.text() ?: ""
val timestamp = subtext?.selectFirst("span.age")?.attr("title") ?: ""

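// Match "comment" so that both "1 comment" and "12 comments" are found
val commentsElem = subtext?.selectFirst("a:contains(comment)")
val comments = commentsElem?.text() ?: "0"

All of these values come from the td.subtext element and the elements nested inside it. The safe-call and Elvis operators supply a default whenever an element is missing.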
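
The points and comments values above are still text, such as "123 points". If numbers are needed for sorting or filtering, a small helper can pull out the digits. This is a sketch using only the Kotlin standard library; firstNumber is a hypothetical name, and falling back to 0 for labels without digits is an assumption.

```kotlin
// Hypothetical helper: returns the first run of digits in a label, or 0 if none.
fun firstNumber(label: String): Int =
    Regex("\\d+").find(label)?.value?.toIntOrNull() ?: 0

fun main() {
    println(firstNumber("123 points"))  // 123
    println(firstNumber("45 comments")) // 45
    println(firstNumber("discuss"))     // 0, no digits in the label
}
```

Searching for digits instead of splitting on a space keeps the helper working even if the label uses a non-breaking space.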

Printing Extracted Article Details

Finally, we print all extracted details:

println("Title: $articleTitle")
println("URL: $articleUrl")
println("Points: $points")
println("Author: $author")
println("Timestamp: $timestamp")
println("Comments: $comments")

println("-".repeat(50)) // Separator

This outputs each article's details!
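
Printing is fine for a demo, but the values are lost afterwards. One option, sketched here as an assumption rather than as part of the original script, is to collect each listing into a small data class and format it on demand. Article and format are hypothetical names, and the sample values are made up.

```kotlin
// Hypothetical container for one listing; fields mirror the variables above.
data class Article(
    val title: String,
    val url: String,
    val points: String,
    val author: String,
    val timestamp: String,
    val comments: String,
)

// Builds the same lines the println calls produce, joined into one string.
fun Article.format(): String = listOf(
    "Title: $title",
    "URL: $url",
    "Points: $points",
    "Author: $author",
    "Timestamp: $timestamp",
    "Comments: $comments",
    "-".repeat(50), // separator
).joinToString("\n")

fun main() {
    val articles = mutableListOf<Article>()
    // Made-up values standing in for one scraped listing
    articles += Article("Example title", "https://example.com/", "10 points", "someuser", "", "3 comments")
    articles.forEach { println(it.format()) }
}
```

With the listings in a list, later steps such as sorting or writing to a file can reuse the same data.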

The full code to scrape Hacker News in Kotlin is below. Hopefully this gives a good look at using real-world libraries like OkHttp and Jsoup, along with CSS selectors, to extract content from websites.

import okhttp3.OkHttpClient
import okhttp3.Request
import org.jsoup.Jsoup
import org.jsoup.nodes.Element

fun main() {
    // Define the URL of the Hacker News homepage
    val url = "https://news.ycombinator.com/"

    // Create an OkHttpClient instance
    val client = OkHttpClient()

    // Create a GET request
    val request = Request.Builder()
        .url(url)
        .build()

    // Send the GET request and handle the response
    client.newCall(request).execute().use { response ->
        if (response.isSuccessful) {
            // Parse the HTML content of the page using Jsoup
            val html = response.body!!.string()
            val document = Jsoup.parse(html)

            // Find all rows in the table
            val rows = document.select("tr")

            // Keep track of the most recent article row and the row type
            var currentArticle: Element? = null
            var currentRowType = ""

            // Iterate through the rows to scrape articles
            for (row in rows) {
                if (row.hasClass("athing")) {
                    // This is an article row; it holds the title and URL
                    currentArticle = row
                    currentRowType = "article"
                } else if (currentRowType == "article") {
                    // This is the details row that follows an article row
                    val titleElem = currentArticle?.selectFirst("span.titleline")
                    if (titleElem != null) {
                        // Take the first link only; a later link may point to the source site
                        val titleLink = titleElem.selectFirst("a")
                        val articleTitle = titleLink?.text() ?: ""
                        val articleUrl = titleLink?.attr("href") ?: ""

                        // Points, author, timestamp and comments come from the details row
                        val subtext = row.selectFirst("td.subtext")
                        val points = subtext?.selectFirst("span.score")?.text() ?: "0"
                        val author = subtext?.selectFirst("a.hnuser")?.text() ?: ""
                        val timestamp = subtext?.selectFirst("span.age")?.attr("title") ?: ""
                        // Match "comment" so that both "1 comment" and "12 comments" are found
                        val commentsElem = subtext?.selectFirst("a:contains(comment)")
                        val comments = commentsElem?.text() ?: "0"

                        // Print the extracted information
                        println("Title: $articleTitle")
                        println("URL: $articleUrl")
                        println("Points: $points")
                        println("Author: $author")
                        println("Timestamp: $timestamp")
                        println("Comments: $comments")
                        println("-".repeat(50))  // Separating articles
                    }

                    // Reset the current article and row type
                    currentArticle = null
                    currentRowType = ""
                }
                // Any other row (such as a spacer row) needs no handling
            }
        } else {
            println("Failed to retrieve the page. Status code: ${response.code}")
        }
    }
}

This is great as a learning exercise, but a scraper like this sends every request from a single IP address, so it is easy to get blocked. If you need to handle thousands of fetches every day, a professional rotating proxy service that rotates IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve IP blocking problems instantly, with:

Millions of high-speed rotating proxies located all over the world

Automatic IP rotation

Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)

Automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API, like the one below, from any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language such as Node or PHP, or with any framework such as Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
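
The same call can be made from the Kotlin scraper by building the API URL with OkHttp's HttpUrl builder. This is a hedged sketch: proxiedUrl is a hypothetical helper, the key, render and url parameters are taken from the curl example, and it assumes the service accepts a percent-encoded target URL.

```kotlin
import okhttp3.HttpUrl
import okhttp3.HttpUrl.Companion.toHttpUrl
import okhttp3.Request

// Hypothetical helper: wraps a target URL in the API call from the curl example.
fun proxiedUrl(apiKey: String, target: String): HttpUrl =
    "http://api.proxiesapi.com/".toHttpUrl().newBuilder()
        .addQueryParameter("key", apiKey)
        .addQueryParameter("render", "true")
        .addQueryParameter("url", target) // the builder percent-encodes the value
        .build()

fun main() {
    // API_KEY is a placeholder, as in the curl example
    val url = proxiedUrl("API_KEY", "https://news.ycombinator.com/")
    val request = Request.Builder().url(url).build()
    println(request.url)
}
```

The request built here can be passed to client.newCall in place of the direct request, leaving the parsing code unchanged.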

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
