Scraping Reddit Posts in Kotlin

In this beginner-friendly tutorial, we scrape information from Reddit posts with a simple Kotlin script. We send a request to Reddit, download the HTML content, parse it, and extract key data such as each post's title, author, and score.

The page we are talking about is the Reddit front page at https://www.reddit.com.

Importing Libraries

We need two external libraries for this script:

khttp - To send HTTP requests to the Reddit URL

Jsoup - To parse and process the HTML content

import khttp.get
import org.jsoup.Jsoup

No need to understand these libraries in depth right now. Just know that khttp gets web content and Jsoup processes HTML.

Sending Request

We define the Reddit URL and a User-Agent header:

val redditUrl = "https://www.reddit.com"

val headers = mapOf(
   "User-Agent" to "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
)

The User-Agent header makes our request look like it comes from a regular browser. Some sites reject requests that arrive without one.

Then we send a GET request and store the response:

val response = get(redditUrl, headers = headers)

This retrieves the HTML content of the Reddit front page.
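If you would rather not add a dependency for the request step, the same GET request can be sketched with the JDK's built-in java.net.http client (Java 11 or later). This is an alternative to khttp, not what the rest of this tutorial uses. The helper name buildRedditRequest and the 30-second timeout are assumptions made for illustration.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.time.Duration

// Hypothetical helper: builds a GET request that carries the given User-Agent.
fun buildRedditRequest(url: String, userAgent: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(url))
        .header("User-Agent", userAgent)
        .timeout(Duration.ofSeconds(30)) // assumed timeout, adjust as needed
        .GET()
        .build()

fun main() {
    val request = buildRedditRequest(
        "https://www.reddit.com",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    )

    // Follow ordinary redirects, then read the whole body as a String.
    val client = HttpClient.newBuilder()
        .followRedirects(HttpClient.Redirect.NORMAL)
        .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())

    println("Status: ${response.statusCode()}")
    println("Body length: ${response.body().length}")
}
```

Building the request in its own function keeps the network call separate, so the headers can be inspected without contacting Reddit.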

Saving HTML Content

We check if our request was successful:

if (response.statusCode == 200) {

   // process response

} else {

   // request failed
}

Status code 200 means our GET request succeeded. We save the HTML text to a file:

val htmlContent = response.text

val filename = "reddit_page.html"

java.io.File(filename).writeText(htmlContent, Charsets.UTF_8)

The HTML of the Reddit front page is now saved locally for further processing.
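The status check and the file write can be combined into one small function. The following is a minimal sketch that uses only the Kotlin standard library; saveIfSuccessful is a hypothetical helper name, and the status code and body are passed in as plain values so the logic can be tried without a network call.

```kotlin
import java.io.File

// Hypothetical helper: writes the body to disk only for a 200 response.
// Returns true if the file was written, false otherwise.
fun saveIfSuccessful(statusCode: Int, body: String, target: File): Boolean {
    if (statusCode != 200) return false
    target.writeText(body, Charsets.UTF_8)
    return true
}

fun main() {
    val target = File("reddit_page.html")
    // Placeholder body standing in for response.text.
    val saved = saveIfSuccessful(200, "<html><body>placeholder</body></html>", target)
    if (saved) {
        println("Saved ${target.length()} bytes to ${target.name}")
    } else {
        println("Request failed, nothing saved")
    }
}
```

In the tutorial's script, the first two arguments would be response.statusCode and response.text.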

Parsing HTML

To extract information, we need to parse the HTML content. Jsoup helps parse and traverse HTML documents:

val document = Jsoup.parse(htmlContent)

We now have a parsed representation of the entire Reddit front page HTML.
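To see what parsing gives us, here is a small sketch that parses an HTML string and reads its title, plus a variant that re-parses the file saved earlier. The sample HTML and the function names pageTitle and parseSavedPage are made up for illustration.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import java.io.File

// Parses an HTML string and returns the text of its <title> element.
fun pageTitle(html: String): String = Jsoup.parse(html).title()

// Re-parses a previously saved page from disk, reading it as UTF-8.
fun parseSavedPage(file: File): Document = Jsoup.parse(file, "UTF-8")

fun main() {
    val html = "<html><head><title>Example page</title></head><body><p>Hello</p></body></html>"
    println(pageTitle(html))

    val saved = File("reddit_page.html")
    if (saved.exists()) {
        println(parseSavedPage(saved).title())
    }
}
```

Parsing from the saved file lets you experiment with selectors repeatedly without sending a new request each time.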

Extracting Data

This is where we actually scrape information from the Reddit posts. We use selectors to find elements and extract data.

Understanding Selectors

Selectors let us query elements in the HTML document like a database. Some examples:

/* Tag and class */
div.post

/* Nested tags */
article > div.post-title

/* Attributes */
a[href^='/r/']

We use CSS-style selectors to target specific elements on the page. Here's how it works:

Tags and Classes

Select div tags with class post:

div.post

Matches an element such as <div class="post">...</div>

Nesting

Select div tags inside article tags:

article > div

Matches the inner div in <article><div>Hello</div></article>

Attributes

Anchor tags a with href starting with /r/:

a[href^='/r/']

Matches a link such as <a href="/r/kotlin">...</a>

This lets us precisely target elements to extract data from.
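The three selector styles above can be tried against a small document with Jsoup. This is a sketch on made-up HTML; sampleHtml and countMatches are hypothetical names introduced for the example.

```kotlin
import org.jsoup.Jsoup

// Made-up HTML used only to demonstrate the three selector styles.
val sampleHtml = """
    <article>
      <div class="post">First</div>
      <div class="post-title">Hello</div>
    </article>
    <div class="post">Second</div>
    <a href="/r/kotlin">r/kotlin</a>
    <a href="https://example.com">elsewhere</a>
""".trimIndent()

// Returns how many elements in the document match the selector.
fun countMatches(html: String, selector: String): Int =
    Jsoup.parse(html).select(selector).size

fun main() {
    for (selector in listOf("div.post", "article > div", "a[href^='/r/']")) {
        println("$selector -> ${countMatches(sampleHtml, selector)} match(es)")
    }
}
```

Note that div.post matches only elements whose class list contains exactly post, so the post-title div is not counted by that selector.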

Selecting Reddit Post Blocks

Inspecting the elements

Upon inspecting the HTML in Chrome, you will see that each post is a shreddit-post element with its own set of class descriptors.

In our code, we select Reddit posts by the shreddit-post tag name:

val blocks = document.select("shreddit-post")

In the page source, each shreddit-post element also carries a long list of classes. Let's break them down:

shreddit-post - The custom element (tag) used for post blocks, not a class

block relative - Styling classes applied to posts

bg-neutral-background - More styling classes for background

focus-within hover - Variants applied on focus/hover

xs:rounded - Rounded corners on small screens

p-md my-2xs - Padding and margin classes

nd:visible - Visibility class

So in simple terms, we are selecting post blocks by the shreddit-post tag. The other classes only describe styling. Several of them contain characters such as : and [ ], which a CSS selector parser reads as the start of a pseudo-selector or an attribute selector. Chaining them onto the selector (for example shreddit-post.block.relative.hover:bg-neutral-background-hover) is therefore likely to fail to parse in Jsoup, and it would also break whenever Reddit changes its styling.

More specific selectors let us home in on the exact set of elements we want. We could also use IDs or other attributes to target elements.
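Because the class names on shreddit-post contain characters like : and [ ] that have special meaning in selector syntax, one cautious approach is to select by tag name only and, if a class really matters, check it in Kotlin code with hasClass. A minimal sketch on made-up markup follows; selectPosts is a hypothetical helper, not part of Jsoup.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Element

// Selects posts by tag name, then optionally keeps only those carrying a given class.
// hasClass compares the raw class name, so characters like ':' need no escaping.
fun selectPosts(html: String, requiredClass: String? = null): List<Element> {
    val posts = Jsoup.parse(html).select("shreddit-post")
    return if (requiredClass == null) {
        posts.toList()
    } else {
        posts.filter { it.hasClass(requiredClass) }
    }
}

fun main() {
    // Made-up markup that mimics the structure of a Reddit post element.
    val html = """
        <shreddit-post class="block relative hover:bg-neutral-background-hover" permalink="/r/a/1"></shreddit-post>
        <shreddit-post class="block" permalink="/r/b/2"></shreddit-post>
    """.trimIndent()

    println("All posts: ${selectPosts(html).size}")
    println("With hover class: ${selectPosts(html, "hover:bg-neutral-background-hover").size}")
}
```

Keeping the selector to the tag name means the scraper depends on one stable detail of the page rather than on a dozen styling classes.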

Extracting Post Data

Inside the selected post blocks, we can extract information:

for (block in blocks) {

  val permalink = block.attr("permalink")

  val contentHref = block.attr("content-href")

  // extract other attributes..

}

The attr() method returns the value of an attribute on the element. For example, permalink contains the post URL and author contains the Reddit username.

Some key values we are extracting:

permalink - Post URL
contentHref - URL of the content the post links to (the content-href attribute)
commentCount - Number of comments (the comment-count attribute)
postTitle - Title of the post (read from the div[slot=title] child element, not an attribute)
author - Username of poster
score - Upvote count
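The values listed above can also be collected into a typed record instead of loose variables. The sketch below assumes the same attribute names used in this tutorial; RedditPost, toRedditPost, and parsePosts are hypothetical names, and the sample markup is made up.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Element

// Hypothetical container for the fields extracted in this tutorial.
data class RedditPost(
    val permalink: String,
    val contentHref: String,
    val commentCount: Int?,
    val title: String,
    val author: String,
    val score: Int?
)

// attr() returns an empty string when the attribute is absent,
// so toIntOrNull() turns missing or non-numeric values into null.
fun Element.toRedditPost() = RedditPost(
    permalink = attr("permalink"),
    contentHref = attr("content-href"),
    commentCount = attr("comment-count").toIntOrNull(),
    title = select("div[slot=title]").text().trim(),
    author = attr("author"),
    score = attr("score").toIntOrNull()
)

// Parses the HTML and converts every shreddit-post element into a RedditPost.
fun parsePosts(html: String): List<RedditPost> =
    Jsoup.parse(html).select("shreddit-post").map { it.toRedditPost() }

fun main() {
    val html = """
        <shreddit-post permalink="/r/kotlin/comments/abc/example/" content-href="https://example.com"
                       comment-count="12" author="some_user" score="345">
          <div slot="title"> Example post </div>
        </shreddit-post>
    """.trimIndent()
    parsePosts(html).forEach(::println)
}
```

Using nullable Int fields for the counts makes a missing attribute visible as null rather than as an empty string that has to be checked later.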

And that's it! We have extracted the data we wanted from Reddit posts. The output prints this information for each post.

The full code again:

import khttp.get
import org.jsoup.Jsoup

fun main() {
    // Define the Reddit URL you want to download
    val redditUrl = "https://www.reddit.com"

    // Define a User-Agent header
    val headers = mapOf(
        "User-Agent" to "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    )

    // Send a GET request to the URL with the User-Agent header
    val response = get(redditUrl, headers = headers)

    // Check if the request was successful (status code 200)
    if (response.statusCode == 200) {
        // Get the HTML content of the page
        val htmlContent = response.text

        // Specify the filename to save the HTML content
        val filename = "reddit_page.html"

        // Save the HTML content to a file
        java.io.File(filename).writeText(htmlContent, Charsets.UTF_8)

        println("Reddit page saved to $filename")

        // Parse the HTML content
        val document = Jsoup.parse(htmlContent)

        // Find all post blocks by tag name. The styling classes on these elements
        // contain ':' and '[' characters that a CSS selector would misparse.
        val blocks = document.select("shreddit-post")

        // Iterate through the blocks and extract information from each one
        for (block in blocks) {
            val permalink = block.attr("permalink")
            val contentHref = block.attr("content-href")
            val commentCount = block.attr("comment-count")
            val postTitle = block.select("div[slot=title]").text().trim()
            val author = block.attr("author")
            val score = block.attr("score")

            // Print the extracted information for each block
            println("Permalink: $permalink")
            println("Content Href: $contentHref")
            println("Comment Count: $commentCount")
            println("Post Title: $postTitle")
            println("Author: $author")
            println("Score: $score")
            println()
        }
    } else {
        println("Failed to download Reddit page (status code ${response.statusCode})")
    }
}

Real-world scrapers can get more complex once they have to handle JavaScript, cookies, and similar concerns, but this script shows the basic concepts: sending requests, parsing HTML, and using selectors to extract data.
