Scraping Multiple Pages in Kotlin with HttpURLConnection and Jsoup

Oct 15, 2023 · 4 min read

Web scraping lets you programmatically extract data from websites. Often you need to scrape multiple pages of a site to gather complete information. In this article, we'll see how to scrape multiple pages in Kotlin using the JDK's built-in HttpURLConnection for fetching and the Jsoup library for HTML parsing.

Prerequisites

To follow along, you'll need:

  • Basic Kotlin knowledge
  • Kotlin installed
  • The Jsoup library added as a dependency:
  • implementation("org.jsoup:jsoup:1.16.1")
    

Import Libraries

We'll need the following imports:

```kotlin
import org.jsoup.Jsoup
import java.net.HttpURLConnection
import java.net.URL
```

Define Base URL

We'll scrape a blog, https://copyblogger.com/blog/. The page URLs follow a pattern:

```
https://copyblogger.com/blog/
https://copyblogger.com/blog/page/2/
https://copyblogger.com/blog/page/3/
```

Let's define the base URL pattern:

```kotlin
val baseUrl = "https://copyblogger.com/blog/page/%d/"
```

The %d placeholder allows us to insert the page number.
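To see how the pattern expands, here is a minimal standalone sketch of the URL construction (the page range of 1..3 is arbitrary, chosen just for demonstration):

```kotlin
fun main() {
    val baseUrl = "https://copyblogger.com/blog/page/%d/"
    // format() substitutes the page number for the %d placeholder
    for (page in 1..3) {
        println(baseUrl.format(page))
    }
    // Prints:
    // https://copyblogger.com/blog/page/1/
    // https://copyblogger.com/blog/page/2/
    // https://copyblogger.com/blog/page/3/
}
```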

Specify Number of Pages

Next, we'll specify how many pages to scrape. Let's scrape the first 5 pages:

```kotlin
val numPages = 5
```

Loop Through Pages

We can now loop from 1 to numPages and construct the URL for each page:

```kotlin
for (page in 1..numPages) {
    // Construct page URL
    val url = baseUrl.format(page)

    // Code to scrape each page
}
```

Send Request and Parse HTML

Inside the loop, we'll send a GET request and parse the HTML with Jsoup:

```kotlin
val connection = URL(url).openConnection() as HttpURLConnection
connection.inputStream.bufferedReader().use { reader ->
    val doc = reader.readText()
    val parsed = Jsoup.parse(doc)
}
```

This gives us a parsed HTML document to extract data from.
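The snippet above assumes every request succeeds. A slightly more defensive sketch checks the status code before reading the body, so a missing page or a rate limit doesn't crash the whole loop (the `fetchPage` helper name and the User-Agent value are my own illustrative choices, not part of the original example):

```kotlin
import java.net.HttpURLConnection
import java.net.URL

// Returns the page body, or null if the server didn't answer with 200 OK.
fun fetchPage(url: String): String? {
    val connection = URL(url).openConnection() as HttpURLConnection
    connection.requestMethod = "GET"
    // Some sites reject the default Java user agent; sending a browser-like
    // one is a common (if not guaranteed) workaround.
    connection.setRequestProperty("User-Agent", "Mozilla/5.0")
    return if (connection.responseCode == HttpURLConnection.HTTP_OK) {
        connection.inputStream.bufferedReader().use { it.readText() }
    } else {
        null // skip pages that 404, 429, etc.
    }
}
```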

Extract Data

Now, within the loop, we can use CSS selectors to find and extract data from each page:

```kotlin
for (article in parsed.select("article")) {
    // Extract data from each article
    val title = article.select("h2.entry-title").text()
    val link = article.select("a.entry-title-link").attr("href")
    val author = article.select("div.post-author a").text()
}
```
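Selectors like these are easiest to verify against a small in-memory HTML fragment before pointing them at the live site. Here's a hedged sketch of that approach; the `Post` data class and the sample markup are illustrative, not taken from the real page:

```kotlin
import org.jsoup.Jsoup

// A simple holder for the fields we scrape from each article.
data class Post(val title: String, val url: String, val author: String)

fun main() {
    // Minimal fragment mimicking the structure the selectors expect
    val html = """
        <article>
          <h2 class="entry-title"><a class="entry-title-link" href="https://example.com/post">Hello</a></h2>
          <div class="post-author"><a>Jane Doe</a></div>
        </article>
    """
    val doc = Jsoup.parse(html)
    val posts = doc.select("article").map { article ->
        Post(
            title = article.select("h2.entry-title").text(),
            url = article.select("a.entry-title-link").attr("href"),
            author = article.select("div.post-author a").text(),
        )
    }
    println(posts.first())
    // Prints: Post(title=Hello, url=https://example.com/post, author=Jane Doe)
}
```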

Full Code

Our full code to scrape 5 pages, including the categories of each post, is:

```kotlin
import org.jsoup.Jsoup
import java.net.HttpURLConnection
import java.net.URL

fun main() {

    val baseUrl = "https://copyblogger.com/blog/page/%d/"
    val numPages = 5

    for (page in 1..numPages) {

        val url = baseUrl.format(page)

        val connection = URL(url).openConnection() as HttpURLConnection
        connection.inputStream.bufferedReader().use { reader ->

            val doc = reader.readText()
            val parsed = Jsoup.parse(doc)

            for (article in parsed.select("article")) {

                val title = article.select("h2.entry-title").text()
                val link = article.select("a.entry-title-link").attr("href")
                val author = article.select("div.post-author a").text()
                val categories = article.select("div.entry-categories a").map { it.text() }

                println("Title: $title")
                println("URL: $link")
                println("Author: $author")
                println("Categories: $categories")

            }

        }

    }

}
```

This allows us to scrape and extract data from multiple pages sequentially. The code can be extended to scrape any number of pages.
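One practical refinement when scraping pages sequentially is to pause between requests so you don't hammer the server. A minimal sketch, assuming a fixed one-second delay is acceptable (the duration is an arbitrary choice, not a documented requirement of the site):

```kotlin
fun main() {
    val baseUrl = "https://copyblogger.com/blog/page/%d/"
    val numPages = 5

    for (page in 1..numPages) {
        val url = baseUrl.format(page)
        println("Fetching $url")
        // ... fetch and parse the page here ...

        // Sleep between requests, but not after the last one
        if (page < numPages) Thread.sleep(1000)
    }
}
```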

Summary

  • Use a base URL pattern with a %d placeholder
  • Loop through pages with a for loop
  • Construct each page URL
  • Send a request and parse the HTML
  • Extract data using CSS selectors
  • Print or store the scraped data

Web scraping enables collecting large datasets programmatically. With the techniques here, you can scrape and extract information from multiple pages of a website in Kotlin.

While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

With the power of Proxies API combined with Kotlin libraries like Jsoup, you can scrape data at scale without getting blocked.
