Scraping All Images from a Website with Kotlin

Dec 13, 2023 · 7 min read

This article is a practical, step-by-step guide to scraping all the images from a website using real Kotlin code. We will focus on explaining the extraction code itself rather than covering web scraping basics or the Kotlin language.

This is the page we will be working with: the Wikimedia Commons list of dog breeds, from which we will scrape each breed's photograph.

Importing Libraries

We first import the libraries needed to send HTTP requests and parse HTML:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element

Jsoup will be used to connect to the web page, send a request, and parse the HTML document.
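
jsoup is the only third-party dependency the project needs. If you build with Gradle, a minimal declaration in build.gradle.kts might look like the sketch below; the version number is an assumption, so use whatever recent jsoup release is current:

dependencies {
    // jsoup handles both the HTTP request and the HTML parsing
    implementation("org.jsoup:jsoup:1.17.1")
}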

Defining Key Variables

Next we define the URL of the Wikimedia Commons page we want to scrape:

val url = "<https://commons.wikimedia.org/wiki/List_of_dog_breeds>"

We also define a user agent header to simulate a real browser request:

val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

Sending Request and Parsing HTML

We use Jsoup to connect to the URL and send a GET request with the user agent specified:

val doc = Jsoup.connect(url).userAgent(userAgent).get()

This parses and loads the full HTML document into the doc variable.
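
If you want to confirm the fetch worked before going further, you can set an explicit timeout and print the document title. This is just an optional sanity check on top of the call above:

val doc = Jsoup.connect(url)
    .userAgent(userAgent)
    .timeout(10_000) // fail fast instead of hanging on a slow response
    .get()

println("Fetched page: ${doc.title()}")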

Selecting Target Table

Inspecting the page

If you inspect the page with the Chrome DevTools, you can see that the data lives in a table element with the classes wikitable and sortable.

We next select the table element containing the data we want to scrape - dog breed information:

val table = doc.select("table.wikitable.sortable").first()

This uses a CSS selector to target the table uniquely identified by the "wikitable sortable" classes.
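
Note that first() returns null if the selector matches nothing, for example if the page layout changes. A small defensive check (an addition to the original code) fails with a clear message instead of a confusing error later on:

val table = doc.select("table.wikitable.sortable").first()
    ?: error("Could not find the 'wikitable sortable' table - has the page layout changed?")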

Initializing Storage Lists

We initialize empty lists to store the data extracted from the table:

val names = mutableListOf<String>()
val groups = mutableListOf<String>()
val localNames = mutableListOf<String>()
val photographs = mutableListOf<String>()
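
As a design note, four parallel lists work, but a single list of a small data class is often easier to keep in sync. This is an optional alternative rather than what the rest of the article uses:

data class DogBreed(
    val name: String,
    val group: String,
    val localName: String,
    val photograph: String
)

val breeds = mutableListOf<DogBreed>()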

Creating Image Folder

Since we want to download the dog images, we create a folder to save them:

val imageFolder = File("dog_images")
imageFolder.mkdirs()

The mkdirs() call ensures the folder is created if it doesn't already exist.

Extracting Data from Table Rows

This is where the main data extraction occurs from the HTML.

We loop through each row, skipping the header:

for (row: Element in table.select("tr").drop(1)) {
    // extract data from each row
}

The key part is using selectors to extract elements from each column:

val columns = row.select("td, th")

val name = columns[0].select("a").text().trim()

val group = columns[1].text().trim()

val spanTag = columns[2].select("span").first()
val localName = spanTag?.text()?.trim() ?: ""

val imgTag = columns[3].select("img").first()
val photograph = imgTag?.attr("src") ?: ""

This shows how to:

  • Select the td and th cells from each row
  • Extract text from links and text nodes
  • Handle missing elements using null-safe operators
  • Get image attributes like src

The extracted data is then added to the previously defined storage lists.

    Downloading and Saving Images

    For each non-blank image link extracted, we download and save the image:

    if (photograph.isNotBlank()) {
        val imageFileName = File(imageFolder, "$name.jpg")
        downloadImage(photograph, imageFileName)
    }

    The downloadImage() function handles fetching the image binary data and saving it to the destination file.
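
    The helper itself appears in the full listing below; a minimal sketch looks like the following. One assumption here is that Wikimedia serves protocol-relative image URLs (starting with //), so we prepend https: before opening the stream. In practice you may also want to sanitize name before using it as a file name, since breed names can contain characters that are awkward in paths.

    fun downloadImage(imageUrl: String, destination: File) {
        // Prepend a scheme if the URL is protocol-relative ("//upload.wikimedia.org/...")
        val absoluteUrl = if (imageUrl.startsWith("//")) "https:$imageUrl" else imageUrl
        // Stream the bytes straight into the destination file
        URL(absoluteUrl).openStream().use { input ->
            Files.copy(input, destination.toPath(), StandardCopyOption.REPLACE_EXISTING)
        }
    }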

    Printing Extracted Data

    Finally, we can print out or process the extracted data now available in the lists:

    for (i in names.indices) {
        println("Name: ${names[i]}")
        println("Group: ${groups[i]}")
        // etc
    }

    This allows us to work with each piece of scraped data from the site.
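
    If printing is not enough, the same lists can just as easily be written to a file. The sketch below dumps everything to a CSV; the breeds.csv file name is arbitrary and the quoting is a deliberately simple escape:

    File("breeds.csv").printWriter().use { out ->
        out.println("name,group,local_name,photograph")
        for (i in names.indices) {
            // Wrap each field in quotes and double any embedded quotes
            val row = listOf(names[i], groups[i], localNames[i], photographs[i])
                .joinToString(",") { "\"" + it.replace("\"", "\"\"") + "\"" }
            out.println(row)
        }
    }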

    Full Code

    Here is the full code for reference:

    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    import org.jsoup.nodes.Element
    import java.io.File
    import java.io.FileOutputStream
    import java.io.IOException
    import java.io.InputStream
    import java.net.URL
    import java.nio.file.Files
    import java.nio.file.Path
    import java.nio.file.StandardCopyOption
    
    fun main() {
        // URL of the Wikipedia page
        val url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds"
    
        // Define a user-agent header to simulate a browser request
        val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    
        // Send an HTTP GET request to the URL with the headers
        val doc = Jsoup.connect(url).userAgent(userAgent).get()
    
        // Find the table with class 'wikitable sortable'
        val table = doc.select("table.wikitable.sortable").first()
    
        // Initialize lists to store the data
        val names = mutableListOf<String>()
        val groups = mutableListOf<String>()
        val localNames = mutableListOf<String>()
        val photographs = mutableListOf<String>()
    
        // Create a folder to save the images
        val imageFolder = File("dog_images")
        imageFolder.mkdirs()
    
        // Iterate through rows in the table (skip the header row)
        for (row: Element in table.select("tr").drop(1)) {
            val columns = row.select("td, th")
    
            if (columns.size == 4) {
                // Extract data from each column
                val name = columns[0].select("a").text().trim()
                val group = columns[1].text().trim()
    
                // Check if the second column contains a span element
                val spanTag = columns[2].select("span").first()
                val localName = spanTag?.text()?.trim() ?: ""
    
                // Check for the existence of an image tag within the fourth column
                val imgTag = columns[3].select("img").first()
                val photograph = imgTag?.attr("src") ?: ""
    
                // Download the image and save it to the folder
                if (photograph.isNotBlank()) {
                    val imageFileName = File(imageFolder, "$name.jpg")
                    downloadImage(photograph, imageFileName)
                }
    
                // Append data to respective lists
                names.add(name)
                groups.add(group)
                localNames.add(localName)
                photographs.add(photograph)
            }
        }
    
        // Print or process the extracted data as needed
        for (i in names.indices) {
            println("Name: ${names[i]}")
            println("FCI Group: ${groups[i]}")
            println("Local Name: ${localNames[i]}")
            println("Photograph: ${photographs[i]}")
            println()
        }
    }
    
    @Throws(IOException::class)
    fun downloadImage(imageUrl: String, destination: File) {
        // Wikimedia often serves protocol-relative image URLs ("//upload.wikimedia.org/...");
        // prepend "https:" so java.net.URL can parse them.
        val absoluteUrl = if (imageUrl.startsWith("//")) "https:$imageUrl" else imageUrl
        URL(absoluteUrl).openStream().use { inputStream ->
            Files.copy(inputStream, destination.toPath(), StandardCopyOption.REPLACE_EXISTING)
        }
    }

    The key concepts covered:

  • Using Jsoup to scrape HTML
  • Selecting elements using CSS selectors
  • Extracting and storing the needed data
  • Downloading binary assets like images

    In more advanced implementations you will even need to rotate the User-Agent string so the website can't tell it's the same browser. A simple sketch of this is shown below.
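
    A straightforward way to rotate the User-Agent string is to keep a small pool of strings and pick one at random for each request. The strings in this sketch are just illustrative examples, not a vetted list:

    // A hypothetical pool of desktop browser User-Agent strings
    val userAgents = listOf(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
    )

    // Pick a different identity for each request
    val doc = Jsoup.connect(url).userAgent(userAgents.random()).get()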

    Once you get a little more advanced, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with the running offer of 1000 free API calls, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high-speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed through a simple API, like the example below, from any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
