Scraping All Images from a Website with Kotlin

Dec 13, 2023 · 7 min read

This article is a practical, step-by-step guide to scraping all the images from a website using real Kotlin code. We will focus on explaining the extraction code itself rather than covering web scraping basics or the Kotlin language.

This is the page we will be working with: the Wikimedia Commons list of dog breeds, from which we will scrape each breed's photograph.

Importing Libraries

We first import the libraries needed to send HTTP requests and parse HTML:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element

Jsoup will be used to connect to the web page, send a request, and parse the HTML document.
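
jsoup is the only third-party dependency the project needs. If you build with Gradle, a minimal declaration in build.gradle.kts might look like the sketch below; the version number is an assumption, so use whatever recent jsoup release is current:

dependencies {
    // jsoup handles both the HTTP request and the HTML parsing
    implementation("org.jsoup:jsoup:1.17.1")
}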

Defining Key Variables

Next we define the URL of the Wikimedia Commons page we want to scrape:

val url = "<https://commons.wikimedia.org/wiki/List_of_dog_breeds>"

We also define a user agent header to simulate a real browser request:

val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

Sending Request and Parsing HTML

We use Jsoup to connect to the URL and send a GET request with the user agent specified:

val doc = Jsoup.connect(url).userAgent(userAgent).get()

This parses and loads the full HTML document into the doc variable.
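
If you want to confirm the fetch worked before going further, you can set an explicit timeout and print the document title. This is just an optional sanity check on top of the call above:

val doc = Jsoup.connect(url)
    .userAgent(userAgent)
    .timeout(10_000) // fail fast instead of hanging on a slow response
    .get()

println("Fetched page: ${doc.title()}")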

Selecting Target Table

Inspecting the page

If you inspect the page with the Chrome DevTools, you can see that the data lives in a table element with the classes wikitable and sortable.

We next select the table element containing the data we want to scrape - dog breed information:

val table = doc.select("table.wikitable.sortable").first()

This uses a CSS selector to target the table uniquely identified by the "wikitable sortable" classes.
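
Note that first() returns null if the selector matches nothing, for example if the page layout changes. A small defensive check (an addition to the original code) fails with a clear message instead of a confusing error later on:

val table = doc.select("table.wikitable.sortable").first()
    ?: error("Could not find the 'wikitable sortable' table - has the page layout changed?")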

Initializing Storage Lists

We initialize empty lists to store the data extracted from the table:

val names = mutableListOf<String>()
val groups = mutableListOf<String>()
val localNames = mutableListOf<String>()
val photographs = mutableListOf<String>()
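
As a design note, four parallel lists work, but a single list of a small data class is often easier to keep in sync. This is an optional alternative rather than what the rest of the article uses:

data class DogBreed(
    val name: String,
    val group: String,
    val localName: String,
    val photograph: String
)

val breeds = mutableListOf<DogBreed>()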

Creating Image Folder

Since we want to download the dog images, we create a folder to save them:

val imageFolder = File("dog_images")
imageFolder.mkdirs()

The mkdirs() call ensures the folder is created if it doesn't already exist.

Extracting Data from Table Rows

This is where the main data extraction occurs from the HTML.

We loop through each row, skipping the header:

for (row: Element in table.select("tr").drop(1)) {
    // extract data from each row
}

The key part is using selectors to extract elements from each column:

val columns = row.select("td, th")

val name = columns[0].select("a").text().trim()

val group = columns[1].text().trim()

val spanTag = columns[2].select("span").first()
val localName = spanTag?.text()?.trim() ?: ""

val imgTag = columns[3].select("img").first()
val photograph = imgTag?.attr("src") ?: ""

This shows how to:

  • Select the td and th cells from each row
  • Extract text from links and text nodes
  • Handle missing elements using null-safe operators
  • Get image attributes like src

The extracted data is then added to the previously defined storage lists.

    Downloading and Saving Images

    For each non-blank image link extracted, we download and save the image:

    if (photograph.isNotBlank()) {
        val imageFileName = File(imageFolder, "$name.jpg")
        downloadImage(photograph, imageFileName)
    }

    The downloadImage() function handles fetching the image binary data and saving it to the destination file.
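
    The helper itself appears in the full listing below; a minimal sketch looks like the following. One assumption here is that Wikimedia serves protocol-relative image URLs (starting with //), so we prepend https: before opening the stream. In practice you may also want to sanitize name before using it as a file name, since breed names can contain characters that are awkward in paths.

    fun downloadImage(imageUrl: String, destination: File) {
        // Prepend a scheme if the URL is protocol-relative ("//upload.wikimedia.org/...")
        val absoluteUrl = if (imageUrl.startsWith("//")) "https:$imageUrl" else imageUrl
        // Stream the bytes straight into the destination file
        URL(absoluteUrl).openStream().use { input ->
            Files.copy(input, destination.toPath(), StandardCopyOption.REPLACE_EXISTING)
        }
    }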

    Printing Extracted Data

    Finally, we can print out or process the extracted data now available in the lists:

    for (i in names.indices) {
        println("Name: ${names[i]}")
        println("Group: ${groups[i]}")
        // etc
    }

    This allows us to work with each piece of scraped data from the site.
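
    If printing is not enough, the same lists can just as easily be written to a file. The sketch below dumps everything to a CSV; the breeds.csv file name is arbitrary and the quoting is a deliberately simple escape:

    File("breeds.csv").printWriter().use { out ->
        out.println("name,group,local_name,photograph")
        for (i in names.indices) {
            // Wrap each field in quotes and double any embedded quotes
            val row = listOf(names[i], groups[i], localNames[i], photographs[i])
                .joinToString(",") { "\"" + it.replace("\"", "\"\"") + "\"" }
            out.println(row)
        }
    }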

    Full Code

    Here is the full code for reference:

    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    import org.jsoup.nodes.Element
    import java.io.File
    import java.io.FileOutputStream
    import java.io.IOException
    import java.io.InputStream
    import java.net.URL
    import java.nio.file.Files
    import java.nio.file.Path
    import java.nio.file.StandardCopyOption
    
    fun main() {
        // URL of the Wikipedia page
        val url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds"
    
        // Define a user-agent header to simulate a browser request
        val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    
        // Send an HTTP GET request to the URL with the headers
        val doc = Jsoup.connect(url).userAgent(userAgent).get()
    
        // Find the table with class 'wikitable sortable'
        val table = doc.select("table.wikitable.sortable").first()
    
        // Initialize lists to store the data
        val names = mutableListOf<String>()
        val groups = mutableListOf<String>()
        val localNames = mutableListOf<String>()
        val photographs = mutableListOf<String>()
    
        // Create a folder to save the images
        val imageFolder = File("dog_images")
        imageFolder.mkdirs()
    
        // Iterate through rows in the table (skip the header row)
        for (row: Element in table.select("tr").drop(1)) {
            val columns = row.select("td, th")
    
            if (columns.size == 4) {
                // Extract data from each column
                val name = columns[0].select("a").text().trim()
                val group = columns[1].text().trim()
    
                // Check if the second column contains a span element
                val spanTag = columns[2].select("span").first()
                val localName = spanTag?.text()?.trim() ?: ""
    
                // Check for the existence of an image tag within the fourth column
                val imgTag = columns[3].select("img").first()
                val photograph = imgTag?.attr("src") ?: ""
    
                // Download the image and save it to the folder
                if (photograph.isNotBlank()) {
                    val imageFileName = File(imageFolder, "$name.jpg")
                    downloadImage(photograph, imageFileName)
                }
    
                // Append data to respective lists
                names.add(name)
                groups.add(group)
                localNames.add(localName)
                photographs.add(photograph)
            }
        }
    
        // Print or process the extracted data as needed
        for (i in names.indices) {
            println("Name: ${names[i]}")
            println("FCI Group: ${groups[i]}")
            println("Local Name: ${localNames[i]}")
            println("Photograph: ${photographs[i]}")
            println()
        }
    }
    
    @Throws(IOException::class)
    fun downloadImage(imageUrl: String, destination: File) {
        // Wikimedia often serves protocol-relative image URLs ("//upload.wikimedia.org/...");
        // prepend "https:" so java.net.URL can parse them.
        val absoluteUrl = if (imageUrl.startsWith("//")) "https:$imageUrl" else imageUrl
        URL(absoluteUrl).openStream().use { inputStream ->
            Files.copy(inputStream, destination.toPath(), StandardCopyOption.REPLACE_EXISTING)
        }
    }

    The key concepts covered:

  • Using Jsoup to scrape HTML
  • Selecting elements using CSS selectors
  • Extracting and storing the needed data
  • Downloading binary assets like images

    In more advanced implementations you will even need to rotate the User-Agent string so the website can't tell it's the same browser. A simple sketch of this is shown below.
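
    A straightforward way to rotate the User-Agent string is to keep a small pool of strings and pick one at random for each request. The strings in this sketch are just illustrative examples, not a vetted list:

    // A hypothetical pool of desktop browser User-Agent strings
    val userAgents = listOf(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
    )

    // Pick a different identity for each request
    val doc = Jsoup.connect(url).userAgent(userAgents.random()).get()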

    Once you get a little more advanced, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with the running offer of 1000 free API calls, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high-speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed through a simple API, like the example below, from any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
