Scraping All Images from a Website with Scala

Dec 13, 2023 · 8 min read

In this beginner-friendly guide, we will go through how to use Scala to scrape all images from a website. We won't go into web scraping theory -- instead we'll focus on understanding and running real-world code that accomplishes this specific task of extracting images.

This is page we are talking about… We will be scraping images of dog breeds from Wikipedia

Prerequisites

To follow along, you'll need:

  • JDK 8+
  • SBT
  • Jsoup library
  • First, make sure Java and SBT are installed on your system. Then add the Jsoup dependency in your SBT build file:

    libraryDependencies += "org.jsoup" % "jsoup" % "1.13.1"
    

    Run sbt update to download the Jsoup library.

    Now let's dive into the code!

    Setup

    We start by importing the required Scala and Java libraries:

    import java.io.{File, FileOutputStream}
    
    import org.jsoup.Jsoup
    
    import scala.collection.JavaConverters._
    

    The key one is Jsoup which we'll use for scraping the web page.

    Next, we define the URL of the Wikipedia page we want to scrape:

    val url = "<https://commons.wikimedia.org/wiki/List_of_dog_breeds>"
    

    No modifications needed for the URL string. We hard-code it exactly as shown.

    We also set a user agent header to simulate a real browser request:

    val userAgent =
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    

    Preserving the user agent string literal here.

    Making the HTTP Request

    With the imports and variables defined, we can now use Jsoup to connect to the URL and get the HTML document:

    val doc = Jsoup.connect(url).userAgent(userAgent).get()
    

    The userAgent() method sets the custom header we defined earlier.

    Selecting the Table Element

    Inspecting the page

    You can see when you use the chrome inspect tool that the data is in a table element with the class wikitable and sortable

    We use a CSS selector to select the table with class names "wikitable" and "sortable":

    val table = doc.select("table.wikitable.sortable").first()
    

    The select() method finds all matching elements, and we take the first one using .first().

    Initializing Storage Lists

    With the target table element selected, we initialize some mutable Scala lists to store the extracted data:

    val names = scala.collection.mutable.ListBuffer.empty[String]
    
    val groups = scala.collection.mutable.ListBuffer.empty[String]
    
    val localNames = scala.collection.mutable.ListBuffer.empty[String]
    
    val photographs = scala.collection.mutable.ListBuffer.empty[String]
    

    One list per data field we will scrape.

    Creating Image Folder

    Since we want to download all images from the page, we need a folder to save them:

    val folder = new File("dog_images")
    
    if (!folder.exists()) folder.mkdirs()
    

    This creates a "dog_images" folder if it doesn't already exist.

    Extracting Data from Table Rows

    Now for the most complex part - extracting data within each row of the selected table element.

    We loop through the rows, skipping the header:

    for (row <- table.select("tr").asScala.drop(1)) {
    
    // extract data from each row
    
    }
    

    Inside the loop, we select all "td" (normal cells) and "th" (header cells) within each row:

    val columns = row.select("td, th").asScala.toSeq
    

    And check that 4 columns were found - otherwise skip the row:

    if (columns.size == 4) {
    
      // extract data from each column
    
    }
    

    Extracting the Name

    To extract the breed name, we select the anchor tag inside the first column:

    val name = columns(0).select("a").text().trim
    

    The select("a") finds the anchor element, text() gets the text inside it, and trim removes whitespace.

    We add the scraped name string to the respective list:

    Extracting the Group

    The group name exists directly as text in the second column. We extract it through:

    Extracting the Local Name

    Next, we check if there is a tag inside the third column:

    If found, we get the text inside element. Otherwise set to empty string.

    And add to the localNames list:

    Extracting the Image

    Finally, we check if there is an tag inside the 4th column:

    If an image exists, we extract its src attribute to get the image URL.

    Downloading and Saving Images

    For each found image, we download and save it to our folder:

    We use the breed name in each image filename.

    And add the photograph URL to its list:

    This whole process repeats for every row, extracting and storing data from each column.

    Printing the Extracted Data

    Finally, we can print out or process the scraped data as needed:

    So in summary, we:

    1. Made an HTTP request for the web page HTML
    2. Selected the data table element
    3. Initialized lists to store extracted data
    4. Looped through rows, extracting information
    5. Downloaded and saved images
    6. Printed/processed scraped data

    Full Code

    import java.io.{File, FileOutputStream}
    
    import org.jsoup.Jsoup
    
    import scala.collection.JavaConverters._
    
    object DogBreedsScraper {
    
      def main(args: Array[String]): Unit = {
    
        // URL of the Wikipedia page
        val url = "<https://commons.wikimedia.org/wiki/List_of_dog_breeds>"
    
        // Define a user-agent header to simulate a browser request
        val userAgent =
          "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    
        // Send an HTTP GET request to the URL with the headers
        val doc = Jsoup.connect(url).userAgent(userAgent).get()
    
        // Find the table with class 'wikitable sortable'
        val table = doc.select("table.wikitable.sortable").first()
    
        // Initialize lists to store the data
        val names = scala.collection.mutable.ListBuffer.empty[String]
        val groups = scala.collection.mutable.ListBuffer.empty[String]
        val localNames = scala.collection.mutable.ListBuffer.empty[String]
        val photographs = scala.collection.mutable.ListBuffer.empty[String]
    
        // Create a folder to save the images
        val folder = new File("dog_images")
    
        if (!folder.exists()) folder.mkdirs()
    
        // Iterate through rows in the table (skip the header row)
        for (row <- table.select("tr").asScala.drop(1)) {
    
          val columns = row.select("td, th").asScala.toSeq
    
          if (columns.size == 4) {
    
            // Extract data from each column
            val name = columns(0).select("a").text().trim
            val group = columns(1).text().trim
    
            // Check if the second column contains a span element
            val spanTag = columns(2).select("span").first()
            val localName = if (spanTag != null) spanTag.text().trim else ""
    
            // Check for the existence of an image tag within the fourth column
            val imgTag = columns(3).select("img").first()
            val photograph = if (imgTag != null) imgTag.attr("src") else ""
    
            // Download the image and save it to the folder
            if (!photograph.isEmpty) {
    
              val imageStream = Jsoup.connect(photograph).ignoreContentType(true).execute().bodyAsBytes()
              val imageFile = new File(folder, s"${name}.jpg")
              val fos = new FileOutputStream(imageFile)
              fos.write(imageStream)
              fos.close()
    
            }
    
            // Append data to respective lists
            names += name
            groups += group
            localNames += localName
            photographs += photograph
    
          }
    
        }
    
        // Print or process the extracted data as needed
        for (i <- names.indices) {
    
          println("Name:", names(i))
          println("FCI Group:", groups(i))
          println("Local Name:", localNames(i))
          println("Photograph:", photographs(i))
          println()
    
        }
    
      }
    
    }
    

    In more advanced implementations you will need to even rotate the User-Agent string so the website cant tell its the same browser!

    If we get a little bit more advanced, you will realize that the server can simply block your IP ignoring all your other tricks. This is a bummer and this is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can most of the time make the difference between a successful and headache-free web scraping project which gets the job done consistently and one that never really works.

    Plus with the 1000 free API calls running an offer, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration to its hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology,
  • Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.

    Browse by tags:

    Browse by language:

    Tired of getting blocked while scraping the web?

    ProxiesAPI handles headless browsers and rotates proxies for you.
    Get access to 1,000 free API credits, no credit card required!