In this beginner-friendly guide, we will go through how to use Scala to scrape all images from a website. We won't go into web scraping theory -- instead we'll focus on understanding and running real-world code that accomplishes this specific task of extracting images.
This is the page we'll be scraping: a Wikimedia Commons list of dog breeds, with a photograph for each breed.
Prerequisites
To follow along, you'll need Java and SBT installed on your system, plus the Jsoup HTML parser. Add the Jsoup dependency to your SBT build file:
libraryDependencies += "org.jsoup" % "jsoup" % "1.13.1"
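If you're starting a project from scratch, a minimal build.sbt might look like the following; the project name and Scala version here are just illustrative choices:

name := "dog-image-scraper"
scalaVersion := "2.12.18"
libraryDependencies += "org.jsoup" % "jsoup" % "1.13.1"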
Now let's dive into the code!
Setup
We start by importing the required Scala and Java libraries:
import java.io.{File, FileOutputStream}
import org.jsoup.Jsoup
import scala.collection.JavaConverters._
The key one is Jsoup, which we'll use to fetch and parse the web page; JavaConverters lets us iterate over Jsoup's Java collections with Scala's for-comprehensions.
Next, we define the URL of the Wikipedia page we want to scrape:
val url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds"
We hard-code the URL since we're scraping this one known page.
We also set a user agent header to simulate a real browser request:
val userAgent =
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
Sending a browser-like user agent makes the request look like ordinary browser traffic, which some sites require before serving full content.
Making the HTTP Request
With the imports and variables defined, we can now use Jsoup to connect to the URL and get the HTML document:
val doc = Jsoup.connect(url).userAgent(userAgent).get()
The userAgent() method sets the custom header we defined earlier, and get() executes the GET request and parses the response HTML into a Document.
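Network requests can hang or fail, so in practice you may also want a timeout. Jsoup's timeout() method takes milliseconds; the 10-second value below is just an illustrative choice:

val doc = Jsoup.connect(url)
  .userAgent(userAgent)
  .timeout(10000)
  .get()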
Selecting the Table Element
Inspecting the page
If you inspect the page with Chrome DevTools, you can see that the data lives in a table element with the classes wikitable and sortable.
We use a CSS selector to select the table with class names "wikitable" and "sortable":
val table = doc.select("table.wikitable.sortable").first()
The select() method finds all matching elements, and we take the first one using .first().
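One caveat: first() returns null when nothing matches. A slightly more defensive sketch could wrap the lookup in an Option and fail with a clear message:

val table = Option(doc.select("table.wikitable.sortable").first())
  .getOrElse(sys.error("Could not find the wikitable on the page"))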
Initializing Storage Lists
With the target table element selected, we initialize some mutable Scala lists to store the extracted data:
val names = scala.collection.mutable.ListBuffer.empty[String]
val groups = scala.collection.mutable.ListBuffer.empty[String]
val localNames = scala.collection.mutable.ListBuffer.empty[String]
val photographs = scala.collection.mutable.ListBuffer.empty[String]
One list per data field we will scrape.
Creating Image Folder
Since we want to download all images from the page, we need a folder to save them:
val folder = new File("dog_images")
if (!folder.exists()) folder.mkdirs()
This creates a "dog_images" folder if it doesn't already exist.
Extracting Data from Table Rows
Now for the most complex part - extracting data within each row of the selected table element.
We loop through the rows, skipping the header:
for (row <- table.select("tr").asScala.drop(1)) {
// extract data from each row
}
Inside the loop, we select all "td" (normal cells) and "th" (header cells) within each row:
val columns = row.select("td, th").asScala.toSeq
And check that 4 columns were found - otherwise skip the row:
if (columns.size == 4) {
// extract data from each column
}
Extracting the Name
To extract the breed name, we select the anchor tag inside the first column:
val name = columns(0).select("a").text().trim
The select("a") finds the
We add the scraped name string to the respective list:
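names += name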
Extracting the Group
The group name exists directly as text in the second column. We extract it through:
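val group = columns(1).text().trim
groups += group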
Extracting the Local Name
Next, we look at the third column, which holds the breed's local name; some rows leave it empty.
We take whatever text the cell contains (an empty string if there is none).
And add to the localNames list:
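val localName = columns(2).text().trim
localNames += localName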
Extracting the Image
Finally, we check whether there is an img tag in the fourth column.
If an image exists, we extract its src attribute to get the image URL.
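Here we use absUrl("src") rather than attr("src") so that relative or protocol-relative paths are resolved into full URLs; an empty string stands in when a row has no image:

val img = columns(3).select("img").first()
val imgUrl = if (img != null) img.absUrl("src") else ""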
Downloading and Saving Images
For each found image, we download and save it to our folder:
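A sketch of the download step, reusing Jsoup for the image request. Calling ignoreContentType(true) is needed because the response is binary image data rather than HTML, and the fixed .jpg extension is a simplifying assumption (Commons thumbnails are usually JPEGs):

if (imgUrl.nonEmpty) {
  // fetch the raw image bytes
  val bytes = Jsoup.connect(imgUrl)
    .userAgent(userAgent)
    .ignoreContentType(true)
    .execute()
    .bodyAsBytes()

  // sanitize the breed name so it is safe to use as a filename
  val fileName = name.replaceAll("[^A-Za-z0-9 _-]", "_") + ".jpg"
  val out = new FileOutputStream(new File(folder, fileName))
  try out.write(bytes) finally out.close()
}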
We use the breed name in each image filename.
And add the photograph URL to its list:
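photographs += imgUrl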
This whole process repeats for every row, extracting and storing data from each column.
Printing the Extracted Data
Finally, we can print out or process the scraped data as needed:
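A minimal printout, assuming we just want one line per breed:

for (i <- names.indices) {
  println(s"${names(i)} | ${groups(i)} | ${localNames(i)} | ${photographs(i)}")
}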
That completes the scraper. To recap, we:
- Made an HTTP request for the web page HTML
- Selected the data table element
- Initialized lists to store extracted data
- Looped through rows, extracting information
- Downloaded and saved images
- Printed/processed scraped data
Full Code
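Below is the complete program assembled from the snippets above, wrapped in an object with a main method so it can be launched with sbt run. The object name and the error-handling choices are ours; the scraping logic follows the steps described in this guide.

import java.io.{File, FileOutputStream}
import org.jsoup.Jsoup
import scala.collection.JavaConverters._

object DogImageScraper {
  def main(args: Array[String]): Unit = {
    val url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds"
    val userAgent =
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

    // fetch and parse the page
    val doc = Jsoup.connect(url).userAgent(userAgent).get()
    val table = doc.select("table.wikitable.sortable").first()

    val names = scala.collection.mutable.ListBuffer.empty[String]
    val groups = scala.collection.mutable.ListBuffer.empty[String]
    val localNames = scala.collection.mutable.ListBuffer.empty[String]
    val photographs = scala.collection.mutable.ListBuffer.empty[String]

    val folder = new File("dog_images")
    if (!folder.exists()) folder.mkdirs()

    // iterate over table rows, skipping the header row
    for (row <- table.select("tr").asScala.drop(1)) {
      val columns = row.select("td, th").asScala.toSeq
      if (columns.size == 4) {
        val name = columns(0).select("a").text().trim
        names += name

        val group = columns(1).text().trim
        groups += group

        val localName = columns(2).text().trim
        localNames += localName

        val img = columns(3).select("img").first()
        val imgUrl = if (img != null) img.absUrl("src") else ""
        photographs += imgUrl

        if (imgUrl.nonEmpty) {
          // download the image bytes and save them under the breed's name
          val bytes = Jsoup.connect(imgUrl)
            .userAgent(userAgent)
            .ignoreContentType(true)
            .execute()
            .bodyAsBytes()
          val fileName = name.replaceAll("[^A-Za-z0-9 _-]", "_") + ".jpg"
          val out = new FileOutputStream(new File(folder, fileName))
          try out.write(bytes) finally out.close()
        }
      }
    }

    // print what we collected
    for (i <- names.indices) {
      println(s"${names(i)} | ${groups(i)} | ${localNames(i)} | ${photographs(i)}")
    }
  }
}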
Investing in a private rotating proxy service like Proxies API can often make the difference between a web scraping project that gets the job done consistently and one that never really works.
Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.
Our rotating proxy service, Proxies API, provides a simple API that can solve IP blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed through a simple API call from any programming language, like this:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
Related articles:
- Web Scraping Wikipedia in Scala
- Scraping All Images from a Website with Kotlin
- Scraping Wikipedia With Ruby
- Scraping Real Estate Listings From Realtor with Objective C
- Scraping All Images from a Website with R
- Scraping all the Images from a Website with Ruby
- Scraping Real Estate Listings From Realtor Using Rust