Scraping All Images from a Website with Java

Dec 13, 2023 · 7 min read

Web scraping is the process of extracting data from websites automatically. This is useful for gathering large datasets that would be tedious to collect manually. Here we will walk through Java code that scrapes all dog breed images from a Wikimedia Commons page.

This is the page we are talking about: https://commons.wikimedia.org/wiki/List_of_dog_breeds

Prerequisites

To follow along, you'll need:

  • Jsoup library for Java
  • Java 8+

Let's dive in and see how the scraping is done!

    Logic Overview

    At a high level, the code:

    1. Connects to the target Wikimedia Commons page
    2. Initializes variables to store the extracted data
    3. Iterates through each row of the dog breed table
    4. Downloads the images and saves them locally
    5. Prints out the extracted information

    Now let's break this down step-by-step.

    Understanding the Selectors

    While the logic may sound simple, the key part is properly extracting data from the raw HTML of the page. This is done using CSS selectors.

    CSS selectors allow targeting specific elements in the HTML document structure. For example, you can select all table rows, links, images etc.
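
    For instance, here is a minimal sketch of a few common selector patterns (assuming doc is a Document that has already been fetched):

    Elements rows = doc.select("tr");        // every table row on the page
    Elements links = doc.select("a[href]");  // every anchor with an href attribute
    Elements images = doc.select("img");     // every image element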

    Jsoup implements CSS selectors for querying the parsed document retrieved from the website. Let's see how they are used here:

    Selecting the Table

    Inspecting the page

    Using Chrome's inspect tool, you can see that the data is in a table element with the classes wikitable and sortable:

    First, we need to select the appropriate table element that actually contains the dog breed data.

    <table class="wikitable sortable">
      <!-- dog breed rows here-->
    </table>
    

    The table can be uniquely identified by its class attributes wikitable and sortable. Jsoup allows us to use a CSS selector string to target elements with given classes:

    Element table = doc.select("table.wikitable.sortable").first();
    

    Breaking this down:

  • table - selects elements by tag name
  • .wikitable - class selector, targets elements with the wikitable class
  • .sortable - same idea, targets elements with the sortable class
  • first() - returns just the first matching element

    So this selector finds the table element with BOTH matching classes, uniquely identifying the dog breed table.
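
    Note that first() returns null when nothing matches, so a defensive check is worthwhile in practice (a small sketch):

    Element table = doc.select("table.wikitable.sortable").first();
    if (table == null) {
        throw new IllegalStateException("Dog breed table not found - has the page layout changed?");
    }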

    Skipping Header Row

    Now that we have selected the table, we can loop through its rows:

    for (Element row : table.select("tr:gt(0)")) {
      // extract data from rows
    }
    

    Details:

  • tr - selects the table row (<tr>) elements
  • :gt(0) - filters to only rows GREATER THAN index 0
  • We then iterate over these rows (an index-based alternative is sketched below)
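
    If you'd rather not rely on the :gt() pseudo-selector (a Jsoup extension rather than standard CSS), a plain index-based loop does the same thing:

    Elements rows = table.select("tr");
    for (int i = 1; i < rows.size(); i++) {  // start at 1 to skip the header row
        Element row = rows.get(i);
        // extract data from row
    }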
    Getting Row Cells

    Next we get the cells within each row:

    Elements columns = row.select("td, th");
    

    This selects both <td> and <th> cells in the row using a multiple element selector (the comma lets one selector match either tag).

    We assign them to an Elements object which acts like an array of elements.
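
    In fact, Elements extends ArrayList<Element>, so it supports indexed access and iteration directly, as this quick sketch shows:

    Elements columns = row.select("td, th");
    int cellCount = columns.size();          // number of cells in this row
    Element firstCell = columns.get(0);      // indexed access, like any List
    for (Element cell : columns) {           // or iterate over the cells directly
        System.out.println(cell.text());
    }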

    Extracting Text from Elements

    Finally, having isolated elements, we can extract text or other attributes from them.

    Get link text of first cell:

    String name = columns.get(0).select("a").text().trim();
    
  • columns.get(0) - first cell
  • select("a") - anchor tag inside cell
  • .text() - extract text within anchor
  • .trim() - clean whitespace

    Other data is extracted similarly:

    String group = columns.get(1).text().trim();
    
    Element spanTag = columns.get(2).select("span").first();
    String localName = (spanTag != null) ? spanTag.text().trim() : "";
    
    Element imgTag = columns.get(3).select("img").first();
    String photograph = (imgTag != null) ? imgTag.attr("src") : "";
    

    These demonstrate usage of:

  • .text() to get cell text content
  • .select on cell elements to drill down further
  • .attr() to get attributes like src on image elements
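
    One caveat with src here: on Wikimedia pages the attribute value is usually protocol-relative (it starts with //), which java.net.URL cannot open directly. Jsoup's abs: attribute prefix resolves the value against the document's base URI, so the download step works; the full listing below uses this form:

    Element imgTag = columns.get(3).select("img").first();
    // "abs:src" turns //upload.wikimedia.org/... into https://upload.wikimedia.org/...
    String photograph = (imgTag != null) ? imgTag.attr("abs:src") : "";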
    Conclusion

    That covers the key functionality of the provided web scraping example. As you can see, Jsoup's selector API makes it easy to drill into the HTML and extract data at will!

    The full code is provided below for reference. To run it, you only need the jsoup library (the org.jsoup:jsoup artifact on Maven Central) on the classpath.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    import java.io.BufferedInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    
    public class DogBreedsScraper {
        public static void main(String[] args) {
            // URL of the Wikimedia Commons page listing dog breeds
            String url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds";
    
            // Define a user-agent header to simulate a browser request
            String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
    
            try {
                // Send an HTTP GET request to the URL with the headers
                Document doc = Jsoup.connect(url).userAgent(userAgent).get();
    
                // Find the table with class 'wikitable sortable'
                Element table = doc.select("table.wikitable.sortable").first();
    
                // Initialize lists to store the data
                StringBuilder names = new StringBuilder();
                StringBuilder groups = new StringBuilder();
                StringBuilder localNames = new StringBuilder();
                StringBuilder photographs = new StringBuilder();
    
                // Create a folder to save the images
                Path imagesFolder = Paths.get("dog_images");
                Files.createDirectories(imagesFolder);
    
                // Iterate through rows in the table (skip the header row)
                for (Element row : table.select("tr:gt(0)")) {
                    Elements columns = row.select("td, th");
                    if (columns.size() == 4) {
                        // Extract data from each column
                        String name = columns.get(0).select("a").text().trim();
                        String group = columns.get(1).text().trim();
    
                        // Check if the third column (index 2) contains a span element
                        Element spanTag = columns.get(2).select("span").first();
                        String localName = (spanTag != null) ? spanTag.text().trim() : "";
    
                        // Check for an image tag within the fourth column; "abs:src" resolves
                        // protocol-relative URLs (//upload.wikimedia.org/...) to absolute ones
                        Element imgTag = columns.get(3).select("img").first();
                        String photograph = (imgTag != null) ? imgTag.attr("abs:src") : "";
    
                        // Download the image and save it to the folder
                        if (!photograph.isEmpty()) {
                            String imageFilename = Paths.get("dog_images", name + ".jpg").toString();
                            downloadImage(photograph, imageFilename);
                        }
    
                        // Append data to respective lists
                        names.append("Name: ").append(name).append("\n");
                        groups.append("FCI Group: ").append(group).append("\n");
                        localNames.append("Local Name: ").append(localName).append("\n");
                        photographs.append("Photograph: ").append(photograph).append("\n\n");
                    }
                }
    
                // Print or process the extracted data as needed
                System.out.println(names.toString());
                System.out.println(groups.toString());
                System.out.println(localNames.toString());
                System.out.println(photographs.toString());
    
            } catch (IOException e) {
                System.err.println("Failed to retrieve the web page. Error: " + e.getMessage());
            }
        }
    
        // Streams the image at imageUrl to the destination file in 1 KB chunks
        private static void downloadImage(String imageUrl, String destinationPath) throws IOException {
            URL url = new URL(imageUrl);
            try (BufferedInputStream in = new BufferedInputStream(url.openStream());
                 FileOutputStream fileOutputStream = new FileOutputStream(destinationPath)) {
                byte[] dataBuffer = new byte[1024];
                int bytesRead;
                while ((bytesRead = in.read(dataBuffer, 0, 1024)) != -1) {
                    fileOutputStream.write(dataBuffer, 0, bytesRead);
                }
            }
        }
    }
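
    One practical caveat: the breed name is used directly as the image filename, so a name containing a character that is illegal in file paths would break the save step. A simple whitelist-based sanitizer (a sketch) guards against that:

    // Replace anything that isn't a letter, digit, space, hyphen or underscore
    String safeName = name.replaceAll("[^A-Za-z0-9 _-]", "_");
    String imageFilename = Paths.get("dog_images", safeName + ".jpg").toString();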

    In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same browser making every request!
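
    Here is a minimal sketch of the idea, using a small hard-coded pool (the strings are just sample browser signatures; these helpers would live inside the scraper class):

    // Pick a random User-Agent per request so consecutive requests don't share a signature
    private static final String[] USER_AGENTS = {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
    };
    private static final java.util.Random RANDOM = new java.util.Random();

    private static Document fetch(String url) throws IOException {
        String agent = USER_AGENTS[RANDOM.nextInt(USER_AGENTS.length)];
        return Jsoup.connect(url).userAgent(agent).get();
    }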

    Get a little more advanced, though, and you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with 1000 free API calls currently on offer, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • Millions of high speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
