Scraping all the Images from a Website with Rust

Web scraping is the process of programmatically extracting data from websites. In this tutorial, we'll learn how to use Rust to scrape all images from a website.

Our goal is to:

Send a request to download a web page
Parse through the HTML content
Find and save all the images on that page locally
Extract and store other data like image names and categories

We'll go through each step required to build a complete web scraper. No prior scraping experience needed!

Setup

Let's briefly go over the initial setup:

extern crate reqwest;
extern crate select;

We import two key Rust crates:

reqwest - For sending HTTP requests and handling responses

select - For parsing and querying HTML/XML

These provide the essential web scraping capabilities we need.

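We also import the standard library modules for file I/O, along with the select types used later in the tutorial:

use std::fs;
use std::io::Write;
use select::document::Document;
use select::predicate::{Attr, Name};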

With the imports out of the way, let's get scraping!

Making an HTTP Request

This is the page we are scraping: the list of dog breeds on Wikimedia Commons.

Let's send a simple GET request to download the raw HTML content:

let url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds";

let client = reqwest::blocking::Client::new();
let response = client.get(url)
                      // The trailing backslashes join the string into one line;
                      // a header value must not contain newlines.
                      .header("User-Agent",
                               "Mozilla/5.0 (Windows NT 10.0; Win64; x64) \
                               AppleWebKit/537.36 (KHTML, like Gecko) \
                               Chrome/58.0.3029.110 Safari/537.36")
                      .send()?;

We provide the URL we wish to scrape. Creating a Client object allows us to build HTTP requests. Calling .get() starts building a GET request, and .send() actually issues it.

Key Points:

Set a valid User-Agent header to mimic a real browser

Handle potential errors using ? after .send()

Always check the response status before proceeding!

Let's add a status check:

if response.status().is_success() {
  // Scrape page
} else {
  eprintln!("Failed with status: {}", response.status());
}

This verifies that the page loaded before scraping.

Parsing the HTML

Now that we've downloaded the raw HTML content, we can parse through it to extract data.

The select crate lets us query elements with predicates such as Name, Class, and Attr, which play the role that CSS selectors play in other tools. Let's parse the response:

let body = response.text()?;
let document = Document::from_read(body.as_bytes())?;

This parses the HTML body into a flexible Document structure we can traverse.

Time to analyze the page content and locate the target table of dog breeds.

Inspecting the page

If you open the page with Chrome's Inspect tool, you can see that the data sits in a table element with the classes wikitable and sortable.

Using the class name, we can directly access it:

let table = document.find(Attr("class", "wikitable sortable"))
                    .next()
                    .ok_or("Table not found")?;

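Note: Attr compares the whole attribute value, so this matches only when the class attribute is exactly "wikitable sortable". Similar approaches work for id attributes and for the crate's other predicates.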

Our scraper now has access to the core content!

Saving Images

Let's build in functionality to save images locally as we scrape them from the table:

fs::create_dir_all("dog_images")?;

This prepares a folder to store images.

Initializing Storage

To store structured data from each row, we initialize some vectors:

let mut names = vec![];
let mut groups = vec![];
let mut local_names = vec![];
let mut photographs = vec![];

They will capture the:

Dog's name

Group classification

Local name variation

Image file path

We iterate the table rows to populate them.
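Four parallel vectors work, but they have to be kept in sync by hand. As an alternative, here is a minimal sketch that groups the same four values into one record per row. The `Breed` struct and the `make_breed` helper are hypothetical names introduced for illustration; they are not part of the tutorial's code, and the values in `main` are made up rather than scraped.

```rust
/// One table row. The field names mirror the four vectors used in this tutorial.
#[derive(Debug, Clone, PartialEq)]
struct Breed {
    name: String,
    group: String,
    local_name: String,
    photograph: String,
}

/// Builds a record from the four cell values, trimming stray whitespace.
fn make_breed(name: &str, group: &str, local_name: &str, photograph: &str) -> Breed {
    Breed {
        name: name.trim().to_string(),
        group: group.trim().to_string(),
        local_name: local_name.trim().to_string(),
        photograph: photograph.trim().to_string(),
    }
}

fn main() {
    let mut breeds: Vec<Breed> = Vec::new();

    // Illustrative values only, not data taken from the page.
    breeds.push(make_breed(
        " Example Hound\n",
        "Group 6",
        "",
        "https://example.com/hound.jpg",
    ));

    for breed in &breeds {
        println!("{:?}", breed);
    }
}
```

With a single `Vec<Breed>`, a row can never end up with a name in one vector and no matching entry in another.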

Extracting Data

Let's walk through the full data extraction logic:

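for row in table.find(Name("tr")).skip(1) {

  // Collect the data cells of this row
  let columns = row.find(Name("td"))
                   .collect::<Vec<_>>();

  if columns.len() == 4 {

    // Trim the surrounding whitespace that .text() can include
    let name = columns[0].text().trim().to_string();

    let group = columns[1].text().trim().to_string();

    // The local name sits in a span inside the third cell, if present
    let span_tag = columns[2].find(Name("span"))
                              .next();

    let local_name = span_tag.map(|span| span.text())
                             .unwrap_or_default();

    // The image, if any, is in the fourth cell
    let img_tag = columns[3].find(Name("img"))
                             .next();

    // Fall back to an empty string when there is no img tag or no src
    let photograph = img_tag.map(|img| img.attr("src")
                                          .unwrap_or_default())
                            .unwrap_or_default();

    // ...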

This demonstrates:

Accessing table cells - Using index to directly access cell contents

Extracting text - Calling .text() on the cell

Finding nested tags - Searching for span and img tags

Getting attributes - Safely reading src attribute of images

We can pull data in various ways thanks to selectors.
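One detail about the src attribute: its value is not always a complete URL. It may be protocol-relative ("//host/path") or root-relative ("/path"). The sketch below shows one way to normalize such values using only the standard library. `resolve_src` is a hypothetical helper, the example.org addresses are placeholders, and paths relative to the current page are deliberately left unhandled; a full URL library would be the more robust choice.

```rust
/// Turns an `src` value into an absolute URL, given the page's scheme and host.
/// Returns None for empty values, inline `data:` images, and page-relative
/// paths, which this sketch does not try to resolve.
fn resolve_src(src: &str, scheme: &str, host: &str) -> Option<String> {
    let src = src.trim();
    if src.is_empty() || src.starts_with("data:") {
        None
    } else if src.starts_with("http://") || src.starts_with("https://") {
        // Already absolute.
        Some(src.to_string())
    } else if let Some(rest) = src.strip_prefix("//") {
        // Protocol-relative: reuse the page's scheme.
        Some(format!("{}://{}", scheme, rest))
    } else if src.starts_with('/') {
        // Root-relative: reuse the page's scheme and host.
        Some(format!("{}://{}{}", scheme, host, src))
    } else {
        None
    }
}

fn main() {
    // Placeholder inputs for illustration.
    let samples = ["//upload.example.org/a.jpg", "/img/b.png", "https://example.org/c.gif", ""];
    for src in samples {
        println!("{:?} -> {:?}", src, resolve_src(src, "https", "example.org"));
    }
}
```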

Downloading Images

Let's handle downloading and saving images:

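if !photograph.is_empty() {

  // src values may be protocol-relative ("//host/path"), so add a scheme.
  // Building a String here also gives client.get() a type it accepts.
  let image_url = if photograph.starts_with("//") {
    format!("https:{}", photograph)
  } else {
    photograph.to_string()
  };

  let image_response = client.get(&image_url)
                              .send()?;

  if image_response.status().is_success() {

    let image_filename = format!("dog_images/{}.jpg", name);

    let mut image_file = fs::File::create(image_filename)?;

    // Write the raw response body to disk
    image_file.write_all(&image_response.bytes()?)?;

  }

}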

For valid images, we:

Check that the src isn't empty
Make a separate request to download
Verify it succeeds
Construct a file name
Write out the image bytes

And we have a scraper that saves images!
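The file name step deserves some care: a breed name can contain spaces, slashes, or a trailing newline, and not every image is a JPEG. Here is a small sketch, using only the standard library, of how the file name could be derived more defensively. `safe_file_stem`, `extension_from_url`, and `image_path` are hypothetical helpers, not part of the tutorial's code, and the sketch assumes the image URL is absolute.

```rust
/// Trims the name and replaces anything other than letters, digits,
/// '-' and '_' with an underscore.
fn safe_file_stem(name: &str) -> String {
    let cleaned: String = name
        .trim()
        .chars()
        .map(|c| if c.is_alphanumeric() || c == '-' || c == '_' { c } else { '_' })
        .collect();
    if cleaned.is_empty() { "unnamed".to_string() } else { cleaned }
}

/// Guesses a file extension from the URL path, falling back to "jpg".
fn extension_from_url(url: &str) -> String {
    // Drop any query string or fragment.
    let no_query = url.split(|c| c == '?' || c == '#').next().unwrap_or("");
    // Skip the scheme and host so the host name is never mistaken for a file.
    let after_scheme = match no_query.find("://") {
        Some(i) => &no_query[i + 3..],
        None => no_query,
    };
    let path = match after_scheme.find('/') {
        Some(i) => &after_scheme[i..],
        None => "",
    };
    let last_segment = path.rsplit('/').next().unwrap_or("");
    match last_segment.rsplit_once('.') {
        Some((stem, ext))
            if !stem.is_empty()
                && !ext.is_empty()
                && ext.len() <= 4
                && ext.chars().all(|c| c.is_ascii_alphanumeric()) =>
        {
            ext.to_ascii_lowercase()
        }
        _ => "jpg".to_string(),
    }
}

/// Combines the folder, the cleaned name, and the guessed extension.
fn image_path(dir: &str, name: &str, url: &str) -> String {
    format!("{}/{}.{}", dir, safe_file_stem(name), extension_from_url(url))
}

fn main() {
    // Placeholder name and URL for illustration.
    println!("{}", image_path("dog_images", " Example Hound\n", "https://example.com/img/hound.PNG?w=120"));
}
```

The result of `image_path` could be passed to `fs::File::create` in place of the hard-coded `.jpg` name.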

Storing Extracted Data

Finally, let's store the extracted data:

names.push(name);
groups.push(group);

local_names.push(local_name);

photographs.push(photograph);

We populate each vector accordingly to capture key details.

We can then print or process the metadata further:

for i in 0..names.len() {

  println!("Name: {}", names[i]);

  // ...

}

There we have it - a complete web scraper!

Handling Errors

We use ? throughout for concise error handling:

let response = client.get(url)
                      .send()?;

This surfaces errors clearly. Some common ones:

Network failures

Invalid URLs/pages

Query/selector failures
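To make failures like these easier to tell apart, one option is a small error type of your own. The sketch below is std-only and uses hypothetical names (`ScrapeError`, `check_status`, `require`, `run`); the status code and the optional table are passed in as plain values so the example runs without any network access. Because the type implements `std::error::Error`, the `?` operator can convert it into the `Box<dyn std::error::Error>` that `main` returns in this tutorial.

```rust
use std::fmt;

/// Failure kinds a scraper might want to report separately.
#[derive(Debug)]
enum ScrapeError {
    BadStatus(u16),
    MissingElement(&'static str),
}

impl fmt::Display for ScrapeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ScrapeError::BadStatus(code) => write!(f, "unexpected HTTP status {}", code),
            ScrapeError::MissingElement(what) => write!(f, "element not found: {}", what),
        }
    }
}

impl std::error::Error for ScrapeError {}

/// Accepts only 2xx status codes.
fn check_status(code: u16) -> Result<(), ScrapeError> {
    if (200..300).contains(&code) {
        Ok(())
    } else {
        Err(ScrapeError::BadStatus(code))
    }
}

/// Converts a missing query result into a descriptive error.
fn require<T>(found: Option<T>, what: &'static str) -> Result<T, ScrapeError> {
    found.ok_or(ScrapeError::MissingElement(what))
}

/// Stand-in for the scraping flow; `?` boxes each ScrapeError automatically.
fn run(code: u16, table: Option<&str>) -> Result<String, Box<dyn std::error::Error>> {
    check_status(code)?;
    let table = require(table, "table.wikitable")?;
    Ok(table.to_string())
}

fn main() {
    if let Err(e) = run(404, None) {
        eprintln!("scrape failed: {}", e);
    }
}
```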

Printing a descriptive message at each failure point makes these issues easier to identify. Here is the complete program:

extern crate reqwest;
extern crate select;

use std::fs;
use std::io::Write;
use select::document::Document;
use select::predicate::{Name, Attr};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // URL of the Wikimedia Commons page
    let url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds";

    // Send an HTTP GET request to the URL
    let client = reqwest::blocking::Client::new();
    let response = client.get(url).header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36").send()?;

    // Check if the request was successful (status code 200)
    if response.status().is_success() {
        // Parse the HTML content of the page
        let body = response.text()?;
        let document = Document::from_read(body.as_bytes())?;

        // Find the table with class 'wikitable sortable'
        let table = document.find(Attr("class", "wikitable sortable")).next().ok_or("Table not found")?;

        // Create a folder to save the images
        fs::create_dir_all("dog_images")?;

        // Initialize vectors to store the data
        let mut names = vec![];
        let mut groups = vec![];
        let mut local_names = vec![];
        let mut photographs = vec![];

        // Iterate through rows in the table (skip the header row)
        for row in table.find(Name("tr")).skip(1) {
            // Extract data from each column
            let columns = row.find(Name("td")).collect::<Vec<_>>();
            if columns.len() == 4 {
                // Trim the surrounding whitespace that .text() can include
                let name = columns[0].text().trim().to_string();
                let group = columns[1].text().trim().to_string();

                // Check if the third column contains a span element
                let span_tag = columns[2].find(Name("span")).next();
                let local_name = span_tag.map(|span| span.text()).unwrap_or_default();

                // Check for the existence of an image tag within the fourth column
                let img_tag = columns[3].find(Name("img")).next();
                let photograph = img_tag.map(|img| img.attr("src").unwrap_or_default()).unwrap_or_default();

                // Download the image and save it to the folder
                if !photograph.is_empty() {
                    // src values may be protocol-relative ("//host/path"), so add a scheme
                    let image_url = if photograph.starts_with("//") {
                        format!("https:{}", photograph)
                    } else {
                        photograph.to_string()
                    };
                    let image_response = client.get(&image_url).send()?;
                    if image_response.status().is_success() {
                        let image_filename = format!("dog_images/{}.jpg", name);
                        let mut image_file = fs::File::create(image_filename)?;
                        image_file.write_all(&image_response.bytes()?)?;
                    }
                }

                // Append data to respective vectors
                names.push(name);
                groups.push(group);
                local_names.push(local_name);
                photographs.push(photograph);
            }
        }

        // Print or process the extracted data as needed
        for i in 0..names.len() {
            println!("Name: {}", names[i]);
            println!("FCI Group: {}", groups[i]);
            println!("Local Name: {}", local_names[i]);
            println!("Photograph: {}", photographs[i]);
            println!();
        }
    } else {
        eprintln!("Failed to retrieve the web page. Status code: {}", response.status());
    }

    Ok(())
}

Wrap Up

In this step-by-step tutorial, we:

Learned how to send requests and parse HTML
Extracted data like image URLs and metadata
Downloaded and saved images locally
Stored extracted content in structured vectors

We covered the core pieces needed to build a working web scraper with Rust!

Some ideas for next steps:

Process and analyze extracted data

Write to files/databases for persistence

Automate scraping multiple pages
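For the persistence idea above, a CSV file is often the simplest starting point. The following is a minimal std-only sketch, assuming each row holds the same four values the tutorial collects. `csv_field`, `csv_line`, and `write_csv` are hypothetical helpers, the output file name is arbitrary, and the sample row is made up; a dedicated CSV library would handle more edge cases.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Quotes a field when it contains a comma, quote, or line break,
/// doubling any embedded quotes.
fn csv_field(value: &str) -> String {
    if value.contains(',') || value.contains('"') || value.contains('\n') || value.contains('\r') {
        format!("\"{}\"", value.replace('"', "\"\""))
    } else {
        value.to_string()
    }
}

/// Joins already-escaped fields into one CSV line (without the newline).
fn csv_line(fields: &[&str]) -> String {
    fields.iter().map(|f| csv_field(f)).collect::<Vec<_>>().join(",")
}

/// Writes a header plus one line per row to `path`.
fn write_csv(path: &Path, rows: &[[String; 4]]) -> io::Result<()> {
    let mut out = BufWriter::new(File::create(path)?);
    writeln!(out, "{}", csv_line(&["name", "group", "local_name", "photograph"]))?;
    for row in rows {
        let fields: Vec<&str> = row.iter().map(String::as_str).collect();
        writeln!(out, "{}", csv_line(&fields))?;
    }
    out.flush()
}

fn main() -> io::Result<()> {
    // Illustrative row only, not data taken from the page.
    let rows = vec![[
        "Example Hound".to_string(),
        "Group 6, Scent hounds".to_string(),
        String::new(),
        "https://example.com/hound.jpg".to_string(),
    ]];
    let path = std::env::temp_dir().join("dog_breeds.csv");
    write_csv(&path, &rows)?;
    println!("wrote {}", path.display());
    Ok(())
}
```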

In more advanced implementations, you will even need to rotate the User-Agent string so the website can't tell it's the same browser!
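As a rough illustration of rotation, the sketch below cycles through a fixed list of strings in round-robin order. `UserAgentPool` is a hypothetical type and the agent strings are placeholders, not real browser identifiers; in a real scraper, the value returned by `next_agent` would be passed to the User-Agent header of each request.

```rust
/// Hands out User-Agent strings in round-robin order.
struct UserAgentPool {
    agents: Vec<String>,
    next: usize,
}

impl UserAgentPool {
    /// Returns None when the list is empty, since there is nothing to rotate.
    fn new(agents: &[&str]) -> Option<Self> {
        if agents.is_empty() {
            None
        } else {
            Some(UserAgentPool {
                agents: agents.iter().map(|a| a.to_string()).collect(),
                next: 0,
            })
        }
    }

    /// Returns the current agent and advances to the next one, wrapping around.
    fn next_agent(&mut self) -> &str {
        let index = self.next;
        self.next = (index + 1) % self.agents.len();
        &self.agents[index]
    }
}

fn main() {
    // Placeholder strings for illustration only.
    let mut pool = UserAgentPool::new(&["ExampleAgent/1.0", "ExampleAgent/2.0"])
        .expect("the list is not empty");
    for _ in 0..3 {
        println!("{}", pool.next_agent());
    }
}
```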

As you get more advanced, you will find that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high-speed rotating proxies located all over the world, automatic IP rotation, automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions), and automatic CAPTCHA solving technology, hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
