Scraping all the Images from a Website with Rust

Dec 13, 2023 · 8 min read

Web scraping is the process of programmatically extracting data from websites. In this tutorial, we'll learn how to use Rust to scrape all images from a website.

Our goal is to:

  1. Send a request to download a web page
  2. Parse through the HTML content
  3. Find and save all the images on that page locally
  4. Extract and store other data like image names and categories

We'll go through each step required to build a complete web scraper. No prior scraping experience needed!

Setup

Let's briefly go over the initial setup:

extern crate reqwest;
extern crate select;

We import two key Rust crates:

  • reqwest - For sending HTTP requests and handling responses
  • select - For parsing and querying HTML/XML

Together, these provide the essential web scraping capabilities we need.
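
For these imports to resolve, both crates need to be declared as dependencies. A minimal Cargo.toml sketch might look like the following; the version numbers are only illustrative, but note that the blocking client used throughout this tutorial does require reqwest's blocking feature:

    [dependencies]
    # HTTP client; the "blocking" feature enables the reqwest::blocking API used below
    reqwest = { version = "0.11", features = ["blocking"] }
    # HTML parsing and querying via predicates
    select = "0.6"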

    We also import standard Rust modules for file I/O, along with the select types we'll use for parsing:

    use std::fs;
    use std::io::Write;
    use select::document::Document;
    use select::predicate::{Attr, Name};
    

    With the imports out of the way, let's get scraping!

    Making an HTTP Request

    This is the page we'll be scraping: the Wikipedia list of dog breeds.

    Let's send a simple GET request to download the raw HTML content:

    let url = "<https://commons.wikimedia.org/wiki/List_of_dog_breeds>";
    
    let client = reqwest::blocking::Client::new();
    let response = client.get(url)
        .header(
            "User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        )
        .send()?;
    

    We provide the URL we wish to scrape. Creating a Client object allows us to build HTTP requests. Calling .get() issues a GET request.

    Key Points:

  • Set a valid User-Agent header to mimic a real browser
  • Handle potential errors using ? after .send()
  • Always check the response status before proceeding!

    Let's add a status check:

    if response.status().is_success() {
      // Scrape page
    } else {
      eprintln!("Failed with status: {}", response.status());
    }
    

    This verifies that the page loaded before scraping.

    Parsing the HTML

    Now that we've downloaded the raw HTML content, we can parse through it to extract data.

    The select crate lets us query elements by tag name, class, and attribute predicates. Let's parse the downloaded HTML:

    let body = response.text()?;
    let document = Document::from_read(body.as_bytes())?;
    

    This parses the HTML body into a flexible Document structure we can traverse.
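
    Before hunting for the table, a quick optional sanity check can confirm that the document parsed as expected. A small sketch that prints the page title and counts the links (nothing here is required for the scraper):

    // Optional sanity check: print the <title> text and count the <a> tags.
    if let Some(title) = document.find(Name("title")).next() {
        println!("Page title: {}", title.text());
    }
    println!("Links found: {}", document.find(Name("a")).count());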

    Time to analyze the page content and locate the target table of dog breeds.

    Inspecting the page

    Using Chrome's inspect tool, you can see that the data sits in a table element with the classes wikitable and sortable.

    Using the class name, we can directly access it:

    let table = document.find(Attr("class", "wikitable sortable"))
                        .next()
                        .ok_or("Table not found")?;
    

    Note: similar approaches work for id attributes or other predicates.
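
    As an illustration of those other options, here is a hedged sketch that matches the same table with select's Class and And predicates instead of the exact class string:

    use select::predicate::{And, Class, Name};

    // Match any <table> that carries the "wikitable" class,
    // regardless of which other classes it also has.
    let table = document
        .find(And(Name("table"), Class("wikitable")))
        .next()
        .ok_or("Table not found")?;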

    Our scraper now has access to the core content!

    Saving Images

    Let's build in functionality to save images locally as we scrape them from the table:

    fs::create_dir_all("dog_images")?;
    

    This prepares a folder to store images.

    Initializing Storage

    To store structured data from each row, we initialize some vectors:

    let mut names = vec![];
    let mut groups = vec![];
    let mut local_names = vec![];
    let mut photographs = vec![];
    

    They will capture the:

  • Dog's name
  • Group classification
  • Local name variation
  • Image file path

    We iterate the table rows to populate them.
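
    As a side note, four parallel vectors work fine here, but a single record type per row is arguably more idiomatic Rust. A hedged sketch of that alternative (the DogBreed struct is our own naming, not something from the page):

    // Hypothetical alternative to the parallel vectors: one record per row.
    #[derive(Debug)]
    struct DogBreed {
        name: String,
        group: String,
        local_name: String,
        photograph: String,
    }

    let mut breeds: Vec<DogBreed> = Vec::new();
    // Inside the row loop you would then push one complete record:
    // breeds.push(DogBreed { name, group, local_name, photograph: photograph.to_string() });

    The tutorial sticks with the vectors from here on, so the struct is purely optional.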

    Extracting Data

    Let's walk through the full data extraction logic:

    for row in table.find(Name("tr")).skip(1) {
        let columns = row.find(Name("td"))
                         .collect::<Vec<_>>();

        if columns.len() == 4 {
            let name = columns[0].text();
            let group = columns[1].text();

            let span_tag = columns[2].find(Name("span")).next();
            let local_name = span_tag.map(|span| span.text())
                                     .unwrap_or_default();

            let img_tag = columns[3].find(Name("img")).next();
            let photograph = img_tag.map(|img| img.attr("src").unwrap_or_default())
                                    .unwrap_or_default();

            // ...
    

    This demonstrates:

  • Accessing table cells - Using index to directly access cell contents
  • Extracting text - Calling .text() on the cell
  • Finding nested tags - Searching for span and img tags
  • Getting attributes - Safely reading src attribute of images

    We can pull data in various ways thanks to these selectors.

    Downloading Images

    Let's handle downloading and saving images:

    if !photograph.is_empty() {
        let image_url = photograph;

        let image_response = client.get(image_url).send()?;

        if image_response.status().is_success() {
            let image_filename = format!("dog_images/{}.jpg", name);
            let mut image_file = fs::File::create(image_filename)?;
            image_file.write_all(&image_response.bytes()?)?;
        }
    }
    

    For valid images, we:

    1. Check that the src isn't empty
    2. Make a separate request to download
    3. Verify it succeeds
    4. Construct a file name
    5. Write out the image bytes

    And we have a scraper that saves images!
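
    One caveat worth mentioning: on Wikipedia-style pages the img src is often protocol-relative (it starts with //), and breed names can contain characters that make poor file names. Neither case is handled above; here is a hedged sketch of two helper functions (our own additions, not part of the original code) that cover them:

    // Hypothetical helpers, not part of the original scraper.

    // Turn a protocol-relative src ("//upload.wikimedia.org/...") into a full URL.
    fn absolute_url(src: &str) -> String {
        if src.starts_with("//") {
            format!("https:{}", src)
        } else {
            src.to_string()
        }
    }

    // Replace path separators so the breed name is safe to use as a file name.
    fn safe_filename(name: &str) -> String {
        name.trim().replace(|c: char| c == '/' || c == '\\', "_")
    }

    With these in place, the download would use client.get(absolute_url(photograph)) and the file name would come from safe_filename(&name).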

    Storing Extracted Data

    Finally, let's store the extracted data:

    names.push(name);
    groups.push(group);
    local_names.push(local_name);
    photographs.push(photograph);
    

    We populate each vector accordingly to capture key details.

    We can then print or process the metadata further:

    for i in 0..names.len() {
        println!("Name: {}", names[i]);
        // ...
    }
    

    There we have it - a complete web scraper!

    Handling Errors

    We use ? throughout for concise error handling:

    let response = client.get(url)
                          .send()?;
    

    This surfaces errors clearly. Some common ones:

  • Network failures
  • Invalid URLs/pages
  • Query/selector failures

    Print descriptive errors before scraping to identify issues.
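
    If you want errors that are more descriptive than what a bare ? gives you, one option is to attach context as the error bubbles up. A minimal sketch using map_err (the message wording is ours):

    // Attach a human-readable context message at each failure point.
    let response = client
        .get(url)
        .send()
        .map_err(|e| format!("request to {} failed: {}", url, e))?;

    let body = response
        .text()
        .map_err(|e| format!("could not read the response body: {}", e))?;

    With error handling covered, here is the complete scraper from start to finish: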

    extern crate reqwest;
    extern crate select;
    
    use std::fs;
    use std::io::Write;
    use select::document::Document;
    use select::predicate::{Name, Attr};
    
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // URL of the Wikipedia page
        let url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds";
    
        // Send an HTTP GET request to the URL
        let client = reqwest::blocking::Client::new();
        let response = client.get(url).header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36").send()?;
    
        // Check if the request was successful (status code 200)
        if response.status().is_success() {
            // Parse the HTML content of the page
            let body = response.text()?;
            let document = Document::from_read(body.as_bytes())?;
    
            // Find the table with class 'wikitable sortable'
            let table = document.find(Attr("class", "wikitable sortable")).next().ok_or("Table not found")?;
    
            // Create a folder to save the images
            fs::create_dir_all("dog_images")?;
    
            // Initialize vectors to store the data
            let mut names = vec![];
            let mut groups = vec![];
            let mut local_names = vec![];
            let mut photographs = vec![];
    
            // Iterate through rows in the table (skip the header row)
            for row in table.find(Name("tr")).skip(1) {
                // Extract data from each column
                let columns = row.find(Name("td")).collect::<Vec<_>>();
                if columns.len() == 4 {
                    let name = columns[0].text();
                    let group = columns[1].text();
    
                    // Check if the third column contains a span element
                    let span_tag = columns[2].find(Name("span")).next();
                    let local_name = span_tag.map(|span| span.text()).unwrap_or_default();
    
                    // Check for the existence of an image tag within the fourth column
                    let img_tag = columns[3].find(Name("img")).next();
                    let photograph = img_tag.map(|img| img.attr("src").unwrap_or_default()).unwrap_or_default();
    
                    // Download the image and save it to the folder
                    if !photograph.is_empty() {
                        let image_url = photograph;
                        let image_response = client.get(image_url).send()?;
                        if image_response.status().is_success() {
                            let image_filename = format!("dog_images/{}.jpg", name);
                            let mut image_file = fs::File::create(image_filename)?;
                            image_file.write_all(&image_response.bytes()?)?;
                        }
                    }
    
                    // Append data to respective vectors
                    names.push(name);
                    groups.push(group);
                    local_names.push(local_name);
                    photographs.push(photograph);
                }
            }
    
            // Print or process the extracted data as needed
            for i in 0..names.len() {
                println!("Name: {}", names[i]);
                println!("FCI Group: {}", groups[i]);
                println!("Local Name: {}", local_names[i]);
                println!("Photograph: {}", photographs[i]);
                println!();
            }
        } else {
            eprintln!("Failed to retrieve the web page. Status code: {}", response.status());
        }
    
        Ok(())
    }

    Wrap Up

    In this step-by-step tutorial, we:

    1. Learned how to send requests and parse HTML
    2. Extracted data like image URLs and metadata
    3. Downloaded and saved images locally
    4. Stored extracted content in structured vectors

    We covered everything needed to build a robust web scraper with Rust!

    Some ideas for next steps:

  • Process and analyze extracted data
  • Write to files/databases for persistence
  • Automate scraping multiple pages
  • In more advanced implementations, rotate the User-Agent string so the website can't tell it's the same browser (see the sketch below)
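
    Here is a hedged sketch of that last idea: rotating the User-Agent by cycling through a fixed list. The strings and the cycling scheme are illustrative only:

    // A few illustrative browser identities to cycle through.
    const USER_AGENTS: &[&str] = &[
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    ];

    // Pick a different identity per request, e.g. keyed by a request counter.
    fn user_agent_for(request_index: usize) -> &'static str {
        USER_AGENTS[request_index % USER_AGENTS.len()]
    }

    // Usage: client.get(url).header("User-Agent", user_agent_for(i)).send()?;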

    Get a little more advanced, though, and you will find that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with the running offer of 1000 free API calls, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have solved the headache of IP blocks with this simple API.

    The whole thing can be accessed through a simple API call, from any programming language:

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
