Scraping all the Images from a Website with Ruby

Dec 13, 2023 · 8 min read

Introduction

In this article, we will be scraping the "List of dog breeds" Wikipedia page to extract information and images of different dog breeds. Our end goal is to save all dog breed photos locally along with metadata like the breed name, breed group, and local breed name.

This is the page we are talking about.

To achieve this, we will send an HTTP request to download the raw HTML content of the Wikipedia page. We will then use the Nokogiri library in Ruby to parse the HTML and CSS selectors to extract the data we want from the structured content.

The full Ruby code to accomplish this web scraping is provided at the end for reference. We will walk through it section by section to understand the logic and mechanics behind each part.

Prerequisites

Before we dive into the code, let's outline the prerequisites needed to follow along:

Languages:

  • Ruby: We use Ruby as our main programming language here. Basic Ruby syntax will be helpful to understand what's going on.

Libraries:

  • open-uri: Provides easy access in Ruby to fetch remote resources over HTTP and HTTPS. We use this to send the GET request.
  • nokogiri: An XML/HTML parser for Ruby. We rely on Nokogiri's methods to parse and query the HTML content.
  • fileutils: Adds extra file utility methods for Ruby. We use it to create directories and write image files.
  • Installation: nokogiri can be installed via gem install {library_name}; open-uri and fileutils ship with Ruby's standard library, so no extra install is needed for them.

    For example:

    gem install nokogiri
    

    We also want to be in an environment with Ruby setup properly to run code. This could be through tools like rvm, rbenv, etc on your local machine or an online IDE.

    Sending the Request

    We start by defining the URL of the Wikipedia page we want to scrape:

    url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
    

    Next, we set up a user agent header string:

    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    

    This simulates a request coming from a Chrome browser on Windows. Websites tend to block scraping requests that lack a user agent header, so this helps avoid access issues.

    We then use Ruby's handy open-uri library to send a GET request to the URL. The user agent header is passed along so the website thinks this is coming from a real browser:

    html_content = URI.open(url, 'User-Agent' => user_agent).read
    

    The page HTML content is downloaded and saved into the html_content variable. Note that this snippet has no error handling; a real-world script should catch connectivity and server errors and retry failed requests if needed.
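A minimal retry wrapper around the request could look like the following sketch. The `with_retries` helper and its parameters are our own addition, not part of the original script:

```ruby
require 'open-uri'

# Hypothetical helper (our own addition): run a block, retrying on failure
# up to `attempts` times with a pause between tries.
def with_retries(attempts: 3, delay: 1)
  tries = 0
  begin
    yield
  rescue StandardError # in a real script, narrow this to network errors
    tries += 1
    raise if tries >= attempts
    sleep delay
    retry
  end
end

# Usage with the request from the article:
# html_content = with_retries { URI.open(url, 'User-Agent' => user_agent).read }
```

The broad `rescue StandardError` keeps the sketch short; in practice you would rescue only the network-related exceptions (`OpenURI::HTTPError`, `SocketError`, timeouts) so that programming errors still surface immediately.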

    Parsing the HTML

    Now that we've fetched the raw HTML of the Wikipedia page, we want to parse it so we can extract the data we want.

    This is where Nokogiri comes in. Nokogiri allows us to take the HTML and turn it into a parseable DOM structure.

    doc = Nokogiri::HTML(html_content)
    

    The doc variable now contains a structured Document Object Model (DOM) representation of the HTML.

    Inspecting the page

    You can see, when you use the Chrome inspect tool, that the data is in a table element with the classes wikitable and sortable.

    We can use Nokogiri's methods combined with CSS selectors to query elements just like we would in the browser console.

    For example, to find the main table element:

    table = doc.at('table.wikitable.sortable')
    

    Here we are looking for a <table> tag with the CSS classes wikitable and sortable. The .at() method returns just the first matching element.

    Extracting the Data

    Now that we've zoomed into the main table element, we can focus our attention on extracting the data from it.

    We loop through each <tr> row, skipping the header:

    table.search('tr')[1..-1].each do |row|
    
      # extraction logic
    
    end
    

    Inside this loop, we first gather the row's cells with columns = row.search('th, td'), then dig into each column for the data pieces we want.

    Breed name:

    name = columns[0].at('a').text.strip
    

    Breed group:

    group = columns[1].text.strip
    

    Local breed name:

    span_tag = columns[2].at('span')
    
    local_name = span_tag ? span_tag.text.strip : ''
    

    And most importantly, the image URL:

    img_tag = columns[3].at('img')
    
    photograph = img_tag ? img_tag['src'] : ''
    

    We check if image and span tags exist before extracting text as some rows lack this data.

    With the image URL, we can then download the photo and save it locally:

    if !photograph.empty?
      image_url = URI.join(url, photograph).to_s
      image_filename = File.join('dog_images', "#{name}.jpg")

      File.open(image_filename, 'wb') do |img_file|
        img_file.write(URI.open(image_url, 'User-Agent' => user_agent).read)
      end
    end

    In a production script, you would also want error handling around the image download in case issues come up.
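One way to add that protection is to wrap the download in a small helper so that one bad image is logged and skipped instead of aborting the whole run. The `save_breed_image` helper and its warning message are our own sketch, not part of the original script:

```ruby
require 'open-uri'
require 'uri'

# Hypothetical wrapper (our own addition) around the download step:
# returns true on success, false when the row has no photo or the
# download fails for any reason.
def save_breed_image(base_url, photograph, name, user_agent, dir: 'dog_images')
  return false if photograph.empty?

  image_url = URI.join(base_url, photograph).to_s
  image_filename = File.join(dir, "#{name}.jpg")

  File.open(image_filename, 'wb') do |img_file|
    img_file.write(URI.open(image_url, 'User-Agent' => user_agent).read)
  end
  true
rescue StandardError => e
  warn "Skipping image for #{name}: #{e.message}"
  false
end
```

Inside the row loop you would then call `save_breed_image(url, photograph, name, user_agent)` instead of the bare download block.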

    As we extract, all data gets stored into arrays to process later.

    Processing Results

    Now that we've parsed through the entire table and extracted the data, the arrays contain all the information we wanted about these dog breeds.

    We can iterate through and print it out:

    names.each_index do |i|
    
      puts "Name: #{names[i]}"
      puts "FCI Group: #{groups[i]}"
    
      # etc...
    
    end
    

    The data could also be saved to a database, exported to CSV, analyzed further etc.
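For example, exporting the arrays to CSV with Ruby's standard csv library might look like this sketch (the file name dog_breeds.csv and the sample rows are our own choices):

```ruby
require 'csv'

# Sample rows standing in for the arrays built during scraping.
names       = ['Affenpinscher']
groups      = ['Pinscher and Schnauzer type']
local_names = ['Affenpinscher']
photographs = ['//upload.wikimedia.org/example.jpg']

# Write a header row, then one row per breed.
CSV.open('dog_breeds.csv', 'w') do |csv|
  csv << ['Name', 'FCI Group', 'Local Name', 'Photograph']
  names.each_index do |i|
    csv << [names[i], groups[i], local_names[i], photographs[i]]
  end
end
```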

    Conclusion

    In this article, we walked through a full web scraping script to extract images and information on dog breeds from Wikipedia.

    We learned how to:

  • Send GET requests with simulated browser headers
  • Parse HTML using Nokogiri
  • Use XPath selectors to extract data
  • Handle edge cases and missing data
  • Download files from remote URLs
    Here is the full code again for reference:

    require 'open-uri'
    require 'nokogiri'
    require 'fileutils'
    require 'uri'
    
    # URL of the Wikipedia page
    url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
    
    # Define a user-agent header to simulate a browser request
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    
    # Send an HTTP GET request to the URL with the headers
    html_content = URI.open(url, 'User-Agent' => user_agent).read
    
    # Parse the HTML content of the page
    doc = Nokogiri::HTML(html_content)
    
    # Find the table with class 'wikitable sortable'
    table = doc.at('table.wikitable.sortable')
    
    # Initialize arrays to store the data
    names = []
    groups = []
    local_names = []
    photographs = []
    
    # Create a folder to save the images
    FileUtils.mkdir_p('dog_images')
    
    # Iterate through rows in the table (skip the header row)
    table.search('tr')[1..-1].each do |row|
      columns = row.search('th, td')
      if columns.length == 4
        # Extract data from each column
        name = columns[0].at('a').text.strip
        group = columns[1].text.strip
    
        # Check if the second column contains a span element
        span_tag = columns[2].at('span')
        local_name = span_tag ? span_tag.text.strip : ''
    
        # Check for the existence of an image tag within the fourth column
        img_tag = columns[3].at('img')
        photograph = img_tag ? img_tag['src'] : ''
    
        # Download the image and save it to the folder
        if !photograph.empty?
          image_url = URI.join(url, photograph).to_s
          image_filename = File.join('dog_images', "#{name}.jpg")
    
          File.open(image_filename, 'wb') do |img_file|
            img_file.write(URI.open(image_url, 'User-Agent' => user_agent).read)
          end
        end
    
        # Append data to respective arrays
        names << name
        groups << group
        local_names << local_name
        photographs << photograph
      end
    end
    
    # Print or process the extracted data as needed
    names.each_index do |i|
      puts "Name: #{names[i]}"
      puts "FCI Group: #{groups[i]}"
      puts "Local Name: #{local_names[i]}"
      puts "Photograph: #{photographs[i]}"
      puts
    end
    

    In more advanced implementations, you will need to rotate the User-Agent string so the website can't tell it's the same browser!
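A simple version of that idea is to keep a small pool of user-agent strings and pick one at random for each request. The pool below is illustrative; real rotation setups use many more, regularly updated strings:

```ruby
# Example pool of user-agent strings (the entries are illustrative).
USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0"
].freeze

# Pick a different-looking browser identity for each request.
def random_user_agent
  USER_AGENTS.sample
end

# Usage with the request from the article:
# html_content = URI.open(url, 'User-Agent' => random_user_agent).read
```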

    If you get a little more advanced, you will realize that the server can simply block your IP, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
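The same call can be made from Ruby with open-uri; URL-encoding the target page with CGI.escape is our own addition, and API_KEY is a placeholder for your real key:

```ruby
require 'open-uri'
require 'cgi'

api_key = 'API_KEY'             # placeholder: replace with your Proxies API key
target  = 'https://example.com' # the page you actually want to scrape
proxied = "http://api.proxiesapi.com/?key=#{api_key}&url=#{CGI.escape(target)}"

# html_content = URI.open(proxied).read
```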

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.

    Tired of getting blocked while scraping the web?

    ProxiesAPI handles headless browsers and rotates proxies for you.
    Get access to 1,000 free API credits, no credit card required!