Downloading Images from a Website with Ruby and Nokogiri

Oct 15, 2023 · 4 min read

In this article, we will learn how to use Ruby and the Nokogiri library to download all the images from a Wikipedia page.


Overview

The goal is to extract the name, breed group, local name, and image URL for every dog breed listed in the table on this Wikipedia page. We will collect the image URLs, download each image, and save it to a local folder.

Here are the key steps we will cover:

  1. Require libraries
  2. Send HTTP request to fetch the Wikipedia page
  3. Parse the page HTML using Nokogiri
  4. Find the table with dog breed data
  5. Iterate through the table rows
  6. Extract data from each column
  7. Download images and save locally
  8. Print/process extracted data

Let's go through each of these steps in detail.

Requires

We need these libraries:

    require 'nokogiri'
    require 'open-uri'

  • nokogiri - HTML/XML parser
  • open-uri - Sends HTTP requests

Send HTTP Request

To download the web page:

    url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
    
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    
    html = URI.open(url, 'User-Agent' => user_agent)
    

We pass a browser-like User-Agent header and use URI.open to fetch the page HTML. URI.open returns an IO-like object, which Nokogiri can read directly.
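
URI.open raises OpenURI::HTTPError when the server answers with an error status, so it can help to wrap the request. The following is a minimal sketch, not part of the original script: fetch_page and its opener parameter are hypothetical names, and opener exists only so the network call can be swapped out.

```ruby
require 'open-uri'

# Hypothetical helper: return the page body as a string, or nil on an HTTP error.
# `opener` defaults to URI.open; it is an assumption added for illustration so
# that the network call can be replaced (for example in a test).
def fetch_page(url, user_agent, opener: URI.method(:open))
  opener.call(url, { 'User-Agent' => user_agent }).read
rescue OpenURI::HTTPError => e
  warn "Request for #{url} failed: #{e.message}"
  nil
end
```

Calling fetch_page(url, user_agent) gives a string that can be passed to Nokogiri::HTML, or nil if the request failed.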

Parse HTML

To parse the HTML:

    doc = Nokogiri::HTML(html)
    

Nokogiri::HTML returns a document object that can be queried with CSS selectors or XPath expressions.

Find Breed Table

We can find the table using a CSS selector:

    table = doc.css('table.wikitable.sortable')
    

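This selects by CSS class. doc.css returns a NodeSet of every table carrying both the wikitable and sortable classes, so the rows of all matching tables are processed; use doc.at_css instead to work with only the first match.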

Iterate Through Rows

We loop through the rows:

    table.css('tr').drop(1).each do |row|
    
      # Extract data
    
    end
    

We drop the first row, which contains the headers.

Extract Column Data

Inside the loop, we extract the column data:

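    cells = row.css('td, th')
    next if cells.size < 4  # skip rows that lack the four expected columns

    # Prefer the link text for the name; fall back to the cell's plain text
    name_node = cells[0].at_css('a') || cells[0]
    name = name_node.text.strip
    group = cells[1].text.strip

    local_name_node = cells[2].at_css('span')
    local_name = local_name_node.text.strip if local_name_node

    img_node = cells[3].at_css('img')
    photograph = img_node['src'] if img_node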
    

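We use text to read an element's text content and [] to read an attribute such as src. at_css returns nil when nothing matches, which is why the optional fields are guarded with if.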

Download Images

To download and save images:

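    if photograph

      # Resolve protocol-relative or relative src values against the page URL
      image_url = URI.join(url, photograph).to_s
      image = URI.open(image_url, 'User-Agent' => user_agent)

      # Create the output folder if needed; replace path separators in the name
      Dir.mkdir('dog_images') unless Dir.exist?('dog_images')
      file_name = name.gsub(%r{[/\\]}, '_')

      File.open("dog_images/#{file_name}.jpg", 'wb') do |file|
        file << image.read
      end

    end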
    

We reuse the user agent and write the image bytes to a file opened in binary mode ('wb'). The dog_images folder must exist before the file is opened.
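
Two details are worth handling when saving images: an img src is often protocol-relative (it starts with //), and a breed name may contain characters that are not valid in a file name. The sketch below shows one way to deal with both using only the standard library. The helper names absolute_image_url and image_file_name are hypothetical, and the image path is made up.

```ruby
require 'uri'

# Hypothetical helper: turn an <img> src into an absolute URL
# by resolving it against the URL of the page it came from.
def absolute_image_url(page_url, src)
  URI.join(page_url, src).to_s
end

# Hypothetical helper: build a local file name from a breed name and image URL.
# Path separators are replaced, and the extension is taken from the URL path,
# falling back to .jpg when the path has none.
def image_file_name(name, image_url)
  ext = File.extname(URI.parse(image_url).path)
  ext = '.jpg' if ext.empty?
  "#{name.gsub(%r{[/\\]}, '_')}#{ext.downcase}"
end

page = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
src  = '//upload.wikimedia.org/example/Akita.PNG' # made-up path

image_url = absolute_image_url(page, src)
file_name = image_file_name('Akita / Akita Inu', image_url)

puts image_url
puts file_name
```

Taking the extension from the URL keeps PNG or SVG files from being saved with a misleading .jpg suffix.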

Store Extracted Data

We store the extracted data:

    names << name
    groups << group
    local_names << local_name
    photographs << photograph
    

The arrays can then be processed as needed.
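
As one example of such processing, the four parallel arrays can be zipped into a single list of records. This is a sketch with made-up sample values standing in for the scraped data; the breeds variable and its hash keys are illustrative choices.

```ruby
# Sample values standing in for the arrays filled by the scraper (made up).
names       = ['Akita', 'Beagle']
groups      = ['Spitz', 'Hound']
local_names = ['Akita Inu', nil]
photographs = ['//upload.wikimedia.org/example/Akita.jpg', nil]

# Zip the parallel arrays into one hash per breed.
breeds = names.zip(groups, local_names, photographs).map do |name, group, local_name, photo|
  { name: name, group: group, local_name: local_name, photograph: photo }
end

# Count the breeds for which an image URL was found.
with_photo = breeds.count { |b| b[:photograph] }

breeds.each do |b|
  puts "#{b[:name]} (#{b[:group]}) - local name: #{b[:local_name] || 'n/a'}"
end
puts "#{with_photo} of #{breeds.size} breeds have an image URL"
```

Keeping one hash per breed makes it harder for the columns to drift out of step than with four separate arrays.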

And that's it! Here is the full code:

    # Full code

    require 'nokogiri'
    require 'open-uri'

    # Arrays to store data
    names = []
    groups = []
    local_names = []
    photographs = []

    url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'

    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'

    html = URI.open(url, 'User-Agent' => user_agent)

    doc = Nokogiri::HTML(html)

    table = doc.css('table.wikitable.sortable')

    # Create the output folder if it does not exist yet
    Dir.mkdir('dog_images') unless Dir.exist?('dog_images')

    table.css('tr').drop(1).each do |row|

      cells = row.css('td, th')
      next if cells.size < 4  # skip rows that lack the four expected columns

      # Prefer the link text for the name; fall back to the cell's plain text
      name_node = cells[0].at_css('a') || cells[0]
      name = name_node.text.strip
      group = cells[1].text.strip

      local_name_node = cells[2].at_css('span')
      local_name = local_name_node.text.strip if local_name_node

      img_node = cells[3].at_css('img')
      photograph = img_node['src'] if img_node

      if photograph

        # Resolve protocol-relative or relative src values against the page URL
        image_url = URI.join(url, photograph).to_s
        image = URI.open(image_url, 'User-Agent' => user_agent)

        # Replace path separators so the breed name is a valid file name
        file_name = name.gsub(%r{[/\\]}, '_')

        File.open("dog_images/#{file_name}.jpg", 'wb') do |file|
          file << image.read
        end

      end

      names << name
      groups << group
      local_names << local_name
      photographs << photograph

    end

This provides a complete Ruby solution using Nokogiri to scrape data and images from HTML tables. The same approach can apply to many websites.

While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

With Proxies API combined with Ruby libraries like Nokogiri, you can scrape data at scale without getting blocked.
