Downloading Images from a Website with C++ and cpp-selector

Oct 15, 2023 · 5 min read

In this article, we will learn how to use C++ and the cpp-httplib and cpp-selector libraries to download all the images from a Wikipedia page.

---

Overview

The goal is to extract the names, breed groups, local names, and image URLs for every dog breed listed on the Wikimedia page List_of_dog_breeds. We will collect the image URLs, download each image, and save it to a local folder.

Here are the key steps we will cover:

  1. Include required headers
  2. Send HTTP request to fetch the Wikipedia page
  3. Parse the page HTML using cpp-selector
  4. Find the table with dog breed data using a CSS selector
  5. Iterate through the table rows
  6. Extract data from each column
  7. Download images and save locally
  8. Print/process extracted data

Let's go through each of these steps in detail.

Includes

We need these headers:

#include <httplib.h>
#include <selector/selector.h>
#include <fstream>
  • httplib - Sends HTTP requests
  • selector - Parses HTML/XML
  • fstream - File I/O

Send HTTP Request

To download the web page:

httplib::Client cli("commons.wikimedia.org");

auto res = cli.Get("/wiki/List_of_dog_breeds",
  {{"User-Agent", "cpp-httplib"}});

if (res && res->status == 200) {
  // Parse HTML
}

We make a GET request and set a custom User-Agent header. Checking that the request succeeded and returned status 200 avoids parsing an error page.

Parse HTML

To parse the HTML:

pugi::xml_document doc;
doc.load_string(res->body.c_str());

auto html = doc.child("html");

The pugi::xml_document holds the parsed HTML tree.

Find Breed Table

We use a CSS selector to find the table element:

auto table = html.select_node("table.wikitable.sortable").node();

This selects the <table> element that has both the wikitable and sortable CSS classes.

Iterate Through Rows

We loop through the rows:

for (auto& row : table.select_nodes("tr")) {
  // Extract data
}

We iterate through the <tr> elements within the table.

Extract Column Data

Inside the loop, we get the column data:

auto cells = row.select_nodes("td, th");

std::string name = cells[0].child("a").text().get();
std::string group = cells[1].text().get();

auto localNameNode = cells[2].select_node("span");
std::string localName = localNameNode.text().get();

auto img = cells[3].select_node("img");
std::string photograph = img.attribute("src").value();

We use the text() and attribute() accessors to extract the data; copying each result into a std::string detaches it from the document.

Download Images

To download and save the images:

if (!photograph.empty()) {

  auto img_data = cli.Get(photograph.c_str());

  if (img_data && img_data->status == 200) {
    std::ofstream file(std::string("dog_images/") + name + ".jpg", std::ios::binary);
    file << img_data->body;
  }

}

We reuse the HTTP client and write the image bytes to the dog_images folder, which must exist before the program runs. Checking the response before writing ensures a failed download does not leave an empty file behind.

Store Extracted Data

We store the extracted data:

names.push_back(name);
groups.push_back(group);
localNames.push_back(localName);
photographs.push_back(photograph);

The vectors can then be processed as needed.

And that's it! Here is the full code:

// Includes
#include <httplib.h>
#include <selector/selector.h>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

int main() {
  // Vectors to store data
  std::vector<std::string> names;
  std::vector<std::string> groups;
  std::vector<std::string> localNames;
  std::vector<std::string> photographs;

  // Make sure the output directory exists
  std::filesystem::create_directories("dog_images");

  // HTTP client
  httplib::Client cli("commons.wikimedia.org");

  // Send request
  auto res = cli.Get("/wiki/List_of_dog_breeds",
    {{"User-Agent", "cpp-httplib"}});

  if (res && res->status == 200) {

    // Parse HTML
    pugi::xml_document doc;
    doc.load_string(res->body.c_str());

    auto html = doc.child("html");

    // Find table
    auto table = html.select_node("table.wikitable.sortable").node();

    // Iterate rows
    for (auto& row : table.select_nodes("tr")) {

      // Get cells
      auto cells = row.select_nodes("td, th");

      // Extract data
      std::string name = cells[0].child("a").text().get();
      std::string group = cells[1].text().get();

      auto localNameNode = cells[2].select_node("span");
      std::string localName = localNameNode.text().get();

      auto img = cells[3].select_node("img");
      std::string photograph = img.attribute("src").value();

      // Download image
      if (!photograph.empty()) {
        auto img_data = cli.Get(photograph.c_str());
        if (img_data && img_data->status == 200) {
          std::ofstream file("dog_images/" + name + ".jpg", std::ios::binary);
          file << img_data->body;
        }
      }

      // Store data
      names.push_back(name);
      groups.push_back(group);
      localNames.push_back(localName);
      photographs.push_back(photograph);
    }
  }
}

This provides a complete C++ solution using cpp-httplib and cpp-selector to scrape data and images from HTML tables. The same approach applies to many other websites.

While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

With the power of Proxies API combined with libraries like cpp-httplib and cpp-selector, you can scrape data at scale without getting blocked.
