How to Scrape All the Images from a Website with C++

Dec 13, 2023 · 11 min read

Scraping and downloading all images from a website is a handy technique for researchers, developers, and personal projects. Modern C++ libraries make this surprisingly easy, but a bit of HTML, CSS, and programming knowledge is needed to pull it off.

The page we are targeting is the Wikipedia list of dog breeds, a sortable table of breeds with photographs.

Overview of the C++ Image Scraping Code

While individual libraries and selectors differ based on the target site, the high-level structure usually follows this template:

  1. Initialize libraries like libcurl and libxml2 in C++.
  2. Send an HTTP request using libcurl to download the target page HTML.
  3. Parse the HTML content using libxml2's DOM traversal functions.
  4. Traverse the parsed DOM to identify and extract image URLs into a list.
  5. Iterate through the list of URLs to download each image file using libcurl with unique filenames.
  6. Handle issues like broken images and missing elements gracefully.

That covers the broad strokes, but the specifics require deeper understanding...

Initializing the C++ Image Scraping Tools

The first step sets up the libraries we'll leverage later for actually downloading and parsing content from the web:

// Initialize libcurl
CURL* curl = curl_easy_init();

// Initialize libxml2 (this happens later, once the page HTML is in response_data)
htmlDocPtr doc = htmlReadDoc(reinterpret_cast<const xmlChar*>(response_data.c_str()), NULL, NULL, HTML_PARSE_RECOVER | HTML_PARSE_NOERROR);

For those unfamiliar, libcurl handles the HTTP requests and responses while libxml2 processes HTML and XML documents so we can analyze and extract data using standard DOM traversal methods.

We set some options throughout the program to configure aspects like the user agent string and callbacks, but the initialization gets our foundation in place.

Sending Requests and Receiving the HTML Response

Next we use libcurl to actually send a GET request to the Wikipedia URL and store the full HTML content of the page in a string variable called response.

The key aspects that enable receiving and processing the entire raw HTML are:

// Set callback function to store response
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);

// Response data will be saved here
std::string response_data;
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response_data);

We define a custom callback function WriteCallback that receives the chunks of response data and accumulates them in our string.

With that foundation set up, we simply call curl_easy_perform to send the request and get back the full HTML response from Wikipedia to start our scraping workflow.

Parsing HTML with libxml2

Once the HTML content is fetched, it's parsed using libxml2:

htmlDocPtr doc = htmlReadDoc(reinterpret_cast<const xmlChar*>(response_data.c_str()), nullptr, nullptr, HTML_PARSE_RECOVER | HTML_PARSE_NOERROR);
  • htmlReadDoc parses the HTML content and returns a pointer to the parsed document (htmlDocPtr).
    Extracting Data from the Table

    Inspecting the page

    Using Chrome's inspect tool, you can see that the data lives in a table element with the classes wikitable and sortable.

    The extraction process involves finding the specific table and iterating through its rows and columns:

    // Depth-first search: walking only siblings (cur = cur->next) would never
    // descend from the root element, so we keep an explicit stack of nodes.
    xmlNodePtr table = nullptr;
    std::vector<xmlNodePtr> stack;
    stack.push_back(xmlDocGetRootElement(doc));
    
    while (!stack.empty() && table == nullptr) {
        xmlNodePtr cur = stack.back();
        stack.pop_back();
        if (cur->type == XML_ELEMENT_NODE && xmlStrcmp(cur->name, reinterpret_cast<const xmlChar*>("table")) == 0) {
            xmlChar* className = xmlGetProp(cur, reinterpret_cast<const xmlChar*>("class"));
            if (className && xmlStrcmp(className, reinterpret_cast<const xmlChar*>("wikitable sortable")) == 0) {
                table = cur;
            }
            if (className) xmlFree(className);
        }
        for (xmlNodePtr child = cur->children; child != nullptr; child = child->next) {
            stack.push_back(child);
        }
    }
    
  • The code walks the document's nodes to find a table element with the class 'wikitable sortable'.
  • xmlDocGetRootElement gets the root element of the document.
  • xmlNodePtr is a pointer to an XML node.
  • xmlGetProp fetches the value of a property (in this case, the class name of the table).
    Processing Rows and Columns

    Upon locating the table, the code processes each row (tr) and each column (td or th):

    for (xmlNodePtr row = xmlFirstElementChild(table); row != nullptr; row = xmlNextElementSibling(row)) {
        if (xmlStrcmp(row->name, reinterpret_cast<const xmlChar*>("tr")) == 0) {
            xmlNodePtr column = xmlFirstElementChild(row);
            std::string name, group, local_name, photograph;
            int columnIndex = 0;
            // ... Process each column
        }
    }
    
  • xmlFirstElementChild and xmlNextElementSibling are used to iterate over the rows of the table.
  • The inner loop processes each column, extracting text and, in the case of the image column, the src attribute of the img tag.
    Handling Image URLs

    In the image column, the code looks for an img tag and extracts the URL:

    if (imgTag && xmlStrcmp(imgTag->name, reinterpret_cast<const xmlChar*>("img")) == 0) {
        const xmlChar* imgSrc = xmlGetProp(imgTag, reinterpret_cast<const xmlChar*>("src"));
        if (imgSrc) {
            photograph = reinterpret_cast<const char*>(imgSrc);
            // ... Code to download the image
        }
    }
    
  • xmlGetProp gets the src attribute of the img tag, which is the URL of the image.
    Storing Extracted Data

    The extracted data (dog breed names, groups, local names, and image URLs) is stored in vectors:

    std::vector<std::string> names;
    std::vector<std::string> groups;
    std::vector<std::string> local_names;
    std::vector<std::string> photographs;
    // ... Append data to these vectors
    

    Each piece of data extracted from the columns is appended to these vectors for later use.

    Downloading Images with Error Handling

    The last major phase is to iterate through the extracted image URLs and actually download each file using libcurl. This part looks similar to our initial request but instead of getting HTML content back, we write the binary image data directly to a file.

    We also add checks for common issues:

    // Callback that appends downloaded bytes to an open file stream
    size_t WriteToFile(void* contents, size_t size, size_t nmemb, std::ofstream* file) {
        file->write(static_cast<const char*>(contents), size * nmemb);
        return size * nmemb;
    }
    
    // Download and save each image URL
    for (const std::string& url : downloads) {
    
        // Open a binary file stream with a unique filename per image
        std::ofstream img_file(filename, std::ios::binary);
    
        // Download the image into the file
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteToFile);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &img_file);
        CURLcode res = curl_easy_perform(curl);
    
        // Validate the result and file size to catch broken downloads
        long file_size = static_cast<long>(img_file.tellp());
        if (res != CURLE_OK || file_size < 500) {
            std::cerr << "Error downloading " << url << std::endl;
        }
    
        // Close this image file
        img_file.close();
    }
    

    We open unique image files to store each download locally, set the binary write callback to that file stream, kick off the download with curl_easy_perform, and finally check that a valid large file was written to catch any errors.

    The full code

    To tie the entire process together, here is a complete program that scrapes images from the Wikipedia list of dog breeds page:

    #include <iostream>
    #include <string>
    #include <fstream>
    #include <vector>
    #include <cerrno>        // errno for the mkdir check
    #include <sys/stat.h>    // mkdir (POSIX)
    #include <curl/curl.h>
    #include <libxml/HTMLparser.h>
    #include <libxml/xpath.h>
    
    // URL of the Wikipedia page
    const std::string url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds";
    
    // User-agent header to simulate a browser request
    const std::string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
    
    // Function to write response data to a string
    size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
        size_t totalSize = size * nmemb;
        output->append(static_cast<char*>(contents), totalSize);
        return totalSize;
    }
    
    int main() {
        // Initialize libcurl
        CURL* curl = curl_easy_init();
        if (!curl) {
            std::cerr << "Failed to initialize libcurl." << std::endl;
            return 1;
        }
    
        // Set libcurl options
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_USERAGENT, userAgent.c_str());
    
        // Response data will be stored here
        std::string response_data;
    
        // Set the write callback function to capture response data
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response_data);
    
        // Send an HTTP GET request
        CURLcode res = curl_easy_perform(curl);
    
        // Check if the request was successful (status code 200)
        if (res == CURLE_OK) {
            // Initialize libxml2
            htmlDocPtr doc = htmlReadDoc(reinterpret_cast<const xmlChar*>(response_data.c_str()), nullptr, nullptr, HTML_PARSE_RECOVER | HTML_PARSE_NOERROR);
            if (doc) {
                // Find the table with class 'wikitable sortable'. A depth-first
                // search is needed here: walking only siblings of the root
                // element would never descend into the page body.
                xmlNodePtr table = nullptr;
                std::vector<xmlNodePtr> stack;
                stack.push_back(xmlDocGetRootElement(doc));
    
                while (!stack.empty() && table == nullptr) {
                    xmlNodePtr cur = stack.back();
                    stack.pop_back();
                    if (cur->type == XML_ELEMENT_NODE && xmlStrcmp(cur->name, reinterpret_cast<const xmlChar*>("table")) == 0) {
                        xmlChar* className = xmlGetProp(cur, reinterpret_cast<const xmlChar*>("class"));
                        if (className && xmlStrcmp(className, reinterpret_cast<const xmlChar*>("wikitable sortable")) == 0) {
                            table = cur;
                        }
                        if (className) xmlFree(className);
                    }
                    for (xmlNodePtr child = cur->children; child != nullptr; child = child->next) {
                        stack.push_back(child);
                    }
                }
    
                if (table) {
                    // Initialize lists to store the data
                    std::vector<std::string> names;
                    std::vector<std::string> groups;
                    std::vector<std::string> local_names;
                    std::vector<std::string> photographs;
    
                    // Create a folder to save the images
                    std::string imageFolder = "dog_images";
                    if (mkdir(imageFolder.c_str(), 0755) != 0 && errno != EEXIST) {
                        std::cerr << "Failed to create the image folder." << std::endl;
                        return 1;
                    }
    
                    // Iterate through rows in the table (skip the header row)
                    for (xmlNodePtr row = xmlFirstElementChild(table); row != nullptr; row = xmlNextElementSibling(row)) {
                        if (xmlStrcmp(row->name, reinterpret_cast<const xmlChar*>("tr")) == 0) {
                            xmlNodePtr column = xmlFirstElementChild(row);
                            std::string name, group, local_name, photograph;
    
                            int columnIndex = 0;
                            while (column != nullptr) {
                                if (xmlStrcmp(column->name, reinterpret_cast<const xmlChar*>("td")) == 0 ||
                                    xmlStrcmp(column->name, reinterpret_cast<const xmlChar*>("th")) == 0) {
                                    xmlChar* content = xmlNodeGetContent(column);
                                    std::string columnText = content ? reinterpret_cast<const char*>(content) : "";
                                    if (content) xmlFree(content);  // xmlNodeGetContent allocates; free it
                                    // Trim trailing whitespace (cell text often ends with a newline)
                                    while (!columnText.empty() && (columnText.back() == '\n' || columnText.back() == '\r' || columnText.back() == ' ')) {
                                        columnText.pop_back();
                                    }
                                    switch (columnIndex) {
                                        case 0:
                                            name = columnText;
                                            break;
                                        case 1:
                                            group = columnText;
                                            break;
                                        case 2:
                                            local_name = columnText;
                                            break;
                                        case 3:
                                            {
                                                xmlNodePtr imgTag = xmlFirstElementChild(column);
                                                // The img is usually nested (e.g. inside a link), so
                                                // follow first children down until an img is reached
                                                while (imgTag && xmlStrcmp(imgTag->name, reinterpret_cast<const xmlChar*>("img")) != 0) {
                                                    imgTag = xmlFirstElementChild(imgTag);
                                                }
                                                if (imgTag) {
                                                    xmlChar* imgSrc = xmlGetProp(imgTag, reinterpret_cast<const xmlChar*>("src"));
                                                    if (imgSrc) {
                                                        photograph = reinterpret_cast<const char*>(imgSrc);
                                                        xmlFree(imgSrc);
    
                                                        // Wikipedia thumbnails use protocol-relative URLs ("//upload...")
                                                        if (photograph.rfind("//", 0) == 0) {
                                                            photograph = "https:" + photograph;
                                                        }
    
                                                        // Download the image into memory, then save it to the folder
                                                        CURL* img_curl = curl_easy_init();
                                                        if (img_curl) {
                                                            std::string image_data;
                                                            curl_easy_setopt(img_curl, CURLOPT_URL, photograph.c_str());
                                                            curl_easy_setopt(img_curl, CURLOPT_USERAGENT, userAgent.c_str());
                                                            curl_easy_setopt(img_curl, CURLOPT_WRITEFUNCTION, WriteCallback);
                                                            curl_easy_setopt(img_curl, CURLOPT_WRITEDATA, &image_data);
                                                            if (curl_easy_perform(img_curl) == CURLE_OK) {
                                                                std::string image_filename = imageFolder + "/" + name + ".jpg";
                                                                std::ofstream img_file(image_filename, std::ios::binary);
                                                                img_file.write(image_data.data(), static_cast<std::streamsize>(image_data.size()));
                                                            }
                                                            curl_easy_cleanup(img_curl);
                                                        }
                                                    }
                                                }
                                            }
                                            break;
                                    }
                                    columnIndex++;
                                }
                                column = xmlNextElementSibling(column);
                            }
    
                            // Append data to respective lists
                            names.push_back(name);
                            groups.push_back(group);
                            local_names.push_back(local_name);
                            photographs.push_back(photograph);
                        }
                    }
    
                    // Print or process the extracted data as needed
                    for (size_t i = 0; i < names.size(); i++) {
                        std::cout << "Name: " << names[i] << std::endl;
                        std::cout << "FCI Group: " << groups[i] << std::endl;
                        std::cout << "Local Name: " << local_names[i] << std::endl;
                        std::cout << "Photograph: " << photographs[i] << std::endl;
                        std::cout << std::endl;
                    }
                } else {
                    std::cerr << "Failed to find the table with class 'wikitable sortable'." << std::endl;
                }
    
                // Cleanup libxml2
                xmlFreeDoc(doc);
            } else {
                std::cerr << "Failed to parse the HTML document." << std::endl;
            }
        } else {
            std::cerr << "Failed to retrieve the web page. CURLcode: " << res << std::endl;
        }
    
        // Cleanup libcurl
        curl_easy_cleanup(curl);
    
        return 0;
    }
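To build the program, both libraries' headers and link flags are needed. Assuming a typical Linux setup where pkg-config knows both libraries (the package names libcurl and libxml-2.0 are common, but may vary by system), compilation looks roughly like:

```shell
# Compile, pulling include paths and link flags for both libraries
# from pkg-config (package names assumed: libcurl, libxml-2.0)
g++ -std=c++17 -o scraper scraper.cpp $(pkg-config --cflags --libs libcurl libxml-2.0)

# Run the scraper; images land in the dog_images/ folder
./scraper
```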

    In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same client!

    If you get a little more advanced, you will realize that the server can simply block your IP, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high-speed rotating proxies located all over the world,
  • with automatic IP rotation,
  • with automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions), and
  • with automatic CAPTCHA-solving technology,
  • hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed via a simple API, like the one below, from any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.

    Tired of getting blocked while scraping the web?

    ProxiesAPI handles headless browsers and rotates proxies for you.
    Get access to 1,000 free API credits, no credit card required!