How to Scrape All the Images from a Website with C++

Dec 13, 2023 · 11 min read

Scraping and downloading all images from a website is a handy technique for researchers, developers, and personal projects. Modern C++ libraries make this surprisingly easy, but a bit of HTML, CSS, and programming knowledge is needed to pull it off.

The page we are targeting is the Wikipedia list of dog breeds, a sortable table of breeds with photographs.

Overview of the C++ Image Scraping Code

While individual libraries and selectors differ based on the target site, the high-level structure usually follows this template:

  1. Initialize libraries like libcurl and libxml2 in C++.
  2. Send an HTTP request using libcurl to download the target page HTML.
  3. Parse the HTML content using libxml2's DOM traversal functions.
  4. Traverse the parsed DOM to identify and extract image URLs into a list.
  5. Iterate through the list of URLs to download each image file using libcurl with unique filenames.
  6. Handle issues like broken images and missing elements gracefully.

That covers the broad strokes, but the specifics require deeper understanding...

Initializing the C++ Image Scraping Tools

The first step sets up the libraries we'll leverage later for actually downloading and parsing content from the web:

// Initialize libcurl
CURL* curl = curl_easy_init();

// Initialize libxml2 (this happens later, once the page HTML is in response_data)
htmlDocPtr doc = htmlReadDoc(reinterpret_cast<const xmlChar*>(response_data.c_str()), NULL, NULL, HTML_PARSE_RECOVER | HTML_PARSE_NOERROR);

For those unfamiliar, libcurl handles the HTTP requests and responses while libxml2 processes HTML and XML documents so we can analyze and extract data using standard DOM traversal methods.

We set some options throughout the program to configure aspects like the user agent string and callbacks, but the initialization gets our foundation in place.

Sending Requests and Receiving the HTML Response

Next we use libcurl to actually send a GET request to the Wikipedia URL and store the full HTML content of the page in a string variable called response.

The key aspects that enable receiving and processing the entire raw HTML are:

// Set callback function to store response
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);

// Response data will be saved here
std::string response_data;
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response_data);

We define a custom callback function WriteCallback that receives the chunks of response data and accumulates them in our string.

With that foundation set up, we simply call curl_easy_perform to send the request and get back the full HTML response from Wikipedia to start our scraping workflow.

Parsing HTML with libxml2

Once the HTML content is fetched, it's parsed using libxml2:

htmlDocPtr doc = htmlReadDoc(reinterpret_cast<const xmlChar*>(response_data.c_str()), nullptr, nullptr, HTML_PARSE_RECOVER | HTML_PARSE_NOERROR);
  • htmlReadDoc parses the HTML content and returns a pointer to the parsed document (htmlDocPtr).
    Extracting Data from the Table

    Inspecting the page

    Using Chrome's inspect tool, you can see that the data lives in a table element with the classes wikitable and sortable.

    The extraction process involves finding the specific table and iterating through its rows and columns:

    // Depth-first search: walking only siblings (cur = cur->next) would never
    // descend from the root element, so we keep an explicit stack of nodes.
    xmlNodePtr table = nullptr;
    std::vector<xmlNodePtr> stack;
    stack.push_back(xmlDocGetRootElement(doc));
    
    while (!stack.empty() && table == nullptr) {
        xmlNodePtr cur = stack.back();
        stack.pop_back();
        if (cur->type == XML_ELEMENT_NODE && xmlStrcmp(cur->name, reinterpret_cast<const xmlChar*>("table")) == 0) {
            xmlChar* className = xmlGetProp(cur, reinterpret_cast<const xmlChar*>("class"));
            if (className && xmlStrcmp(className, reinterpret_cast<const xmlChar*>("wikitable sortable")) == 0) {
                table = cur;
            }
            if (className) xmlFree(className);
        }
        for (xmlNodePtr child = cur->children; child != nullptr; child = child->next) {
            stack.push_back(child);
        }
    }
    
  • The code walks the document's nodes to find a table element with the class 'wikitable sortable'.
  • xmlDocGetRootElement gets the root element of the document.
  • xmlNodePtr is a pointer to an XML node.
  • xmlGetProp fetches the value of a property (in this case, the class name of the table).
    Processing Rows and Columns

    Upon locating the table, the code processes each row (tr) and each column (td or th):

    for (xmlNodePtr row = xmlFirstElementChild(table); row != nullptr; row = xmlNextElementSibling(row)) {
        if (xmlStrcmp(row->name, reinterpret_cast<const xmlChar*>("tr")) == 0) {
            xmlNodePtr column = xmlFirstElementChild(row);
            std::string name, group, local_name, photograph;
            int columnIndex = 0;
            // ... Process each column
        }
    }
    
  • xmlFirstElementChild and xmlNextElementSibling are used to iterate over the rows of the table.
  • The inner loop processes each column, extracting text and, in the case of the image column, the src attribute of the img tag.
    Handling Image URLs

    In the image column, the code looks for an img tag and extracts the URL:

    if (imgTag && xmlStrcmp(imgTag->name, reinterpret_cast<const xmlChar*>("img")) == 0) {
        const xmlChar* imgSrc = xmlGetProp(imgTag, reinterpret_cast<const xmlChar*>("src"));
        if (imgSrc) {
            photograph = reinterpret_cast<const char*>(imgSrc);
            // ... Code to download the image
        }
    }
    
  • xmlGetProp gets the src attribute of the img tag, which is the URL of the image.
    Storing Extracted Data

    The extracted data (dog breed names, groups, local names, and image URLs) is stored in vectors:

    std::vector<std::string> names;
    std::vector<std::string> groups;
    std::vector<std::string> local_names;
    std::vector<std::string> photographs;
    // ... Append data to these vectors
    

    Each piece of data extracted from the columns is appended to these vectors for later use.

    Downloading Images with Error Handling

    The last major phase is to iterate through the extracted image URLs and actually download each file using libcurl. This part looks similar to our initial request but instead of getting HTML content back, we write the binary image data directly to a file.

    We also add checks for common issues:

    // Callback that appends downloaded bytes to an open file stream
    size_t WriteToFile(void* contents, size_t size, size_t nmemb, std::ofstream* file) {
        file->write(static_cast<const char*>(contents), size * nmemb);
        return size * nmemb;
    }
    
    // Download and save each image URL
    for (const std::string& url : downloads) {
    
        // Open a binary file stream with a unique filename per image
        std::ofstream img_file(filename, std::ios::binary);
    
        // Download the image into the file
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteToFile);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &img_file);
        CURLcode res = curl_easy_perform(curl);
    
        // Validate the result and file size to catch broken downloads
        long file_size = static_cast<long>(img_file.tellp());
        if (res != CURLE_OK || file_size < 500) {
            std::cerr << "Error downloading " << url << std::endl;
        }
    
        // Close this image file
        img_file.close();
    }
    

    We open unique image files to store each download locally, set the binary write callback to that file stream, kick off the download with curl_easy_perform, and finally check that a valid large file was written to catch any errors.

    The full code

    To tie the entire process together, here is a complete program that scrapes images from the Wikipedia list of dog breeds page:

    #include <iostream>
    #include <string>
    #include <fstream>
    #include <vector>
    #include <cerrno>        // errno for the mkdir check
    #include <sys/stat.h>    // mkdir (POSIX)
    #include <curl/curl.h>
    #include <libxml/HTMLparser.h>
    #include <libxml/xpath.h>
    
    // URL of the Wikipedia page
    const std::string url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds";
    
    // User-agent header to simulate a browser request
    const std::string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
    
    // Function to write response data to a string
    size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
        size_t totalSize = size * nmemb;
        output->append(static_cast<char*>(contents), totalSize);
        return totalSize;
    }
    
    int main() {
        // Initialize libcurl
        CURL* curl = curl_easy_init();
        if (!curl) {
            std::cerr << "Failed to initialize libcurl." << std::endl;
            return 1;
        }
    
        // Set libcurl options
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_USERAGENT, userAgent.c_str());
    
        // Response data will be stored here
        std::string response_data;
    
        // Set the write callback function to capture response data
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response_data);
    
        // Send an HTTP GET request
        CURLcode res = curl_easy_perform(curl);
    
        // Check if the request was successful (status code 200)
        if (res == CURLE_OK) {
            // Initialize libxml2
            htmlDocPtr doc = htmlReadDoc(reinterpret_cast<const xmlChar*>(response_data.c_str()), nullptr, nullptr, HTML_PARSE_RECOVER | HTML_PARSE_NOERROR);
            if (doc) {
                // Find the table with class 'wikitable sortable'. A depth-first
                // search is needed here: walking only siblings of the root
                // element would never descend into the page body.
                xmlNodePtr table = nullptr;
                std::vector<xmlNodePtr> stack;
                stack.push_back(xmlDocGetRootElement(doc));
    
                while (!stack.empty() && table == nullptr) {
                    xmlNodePtr cur = stack.back();
                    stack.pop_back();
                    if (cur->type == XML_ELEMENT_NODE && xmlStrcmp(cur->name, reinterpret_cast<const xmlChar*>("table")) == 0) {
                        xmlChar* className = xmlGetProp(cur, reinterpret_cast<const xmlChar*>("class"));
                        if (className && xmlStrcmp(className, reinterpret_cast<const xmlChar*>("wikitable sortable")) == 0) {
                            table = cur;
                        }
                        if (className) xmlFree(className);
                    }
                    for (xmlNodePtr child = cur->children; child != nullptr; child = child->next) {
                        stack.push_back(child);
                    }
                }
    
                if (table) {
                    // Initialize lists to store the data
                    std::vector<std::string> names;
                    std::vector<std::string> groups;
                    std::vector<std::string> local_names;
                    std::vector<std::string> photographs;
    
                    // Create a folder to save the images
                    std::string imageFolder = "dog_images";
                    if (mkdir(imageFolder.c_str(), 0755) != 0 && errno != EEXIST) {
                        std::cerr << "Failed to create the image folder." << std::endl;
                        return 1;
                    }
    
                    // Iterate through rows in the table (skip the header row)
                    for (xmlNodePtr row = xmlFirstElementChild(table); row != nullptr; row = xmlNextElementSibling(row)) {
                        if (xmlStrcmp(row->name, reinterpret_cast<const xmlChar*>("tr")) == 0) {
                            xmlNodePtr column = xmlFirstElementChild(row);
                            std::string name, group, local_name, photograph;
    
                            int columnIndex = 0;
                            while (column != nullptr) {
                                if (xmlStrcmp(column->name, reinterpret_cast<const xmlChar*>("td")) == 0 ||
                                    xmlStrcmp(column->name, reinterpret_cast<const xmlChar*>("th")) == 0) {
                                    xmlChar* content = xmlNodeGetContent(column);
                                    std::string columnText = content ? reinterpret_cast<const char*>(content) : "";
                                    if (content) xmlFree(content);  // xmlNodeGetContent allocates; free it
                                    // Trim trailing whitespace (cell text often ends with a newline)
                                    while (!columnText.empty() && (columnText.back() == '\n' || columnText.back() == '\r' || columnText.back() == ' ')) {
                                        columnText.pop_back();
                                    }
                                    switch (columnIndex) {
                                        case 0:
                                            name = columnText;
                                            break;
                                        case 1:
                                            group = columnText;
                                            break;
                                        case 2:
                                            local_name = columnText;
                                            break;
                                        case 3:
                                            {
                                                xmlNodePtr imgTag = xmlFirstElementChild(column);
                                                // The img is usually nested (e.g. inside a link), so
                                                // follow first children down until an img is reached
                                                while (imgTag && xmlStrcmp(imgTag->name, reinterpret_cast<const xmlChar*>("img")) != 0) {
                                                    imgTag = xmlFirstElementChild(imgTag);
                                                }
                                                if (imgTag) {
                                                    xmlChar* imgSrc = xmlGetProp(imgTag, reinterpret_cast<const xmlChar*>("src"));
                                                    if (imgSrc) {
                                                        photograph = reinterpret_cast<const char*>(imgSrc);
                                                        xmlFree(imgSrc);
    
                                                        // Wikipedia thumbnails use protocol-relative URLs ("//upload...")
                                                        if (photograph.rfind("//", 0) == 0) {
                                                            photograph = "https:" + photograph;
                                                        }
    
                                                        // Download the image into memory, then save it to the folder
                                                        CURL* img_curl = curl_easy_init();
                                                        if (img_curl) {
                                                            std::string image_data;
                                                            curl_easy_setopt(img_curl, CURLOPT_URL, photograph.c_str());
                                                            curl_easy_setopt(img_curl, CURLOPT_USERAGENT, userAgent.c_str());
                                                            curl_easy_setopt(img_curl, CURLOPT_WRITEFUNCTION, WriteCallback);
                                                            curl_easy_setopt(img_curl, CURLOPT_WRITEDATA, &image_data);
                                                            if (curl_easy_perform(img_curl) == CURLE_OK) {
                                                                std::string image_filename = imageFolder + "/" + name + ".jpg";
                                                                std::ofstream img_file(image_filename, std::ios::binary);
                                                                img_file.write(image_data.data(), static_cast<std::streamsize>(image_data.size()));
                                                            }
                                                            curl_easy_cleanup(img_curl);
                                                        }
                                                    }
                                                }
                                            }
                                            break;
                                    }
                                    columnIndex++;
                                }
                                column = xmlNextElementSibling(column);
                            }
    
                            // Append data to respective lists
                            names.push_back(name);
                            groups.push_back(group);
                            local_names.push_back(local_name);
                            photographs.push_back(photograph);
                        }
                    }
    
                    // Print or process the extracted data as needed
                    for (size_t i = 0; i < names.size(); i++) {
                        std::cout << "Name: " << names[i] << std::endl;
                        std::cout << "FCI Group: " << groups[i] << std::endl;
                        std::cout << "Local Name: " << local_names[i] << std::endl;
                        std::cout << "Photograph: " << photographs[i] << std::endl;
                        std::cout << std::endl;
                    }
                } else {
                    std::cerr << "Failed to find the table with class 'wikitable sortable'." << std::endl;
                }
    
                // Cleanup libxml2
                xmlFreeDoc(doc);
            } else {
                std::cerr << "Failed to parse the HTML document." << std::endl;
            }
        } else {
            std::cerr << "Failed to retrieve the web page. CURLcode: " << res << std::endl;
        }
    
        // Cleanup libcurl
        curl_easy_cleanup(curl);
    
        return 0;
    }
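To build the program, both libraries' headers and link flags are needed. Assuming a typical Linux setup where pkg-config knows both libraries (the package names libcurl and libxml-2.0 are common, but may vary by system), compilation looks roughly like:

```shell
# Compile, pulling include paths and link flags for both libraries
# from pkg-config (package names assumed: libcurl, libxml-2.0)
g++ -std=c++17 -o scraper scraper.cpp $(pkg-config --cflags --libs libcurl libxml-2.0)

# Run the scraper; images land in the dog_images/ folder
./scraper
```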

    In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same client!

    If you get a little more advanced, you will realize that the server can simply block your IP, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high-speed rotating proxies located all over the world,
  • with automatic IP rotation,
  • with automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions), and
  • with automatic CAPTCHA-solving technology,
  • hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed via a simple API, like the one below, from any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.

    Tired of getting blocked while scraping the web?

    ProxiesAPI handles headless browsers and rotates proxies for you.
    Get access to 1,000 free API credits, no credit card required!