Scraping Hacker News with C++

Hacker News is a popular social news website focused on computer science and entrepreneurship topics. It features user-submitted links and discussions, akin to a programming-focused Reddit. In this beginner's guide, we will walk through C++ code that scrapes articles from the Hacker News homepage using libcurl and Gumbo.

Prerequisites

To follow along, you'll need:

libcurl

Gumbo parser

Install libcurl and Gumbo with apt:

apt install libcurl4-openssl-dev libgumbo-dev

And include the necessary headers in your C++ code:

#include <curl/curl.h>
#include <gumbo.h>

Overview

The goal of our script is to scrape information from the articles shown on the Hacker News homepage, including:

Title

URL

Points

Author

Timestamp

Number of Comments

To achieve this, we will:

Send a GET request to retrieve the page HTML
Parse the HTML content using Gumbo
Extract information by selecting elements
Print out the scraped data

Let's take a look section-by-section!

Initialize libcurl

We start by initializing libcurl which we'll use to send the HTTP requests:

CURL* curl = curl_easy_init();
if (!curl) {
    // Error handling
}

Define URL and Send Request

Next we set the URL to scrape - the Hacker News homepage:

std::string url = "https://news.ycombinator.com/";

curl_easy_setopt(curl, CURLOPT_URL, url.c_str());

We also attach a write callback, WriteCallback (defined in the full code at the end), which appends each chunk of the response to a string variable that will hold the page HTML:

std::string response_data;

curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response_data);
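libcurl hands over the response body in chunks and calls the write callback once per chunk, so the callback has to append rather than overwrite. Here is a minimal sketch of such a callback. The signature follows what libcurl documents for CURLOPT_WRITEFUNCTION, and we assume the pointer passed through CURLOPT_WRITEDATA is a std::string:

```cpp
#include <cstddef>
#include <string>

// Write callback in the shape libcurl expects for CURLOPT_WRITEFUNCTION.
// libcurl may call it several times per response, once per chunk received.
// Assumption: userdata is the std::string* we registered via CURLOPT_WRITEDATA.
size_t WriteCallback(char* contents, size_t size, size_t nmemb, void* userdata) {
    const size_t total_size = size * nmemb;  // number of bytes in this chunk
    auto* output = static_cast<std::string*>(userdata);
    output->append(contents, total_size);
    // Returning a different count than total_size makes libcurl abort the transfer.
    return total_size;
}
```

Because the function only depends on the standard library, you can call it by hand with a couple of buffers to check that the chunks end up concatenated in order.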

Finally, we kick off the request and handle any errors:

CURLcode res = curl_easy_perform(curl);

if (res != CURLE_OK) {
    // Error handling
}

Now response_data contains the full HTML of the Hacker News homepage.

Parse HTML with Gumbo

The next step is to parse the HTML content using Gumbo, an HTML5 parsing library.

We initialize Gumbo, passing in the page HTML, which gives us a parsed DOM tree to query:

GumboOutput* output = gumbo_parse(response_data.c_str());
GumboNode* root = output->root;

root now contains the root node of the parsed DOM.

Find Rows and Iterate Over Articles

Inspecting the page, you can see that each item is housed inside a <tr> tag with the class athing.

So articles are arranged in table rows. Gumbo has no selector engine, and root->v.element.children only holds the direct children of <html> (<head> and <body>), so we collect every <tr> in the document with a small recursive helper of our own, here called collect_rows, and store the rows in a vector:

std::vector<GumboNode*> rows;
collect_rows(root, rows);

We can now iterate over the rows, identifying article rows by the "athing" class. A class attribute can list several class names, so has_class (another helper of our own) looks for "athing" as one whitespace-separated token instead of comparing the whole attribute value:

GumboNode* current_article = nullptr;

for (GumboNode* row : rows) {

    if (has_class(row, "athing")) {

        // This is an article row; its details are in the row that follows
        current_article = row;

    }
}

This allows us to process each article's data.
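Since Gumbo gives us a tree but no selectors, class lookups come down to a recursive walk plus a token check on the class attribute. The sketch below shows that idea on a hypothetical stand-in Node type (not a Gumbo type), so it compiles with the standard library alone; the names Node, HasClassToken and FindFirstByClass are ours:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed HTML element (not part of Gumbo).
struct Node {
    std::string tag;
    std::string class_attr;      // raw value of the class attribute
    std::vector<Node> children;
};

// True if class_attr contains token as one whitespace-separated class name.
bool HasClassToken(const std::string& class_attr, const std::string& token) {
    std::istringstream stream(class_attr);
    std::string current;
    while (stream >> current) {
        if (current == token) return true;
    }
    return false;
}

// Depth-first search in document order for the first node (the start node
// included) that carries the given class. Returns nullptr if none matches.
const Node* FindFirstByClass(const Node& node, const std::string& token) {
    if (HasClassToken(node.class_attr, token)) return &node;
    for (const Node& child : node.children) {
        if (const Node* found = FindFirstByClass(child, token)) return found;
    }
    return nullptr;
}
```

The same shape carries over to Gumbo: replace Node with GumboNode, read the class through gumbo_get_attribute, and loop over v.element.children. A token check is safer than an exact string comparison because an element with two classes would otherwise be missed.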

Extracting Article Data

With an article row selected, we can now extract information from the page elements. This is where most beginners struggle, so we'll go through each field one-by-one:

Title

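Gumbo itself does not ship selector functions, so gumbo_get_element_by_class, gumbo_get_element_by_tag, gumbo_get_text and gumbo_get_element_by_text in the snippets below are small helpers of our own that walk the parse tree; only gumbo_get_attribute comes from the library.

We get the title with a class lookup within the article row. On the current page layout the rank cell also carries the "title" class, so we look for the "titleline" element that wraps the link (an assumption worth confirming in your browser's inspector):

GumboNode* title_elem = gumbo_get_element_by_class(current_article, "titleline");

Within that, we find the anchor tag which holds the text:

GumboNode* anchor_elem = gumbo_get_element_by_tag(title_elem, GUMBO_TAG_A);

And finally, we read the title text from the anchor's text nodes:

std::string article_title = gumbo_get_text(anchor_elem);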

URL

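The article URL is stored in the anchor's href attribute. gumbo_get_attribute returns NULL when the attribute is missing, so we check before reading the value:

GumboAttribute* href = gumbo_get_attribute(&anchor_elem->v.element.attributes, "href");
std::string article_url = href ? href->value : "";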
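Not every href on the page is absolute. Links that point back to Hacker News itself (for example a discussion page such as item?id=1, used here purely as an illustration) are relative and need the site prefix before they can be fetched. A simplified sketch of that step follows; ResolveUrl is a hypothetical helper, handles only http and https bases, and is not a full RFC 3986 resolver:

```cpp
#include <string>

// Simplified resolution of href against an absolute http(s) base URL.
// Covers absolute hrefs, root-relative hrefs ("/x") and plain relative hrefs ("x").
// Not handled: "..", "./", protocol-relative ("//host") and fragment-only hrefs.
std::string ResolveUrl(const std::string& base, const std::string& href) {
    if (href.rfind("http://", 0) == 0 || href.rfind("https://", 0) == 0) {
        return href;  // already absolute
    }
    // Drop any query string or fragment from the base before looking at its path.
    const std::string base_path = base.substr(0, base.find_first_of("?#"));
    const std::size_t scheme_end = base_path.find("://");
    if (scheme_end == std::string::npos) return href;  // base is not absolute
    const std::size_t path_start = base_path.find('/', scheme_end + 3);
    const std::string origin = base_path.substr(0, path_start);  // scheme + host
    if (!href.empty() && href[0] == '/') return origin + href;
    if (path_start == std::string::npos) return origin + "/" + href;
    // Relative to the "directory" of the base path.
    return base_path.substr(0, base_path.rfind('/') + 1) + href;
}
```

Calling ResolveUrl(url, article_url) with the homepage URL as the base leaves external links untouched and turns site-relative ones into full URLs.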

Points

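The points live in the details row that follows the article row, inside the cell with class "subtext". Within that, we look up the element with class "score" (assumed from the page markup) instead of relying on child positions, which shift when whitespace nodes are present. Here row is the details row:

GumboNode* subtext = gumbo_get_element_by_class(row, "subtext");
std::string points = gumbo_get_text(gumbo_get_element_by_class(subtext, "score"));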
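The points arrive as display text such as "123 points", and the comment link reads like "45 comments". If you want numbers to sort or filter on, the leading digits have to be parsed out. A minimal sketch, where ParseLeadingInt is a hypothetical helper name and overflow on very long digit runs is not handled:

```cpp
#include <cctype>
#include <string>

// Parses the run of decimal digits at the start of text, after optional
// leading whitespace. Returns fallback when the text does not start with a digit.
int ParseLeadingInt(const std::string& text, int fallback = 0) {
    std::size_t pos = 0;
    while (pos < text.size() && std::isspace(static_cast<unsigned char>(text[pos]))) {
        ++pos;
    }
    if (pos == text.size() || !std::isdigit(static_cast<unsigned char>(text[pos]))) {
        return fallback;
    }
    int value = 0;
    while (pos < text.size() && std::isdigit(static_cast<unsigned char>(text[pos]))) {
        value = value * 10 + (text[pos] - '0');
        ++pos;
    }
    return value;
}
```

Parsing stops at the first non-digit byte, so it does not matter whether the number is followed by a plain space or a non-breaking space.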

Author

The author element has class "hnuser", nested under "subtext":

std::string author = gumbo_get_text(gumbo_get_element_by_class(subtext, "hnuser"));

Timestamp

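The timestamp is stored in the title attribute of the element with class "age" (again assumed from the page markup):

GumboNode* age = gumbo_get_element_by_class(subtext, "age");
std::string timestamp = age ? gumbo_get_attribute(&age->v.element.attributes, "title")->value : "";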

Comments

For comments, we find the link whose text contains "comment", which matches both "1 comment" and "N comments":

GumboNode* comments_elem = gumbo_get_element_by_text(subtext, "comment");

And extract the text, falling back to "0" when there is no such link:

std::string comments = comments_elem ? gumbo_get_text(comments_elem) : "0";

The key things to understand are:

Use selectors like classes, tags and attributes to target elements

Extract text, attribute values or sub-elements once you have a match

Nest selectors to narrow down elements

With data extracted, we can now print it out!

Print Extracted Data

The last step is to print the scraped content:

std::cout << "Title: " << article_title << std::endl;
std::cout << "URL: " << article_url << std::endl;
// etc

And with that, we have successfully scraped the articles from Hacker News!

Cleanup and Conclusion

We finish by freeing the Gumbo parsed output and cleaning up libcurl:

gumbo_destroy_output(&kGumboDefaultOptions, output);
curl_easy_cleanup(curl);

In this guide we:

Initialized libcurl and sent a GET request

Parsed HTML content with Gumbo

Selected elements using classes, tags and attributes

Extracted text, attribute values and sub-elements

Printed out scraped data for each article

Web scraping takes practice, but by breaking it down step-by-step hopefully this tutorial provided a solid foundation!

Here is the full code:

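#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <curl/curl.h>
#include <gumbo.h>

// Callback function for libcurl to append each chunk of the HTTP response to a string
size_t WriteCallback(char* contents, size_t size, size_t nmemb, void* userdata) {
    size_t total_size = size * nmemb;
    static_cast<std::string*>(userdata)->append(contents, total_size);
    return total_size;
}

// Gumbo only exposes the parse tree and gumbo_get_attribute(); it has no
// selector functions. The helpers below are our own tree walkers.

// Returns true if the element's class attribute contains class_name as one token
bool has_class(const GumboNode* node, const std::string& class_name) {
    if (!node || node->type != GUMBO_NODE_ELEMENT) return false;
    GumboAttribute* attr = gumbo_get_attribute(&node->v.element.attributes, "class");
    if (!attr) return false;
    std::istringstream tokens(attr->value);
    std::string token;
    while (tokens >> token) {
        if (token == class_name) return true;
    }
    return false;
}

// Concatenates the text of a node and all of its descendants
std::string gumbo_get_text(const GumboNode* node) {
    if (!node) return "";
    if (node->type == GUMBO_NODE_TEXT || node->type == GUMBO_NODE_WHITESPACE) {
        return node->v.text.text;
    }
    if (node->type != GUMBO_NODE_ELEMENT) return "";
    std::string text;
    const GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        text += gumbo_get_text(static_cast<const GumboNode*>(children->data[i]));
    }
    return text;
}

// Depth-first search for the first element (node included) accepted by matches
template <typename Predicate>
GumboNode* find_first(GumboNode* node, Predicate matches) {
    if (!node || node->type != GUMBO_NODE_ELEMENT) return nullptr;
    if (matches(node)) return node;
    const GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        GumboNode* found = find_first(static_cast<GumboNode*>(children->data[i]), matches);
        if (found) return found;
    }
    return nullptr;
}

GumboNode* gumbo_get_element_by_class(GumboNode* node, const std::string& class_name) {
    return find_first(node, [&](const GumboNode* n) { return has_class(n, class_name); });
}

GumboNode* gumbo_get_element_by_tag(GumboNode* node, GumboTag tag) {
    return find_first(node, [&](const GumboNode* n) { return n->v.element.tag == tag; });
}

// Finds the first link whose text contains the given substring
GumboNode* gumbo_get_element_by_text(GumboNode* node, const std::string& text) {
    return find_first(node, [&](const GumboNode* n) {
        return n->v.element.tag == GUMBO_TAG_A &&
               gumbo_get_text(n).find(text) != std::string::npos;
    });
}

// Collects every <tr> element below node in document order
void collect_rows(GumboNode* node, std::vector<GumboNode*>& rows) {
    if (!node || node->type != GUMBO_NODE_ELEMENT) return;
    if (node->v.element.tag == GUMBO_TAG_TR) rows.push_back(node);
    const GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        collect_rows(static_cast<GumboNode*>(children->data[i]), rows);
    }
}

int main() {
    // Initialize libcurl
    CURL* curl = curl_easy_init();
    if (!curl) {
        std::cerr << "Failed to initialize libcurl" << std::endl;
        return 1;
    }

    // Define the URL of the Hacker News homepage
    std::string url = "https://news.ycombinator.com/";

    // Send a GET request to the URL
    std::string response_data;
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response_data);

    CURLcode res = curl_easy_perform(curl);

    if (res != CURLE_OK) {
        std::cerr << "Failed to retrieve the page. Error: " << curl_easy_strerror(res) << std::endl;
        curl_easy_cleanup(curl);
        return 1;
    }

    // Parse the HTML with Gumbo
    GumboOutput* output = gumbo_parse(response_data.c_str());
    GumboNode* root = output->root;

    // Find all rows in the page
    std::vector<GumboNode*> rows;
    collect_rows(root, rows);

    // Iterate through the rows to scrape articles
    GumboNode* current_article = nullptr;

    for (GumboNode* row : rows) {
        if (has_class(row, "athing")) {
            // This is an article row; its details are in the row that follows
            current_article = row;
        } else if (current_article) {
            // This is the details row for current_article.
            // Assumption: the title link sits inside an element with class
            // "titleline"; the rank cell also has class "title", so matching
            // "title" alone would hit the rank cell first.
            GumboNode* title_elem = gumbo_get_element_by_class(current_article, "titleline");
            GumboNode* anchor_elem = gumbo_get_element_by_tag(title_elem, GUMBO_TAG_A);
            if (anchor_elem) {
                std::string article_title = gumbo_get_text(anchor_elem);
                GumboAttribute* href = gumbo_get_attribute(&anchor_elem->v.element.attributes, "href");
                std::string article_url = href ? href->value : "";

                // Assumption: the class names "score" and "age" match the page markup.
                // Fields that are missing on a row are left empty.
                GumboNode* subtext = gumbo_get_element_by_class(row, "subtext");
                std::string points = gumbo_get_text(gumbo_get_element_by_class(subtext, "score"));
                std::string author = gumbo_get_text(gumbo_get_element_by_class(subtext, "hnuser"));
                GumboNode* age = gumbo_get_element_by_class(subtext, "age");
                GumboAttribute* title_attr =
                    age ? gumbo_get_attribute(&age->v.element.attributes, "title") : nullptr;
                std::string timestamp = title_attr ? title_attr->value : "";
                GumboNode* comments_elem = gumbo_get_element_by_text(subtext, "comment");
                std::string comments = comments_elem ? gumbo_get_text(comments_elem) : "0";

                // Print the extracted information
                std::cout << "Title: " << article_title << std::endl;
                std::cout << "URL: " << article_url << std::endl;
                std::cout << "Points: " << points << std::endl;
                std::cout << "Author: " << author << std::endl;
                std::cout << "Timestamp: " << timestamp << std::endl;
                std::cout << "Comments: " << comments << std::endl;
                std::cout << "--------------------------------------------------" << std::endl;
            }

            // Reset the current article
            current_article = nullptr;
        }
    }

    // Clean up libcurl and Gumbo
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    curl_easy_cleanup(curl);

    return 0;
}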

This is great as a learning exercise, but it is easy to see that a scraper like this is prone to getting blocked, as it uses a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high-speed rotating proxies located all over the world, automatic IP rotation, automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions), and automatic CAPTCHA solving technology, hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language like Node or PHP, or with any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
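From C++, the same request URL can be assembled as a string and handed to CURLOPT_URL. The sketch below uses only the endpoint and the key, render and url parameters shown above. BuildProxyUrl and PercentEncode are hypothetical helper names, and we assume the service decodes the url parameter like an ordinary query string, so the target is percent-encoded to keep its own ? and & characters from being read as extra parameters:

```cpp
#include <cctype>
#include <string>

// Percent-encodes every byte except the RFC 3986 unreserved characters.
std::string PercentEncode(const std::string& value) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string encoded;
    for (unsigned char c : value) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            encoded += static_cast<char>(c);
        } else {
            encoded += '%';
            encoded += kHex[c >> 4];
            encoded += kHex[c & 0x0F];
        }
    }
    return encoded;
}

// Builds the request URL for the endpoint shown above.
// Assumption: the service accepts a percent-encoded url parameter.
std::string BuildProxyUrl(const std::string& api_key,
                          const std::string& target_url,
                          bool render) {
    std::string url = "http://api.proxiesapi.com/?key=" + PercentEncode(api_key);
    if (render) url += "&render=true";
    url += "&url=" + PercentEncode(target_url);
    return url;
}
```

The result of BuildProxyUrl("API_KEY", "https://news.ycombinator.com/", true) can replace the url string passed to curl_easy_setopt in the scraper, leaving the rest of the code unchanged.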

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
