Scraping New York Times News Headlines in C++

Dec 6, 2023 · 6 min read

Web scraping is a technique for extracting data from websites automatically. It can be useful for collecting large volumes of data for analysis. In this guide, we'll walk through a program to scrape article titles and links from The New York Times using C++.

Key Concepts

To follow along, you'll need a basic understanding of:

  • HTTP requests - How data is transferred on the web
  • HTML - Structure of web page content
  • C++ - Programming language we'll use
  • libcurl - C library for transferring data with URLs
  • Gumbo - C library for parsing HTML

Don't worry if you're unfamiliar with these! We'll explain each piece as we go.

    Step 1: Send an HTTP Request and Get the HTML

    We'll use the libcurl library to send an HTTP GET request to fetch the NYTimes HTML:

    CURL* curl = curl_easy_init();
    curl_easy_setopt(curl, CURLOPT_URL, "https://www.nytimes.com/");
    
    // Store response in string
    std::string html;
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
    
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    

    This makes a request just like your browser does, except instead of rendering the HTML, we save it as a string to parse later.

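    Note: libcurl delivers the response in chunks and calls WriteCallback once per chunk, so the callback appends each chunk to html. WriteCallback is defined in the full code at the bottom, and the curl docs explain the details well.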
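To make the callback less mysterious, here is a minimal sketch of one way to write it. It follows the shape libcurl documents for CURLOPT_WRITEFUNCTION (a data pointer, an element size, an element count, and the user pointer set with CURLOPT_WRITEDATA) and assumes that user pointer is the address of a std::string. It uses only the standard library, so it can be tried without libcurl installed.

```cpp
#include <cstddef>
#include <string>

// Sketch of a write callback in the shape libcurl expects:
//   ptr      - the chunk of response data that just arrived
//   size     - size of one element (libcurl documents this as always 1)
//   nmemb    - number of elements in the chunk
//   userdata - whatever was passed via CURLOPT_WRITEDATA; assumed here
//              to be a std::string* that accumulates the response
size_t WriteCallback(char* ptr, size_t size, size_t nmemb, void* userdata) {
    const size_t total = size * nmemb;
    std::string* out = static_cast<std::string*>(userdata);
    out->append(ptr, total);
    // Returning the number of bytes handled tells libcurl the chunk was consumed;
    // returning a different number makes the transfer fail.
    return total;
}
```

Because libcurl may call the callback many times for one response, appending (rather than assigning) is what lets the full page end up in the string.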

    Step 2: Parse the HTML

    Next we'll use the Gumbo HTML parser to analyze the HTML content:

    GumboOutput* output = gumbo_parse(html.c_str());
    

    This converts the HTML string into a tree of nodes we can traverse programmatically. When you are finished with the tree, release it with gumbo_destroy_output.

    Step 3: Find Article Elements

    Inspecting the page

    Inspecting the page with Chrome's developer tools shows how the markup is structured.

    The articles we want to extract are contained in section tags with the class "story-wrapper" on the NYTimes homepage.

    We'll walk through the parsed HTML tree to find them:

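    void FindArticles(GumboNode* node) {
        if (node->type != GUMBO_NODE_ELEMENT) return;

        // GumboVector is a plain C array of void*, so index it rather than using range-for
        GumboVector* children = &node->v.element.children;
        for (unsigned int i = 0; i < children->length; ++i) {
            GumboNode* child = static_cast<GumboNode*>(children->data[i]);
            if (child->type == GUMBO_NODE_ELEMENT &&
                child->v.element.tag == GUMBO_TAG_SECTION) {

                GumboAttribute* class_attr =
                    gumbo_get_attribute(&child->v.element.attributes, "class");

                // value is a C string, so compare with strcmp (from <cstring>), not ==
                if (class_attr && std::strcmp(class_attr->value, "story-wrapper") == 0) {
                    // Found article element
                }
            }
            FindArticles(child);  // keep descending into the tree
        }
    }

    FindArticles(output->root);
    

    Here we:

    1. Start at the root element
    2. Loop through each node's children
    3. Check if the child is a section tag
    4. Fetch the class attribute
    5. Check if it equals "story-wrapper"
    6. Recurse into the child, since articles are nested well below the root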

    This filters the page's many elements down to just the articles.
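One caveat with the class check: an HTML class attribute can hold several space-separated names, so a comparison against the whole attribute value misses elements that carry "story-wrapper" alongside other classes. The sketch below shows one way to test for a single class name. HasClass is a hypothetical helper, not part of Gumbo, and it works on a plain string such as the value field of a class attribute.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `wanted` is one of the
// whitespace-separated class names in `class_value`.
bool HasClass(const std::string& class_value, const std::string& wanted) {
    std::istringstream tokens(class_value);
    std::string token;
    while (tokens >> token) {   // operator>> splits on whitespace
        if (token == wanted) {
            return true;
        }
    }
    return false;
}
```

For example, HasClass("css-1abc story-wrapper", "story-wrapper") is true, while HasClass("story-wrapper-inner", "story-wrapper") is false because only whole names are compared.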

    Step 4: Extract Title and Links

    Now we can dig into our found article elements to get the title and link. These are stored in specific child elements we can search for:

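    // Within found article element
    GumboVector* inner_nodes = &child->v.element.children;
    for (unsigned int i = 0; i < inner_nodes->length; ++i) {
        GumboNode* inner = static_cast<GumboNode*>(inner_nodes->data[i]);
        if (inner->type != GUMBO_NODE_ELEMENT) continue;

        if (inner->v.element.tag == GUMBO_TAG_H2 &&
            inner->v.element.children.length > 0) {
            // Title element: its text is stored in a child text node
            GumboNode* text = static_cast<GumboNode*>(inner->v.element.children.data[0]);
            if (text->type == GUMBO_NODE_TEXT) {
                std::string title = text->v.text.text;
            }
        }
        else if (inner->v.element.tag == GUMBO_TAG_A) {
            // Link element: the href attribute may be missing, so check for null
            GumboAttribute* href =
                gumbo_get_attribute(&inner->v.element.attributes, "href");
            if (href) {
                std::string url = href->value;
            }
        }
    }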
    

    And we have our data! The full code at the bottom puts this all together into a program that prints out titles and links.
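In general, an href can be relative (for example "/section/world") rather than a full URL, so it may help to normalize links before storing them. The sketch below is one simplified way to do that. ResolveUrl is a hypothetical helper, and it assumes the origin is passed as scheme plus host with no trailing slash. It does not handle "../" segments, query-only links, or fragments.

```cpp
#include <string>

// Hypothetical helper: turn an href into an absolute URL.
// `origin` is assumed to look like "https://www.nytimes.com" (no trailing slash).
std::string ResolveUrl(const std::string& origin, const std::string& href) {
    // rfind(prefix, 0) == 0 is a "starts with" check
    if (href.rfind("http://", 0) == 0 || href.rfind("https://", 0) == 0) {
        return href;                 // already absolute
    }
    if (href.rfind("//", 0) == 0) {
        return "https:" + href;      // protocol-relative; HTTPS is assumed here
    }
    if (!href.empty() && href[0] == '/') {
        return origin + href;        // root-relative path
    }
    return origin + "/" + href;      // simplification: treat the rest as root-relative
}
```

A fuller implementation would resolve links against the page's own URL following RFC 3986; this version only covers the common cases.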

    Key Takeaways

    The scraping process mainly involves:

    1. Getting HTML data
    2. Parsing into structured format
    3. Traversing parsed DOM to extract relevant data

    There are lots of optimizations possible, but this covers the core technique. You could build on it to fetch the full content of each article or to add caching, for example.

    Hope this gives you a template for getting started with scraping in C++!

    Full Code

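    #include <cstring>
    #include <iostream>
    #include <string>
    #include <curl/curl.h>
    #include <gumbo.h>
    
    // Callback passed to curl to accumulate response data
    size_t WriteCallback(char* contents, size_t size, size_t nmemb, void* userp) {
      static_cast<std::string*>(userp)->append(contents, size * nmemb);
      return size * nmemb;
    }
    
    // Print the title and link found among an article element's children
    void PrintArticle(GumboNode* article) {
      // GumboVector is a plain C array of void*, so index it rather than using range-for
      GumboVector* inner_nodes = &article->v.element.children;
    
      for (unsigned int i = 0; i < inner_nodes->length; ++i) {
        GumboNode* inner = static_cast<GumboNode*>(inner_nodes->data[i]);
    
        if (inner->type != GUMBO_NODE_ELEMENT) {
          continue;
        }
    
        if (inner->v.element.tag == GUMBO_TAG_H2
            && inner->v.element.children.length > 0) {
          // The title text is stored in a child text node of the h2
          GumboNode* text = static_cast<GumboNode*>(inner->v.element.children.data[0]);
    
          if (text->type == GUMBO_NODE_TEXT) {
            std::cout << text->v.text.text << std::endl;
          }
        }
    
        if (inner->v.element.tag == GUMBO_TAG_A) {
          GumboAttribute* href = gumbo_get_attribute(&inner->v.element.attributes, "href");
    
          if (href) {
            std::cout << href->value << std::endl << std::endl;
          }
        }
      }
    }
    
    // Visit every node in the tree and print each story-wrapper section
    void FindArticles(GumboNode* node) {
      if (node->type != GUMBO_NODE_ELEMENT) {
        return;
      }
    
      GumboVector* children = &node->v.element.children;
    
      for (unsigned int i = 0; i < children->length; ++i) {
        GumboNode* child = static_cast<GumboNode*>(children->data[i]);
    
        if (child->type != GUMBO_NODE_ELEMENT) {
          continue;
        }
    
        if (child->v.element.tag == GUMBO_TAG_SECTION) {
          GumboAttribute* class_attr = gumbo_get_attribute(&child->v.element.attributes, "class");
    
          // value is a C string, so compare with strcmp rather than ==
          if (class_attr && std::strcmp(class_attr->value, "story-wrapper") == 0) {
            PrintArticle(child);
          }
        }
    
        // Keep descending: articles are nested well below the root
        FindArticles(child);
      }
    }
    
    int main() {
    
      // Fetch HTML
      CURL* curl = curl_easy_init();
    
      if (!curl) {
        std::cerr << "curl_easy_init() failed" << std::endl;
        return 1;
      }
    
      curl_easy_setopt(curl, CURLOPT_URL, "https://www.nytimes.com/");
    
      std::string html;
      curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
      curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
    
      CURLcode res = curl_easy_perform(curl);
      curl_easy_cleanup(curl);
    
      if (res != CURLE_OK) {
        std::cerr << "curl_easy_perform() failed: " << curl_easy_strerror(res) << std::endl;
        return 1;
      }
    
      // Parse HTML
      GumboOutput* output = gumbo_parse(html.c_str());
    
      // Find article elements and print their data
      FindArticles(output->root);
    
      // Free the parse tree
      gumbo_destroy_output(&kGumboDefaultOptions, output);
    
      return 0;
    }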
    

    In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same browser.
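As a rough illustration of rotating the User-Agent, the sketch below cycles through a fixed list of strings in round-robin order. UserAgentRotator is a hypothetical class, and the strings you give it are up to you; the short values in the usage note are placeholders, not real browser identifiers.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper that hands out User-Agent strings in round-robin order.
class UserAgentRotator {
public:
    explicit UserAgentRotator(std::vector<std::string> agents)
        : agents_(std::move(agents)) {
        if (agents_.empty()) {
            throw std::invalid_argument("at least one User-Agent is required");
        }
    }

    // Returns the next string, wrapping around at the end of the list.
    const std::string& Next() {
        const std::string& agent = agents_[index_];
        index_ = (index_ + 1) % agents_.size();
        return agent;
    }

private:
    std::vector<std::string> agents_;
    std::size_t index_ = 0;
};
```

With a rotator built from placeholder values such as "ua-one" and "ua-two", each request could set its header with curl_easy_setopt(curl, CURLOPT_USERAGENT, rotator.Next().c_str()) before calling curl_easy_perform.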

    Go a little further and you will find that the server can simply block your IP address, regardless of all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can often make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server, Proxies API, provides a simple API that can solve IP blocking problems with:

  • Millions of high speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed through a simple API call like the one below, from any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
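To make the same call from C++, the request URL has to be assembled first. The sketch below builds a URL in the shape of the curl example above and percent-encodes the parameters, so that a target URL containing its own "?" or "&" is not mistaken for part of the outer query string. PercentEncode and BuildProxyUrl are hypothetical helpers, and whether the service expects the url parameter to be encoded is an assumption worth checking against its documentation.

```cpp
#include <cctype>
#include <string>

// Percent-encodes every byte except the RFC 3986 unreserved characters
// (letters, digits, '-', '_', '.', '~').
std::string PercentEncode(const std::string& in) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += kHex[c >> 4];      // high nibble
            out += kHex[c & 0x0F];    // low nibble
        }
    }
    return out;
}

// Hypothetical helper: builds a request URL shaped like the curl example.
std::string BuildProxyUrl(const std::string& api_key, const std::string& target) {
    return "http://api.proxiesapi.com/?key=" + PercentEncode(api_key) +
           "&url=" + PercentEncode(target);
}
```

The resulting string can be passed to CURLOPT_URL in place of the NYTimes address used earlier.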

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
