Web Scraping Google Scholar in C++

Jan 21, 2024 · 8 min read

Google Scholar is an invaluable resource for researching academic papers and articles. However, the search interface limits you to manually looking through results. To do more advanced research, it's helpful to be able to directly access the paper metadata: title, URL, authors, abstract, and so on.

This is the Google Scholar result page we are talking about…

The code in this article explains how to scrape a Google Scholar search URL to extract key metadata fields that you can then programmatically analyze or export elsewhere. We'll walk through the steps for a beginner audience new to web scraping.

Installations & Imports

To get started, you'll need the following:

- libcurl
- tidy HTML parser

C++ imports:
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <regex>
#include <curl/curl.h>
#include <tidy/tidy.h>
#include <tidy/buffio.h>

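Make sure libcurl and Tidy are installed on your system (including their development headers) and that the headers above are included. Depending on how Tidy is packaged, its headers may instead be available as <tidy.h> and <tidybuffio.h>, so adjust the include paths if the compiler cannot find them.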

Walkthrough

We first define two key constants:

// Define the URL of the Google Scholar search page
const std::string url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";

// Define a User-Agent header
const std::string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";

The url contains the search URL from Google Scholar that we want to scrape. The userAgent simulates a Chrome browser request so Google thinks we're accessing from a real browser.
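The search term in that URL is hard-coded as q=transformers. As a minimal sketch of how the same URL could be built for an arbitrary query, the helpers below percent-encode the query text using only the standard library. Both encodeQueryValue and buildScholarUrl are hypothetical names introduced here for illustration; the other parameters are copied unchanged from the constant above.

```cpp
#include <cctype>
#include <string>

// Percent-encode a query value: unreserved characters pass through,
// spaces become '+', and every other byte becomes %XX.
std::string encodeQueryValue(const std::string& value) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : value) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else if (c == ' ') {
            out += '+';
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Hypothetical helper: builds a Scholar search URL with the same fixed
// parameters as the hard-coded constant in this article.
std::string buildScholarUrl(const std::string& query) {
    return "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=" +
           encodeQueryValue(query) + "&btnG=";
}
```

For example, buildScholarUrl("attention is all you need") produces a URL whose q parameter is attention+is+all+you+need, and the result can be passed to CURLOPT_URL via c_str() just like the constant.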

Next we define a callback function WriteCallback that libcurl will use to collect the HTML response:

// Callback function for libcurl to write response data into a string
size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
  // .. function body ..
}

In the main() function, we initialize libcurl and provide these two constants:

// Initialize libcurl
CURL* curl = curl_easy_init();

curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
curl_easy_setopt(curl, CURLOPT_USERAGENT, userAgent.c_str());

We define a response string to store the HTML content. WriteCallback is registered with CURLOPT_WRITEFUNCTION and the address of the string with CURLOPT_WRITEDATA, so each chunk of the response is appended to the string as it is received.
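To make the accumulation step concrete, the sketch below calls the callback by hand with two chunks, roughly the way libcurl would during a transfer. The callback body matches the one in the full code at the end of this article; simulateTransfer is a hypothetical helper for illustration only and is not part of libcurl.

```cpp
#include <cstddef>
#include <string>

// Same shape as the callback used in this article: the last argument is
// the pointer that was registered through CURLOPT_WRITEDATA.
size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
    size_t totalSize = size * nmemb;  // number of bytes in this chunk
    output->append(static_cast<char*>(contents), totalSize);
    return totalSize;                 // report that the whole chunk was consumed
}

// Hypothetical helper: feeds two chunks into one string, standing in for
// a response that arrives in pieces.
std::string simulateTransfer() {
    std::string response;
    char first[] = "<html><body>";
    char second[] = "Hello</body></html>";
    WriteCallback(first, 1, sizeof(first) - 1, &response);
    WriteCallback(second, 1, sizeof(second) - 1, &response);
    return response;
}
```

The return value matters: if the callback returns a number different from the bytes it was given, libcurl treats that as an error and aborts the transfer.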

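We then send a GET request and check that the transfer succeeded:

CURLcode res = curl_easy_perform(curl);

if (res == CURLE_OK) {
  // Transfer completed
} else {
  // Transfer failed
}

Note that CURLE_OK only means the transfer completed without a libcurl error. It does not by itself confirm an HTTP 200 response; the status code can be read separately with curl_easy_getinfo and CURLINFO_RESPONSE_CODE.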

If successful, we have the HTML content. We use the Tidy parser to clean up the HTML:

// Parse HTML using Tidy
TidyDoc tidyDoc = tidyCreate();
// .. Tidy options & setup ..

tidyParseString(tidyDoc, response.c_str());

// Clean up HTML
tidyCleanAndRepair(tidyDoc);
tidyRunDiagnostics(tidyDoc);

// Save cleaned HTML
tidySaveBuffer(tidyDoc, &outputBuffer);

The formatted HTML is now stored in outputBuffer. We convert this to a string to simplify further parsing with regexes.

// Convert to string
std::string htmlContent = reinterpret_cast<char*>(outputBuffer.bp);

Extracting Data with Regular Expressions

Inspecting the code

You can see that each result is enclosed in a <div> element with the class gs_ri.

Here is where the real scraping takes place. We define four regex patterns to match and extract the title, URL, authors, and abstract text from the HTML:

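Title:

std::regex titleRegex("<h3 class=\"gs_rt\">(.*?)<\\/h3>");

  • <h3 class="gs_rt"> - Matches the opening tag for the title text
  • (.*?) - Capturing group that lazily matches the title characters
  • <\/h3> - Closing title tag

To extract the title from the current match, we use:

std::string title = titleIterator->str(1);

This pulls just the captured group into the title string.

URL:

std::regex urlRegex("<a href=\"(.*?)\"");

  • <a href=" - Matches the opening of a link tag
  • (.*?) - Capturing group that matches the URL characters
  • " - Closing quote of the link URL

We extract just the link URL with:

std::string url = urlIterator->str(1);

Authors:

std::regex authorsRegex("<div class=\"gs_a\">(.*?)<\\/div>");

  • <div class="gs_a"> - Opens the div containing author info
  • (.*?) - Captures the author text
  • <\/div> - Closing tag

The authors are extracted via:

std::string authors = authorsIterator->str(1);

Abstract:

std::regex abstractRegex("<div class=\"gs_rs\">(.*?)<\\/div>");

  • <div class="gs_rs"> - Opens the div for the abstract text
  • (.*?) - Matches and captures the abstract content
  • <\/div> - Closing div

We grab just the captured abstract with:

std::string abstract = abstractIterator->str(1);

Two limitations are worth keeping in mind. In std::regex's default grammar, . does not match a newline, so a field that spans several lines in Tidy's output will not match. Also, urlRegex matches every link on the page, not only the title links, and the title capture keeps any tags nested inside the <h3>.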
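To match multiple occurrences, we iterate using std::sregex_iterator, which finds all regex matches in the HTML. Each iterator is advanced with ++ until it compares equal to a default-constructed std::sregex_iterator, which marks the end of the matches. We extract and print the data from each one.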
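The iteration pattern can be sketched on its own, without libcurl or Tidy. The helper names extractAll and stripTags are hypothetical, and the pattern in the usage note uses [\s\S] instead of . so that it also matches across line breaks, which is a small departure from the regexes in this article.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect capture group 1 of every match of `pattern` in `html`.
std::vector<std::string> extractAll(const std::string& html, const std::regex& pattern) {
    std::vector<std::string> results;
    // A default-constructed sregex_iterator serves as the end marker.
    for (std::sregex_iterator it(html.begin(), html.end(), pattern), end; it != end; ++it) {
        results.push_back((*it)[1].str());
    }
    return results;
}

// Remove anything that looks like a tag, e.g. the <a> wrapped around a title.
std::string stripTags(const std::string& fragment) {
    static const std::regex tagRegex("<[^>]*>");
    return std::regex_replace(fragment, tagRegex, "");
}
```

Calling extractAll(htmlContent, std::regex("<h3 class=\"gs_rt\">([\\s\\S]*?)</h3>")) returns one string per title heading, and passing each through stripTags leaves only the visible text.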

And that covers the key components for scraping the metadata!
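One refinement is to parse each result block separately, so that the title, URL, authors, and abstract of one record always come from the same gs_ri element and a missing field does not shift the others. The following is only a sketch under that assumption: ScholarResult, firstGroup, stripTags, and parseResults are hypothetical names, the patterns allow extra attributes on the tags, and the real page markup may differ or change.

```cpp
#include <regex>
#include <string>
#include <vector>

// One record per search result; a field stays empty when it is not found.
struct ScholarResult {
    std::string title;
    std::string url;
    std::string authors;
    std::string abstractText;
};

// Return capture group 1 of the first match in `text`, or "" if there is none.
std::string firstGroup(const std::string& text, const std::regex& pattern) {
    std::smatch match;
    return std::regex_search(text, match, pattern) ? match[1].str() : std::string();
}

// Remove anything that looks like a tag, leaving only the text.
std::string stripTags(const std::string& fragment) {
    static const std::regex tagRegex("<[^>]*>");
    return std::regex_replace(fragment, tagRegex, "");
}

std::vector<ScholarResult> parseResults(const std::string& html) {
    // [\s\S] matches any character including newlines; [^>]* tolerates extra attributes.
    static const std::regex titleRegex("<h3 class=\"gs_rt\"[^>]*>([\\s\\S]*?)</h3>");
    static const std::regex hrefRegex("href=\"([^\"]*)\"");
    static const std::regex authorsRegex("<div class=\"gs_a\"[^>]*>([\\s\\S]*?)</div>");
    static const std::regex abstractRegex("<div class=\"gs_rs\"[^>]*>([\\s\\S]*?)</div>");

    const std::string marker = "<div class=\"gs_ri\"";
    std::vector<ScholarResult> results;
    std::string::size_type start = html.find(marker);
    while (start != std::string::npos) {
        // A block runs from one marker to the next, or to the end of the page.
        std::string::size_type next = html.find(marker, start + marker.size());
        std::string block = html.substr(
            start, next == std::string::npos ? std::string::npos : next - start);

        std::string titleHtml = firstGroup(block, titleRegex);
        ScholarResult result;
        result.title = stripTags(titleHtml);
        result.url = firstGroup(titleHtml, hrefRegex);  // only a link inside the title
        result.authors = stripTags(firstGroup(block, authorsRegex));
        result.abstractText = stripTags(firstGroup(block, abstractRegex));
        results.push_back(result);
        start = next;
    }
    return results;
}
```

Because each field is searched only within its own block, a result without a link or abstract simply yields empty strings rather than borrowing a value from a neighbouring result.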

Full Code

Here is the full code to bring the whole process together:

    #include <iostream>
    #include <string>
    #include <vector>
    #include <algorithm>
    #include <regex>
    #include <curl/curl.h>
    #include <tidy/tidy.h>
    #include <tidy/buffio.h>
    
    // Define the URL of the Google Scholar search page
    const std::string url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
    
    // Define a User-Agent header
    const std::string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
    
    // Callback function for libcurl to write response data into a string
    size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
        size_t totalSize = size * nmemb;
        output->append(static_cast<char*>(contents), totalSize);
        return totalSize;
    }
    
    int main() {
        // Initialize libcurl
        CURL* curl = curl_easy_init();
    
        if (curl) {
            // Set the URL and User-Agent header
            curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
            curl_easy_setopt(curl, CURLOPT_USERAGENT, userAgent.c_str());
    
            // Response string to store the HTML content
            std::string response;
    
            // Set the callback function to handle the response data
            curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);
    
            // Send a GET request
            CURLcode res = curl_easy_perform(curl);
    
            // Check that the transfer completed without a libcurl error (this alone does not confirm an HTTP 200 status)
            if (res == CURLE_OK) {
                // Parse the HTML content using Tidy
                TidyDoc tidyDoc = tidyCreate();
                TidyBuffer outputBuffer = {0};
                TidyBuffer errBuffer = {0};
    
                tidyOptSetBool(tidyDoc, TidyXhtmlOut, yes);
                tidyOptSetInt(tidyDoc, TidyWrapLen, 4096);
                tidySetErrorBuffer(tidyDoc, &errBuffer);
                tidyParseString(tidyDoc, response.c_str());
    
                tidyCleanAndRepair(tidyDoc);
                tidyRunDiagnostics(tidyDoc);
    
                tidySaveBuffer(tidyDoc, &outputBuffer);
                
                // Convert the output to a string
                std::string htmlContent = reinterpret_cast<char*>(outputBuffer.bp);
    
                // Use regular expressions to extract information
                std::regex titleRegex("<h3 class=\"gs_rt\">(.*?)<\\/h3>");
                std::regex urlRegex("<a href=\"(.*?)\"");
                std::regex authorsRegex("<div class=\"gs_a\">(.*?)<\\/div>");
                std::regex abstractRegex("<div class=\"gs_rs\">(.*?)<\\/div>");
    
                // Find all matches in the HTML content
                std::sregex_iterator titleIterator(htmlContent.begin(), htmlContent.end(), titleRegex);
                std::sregex_iterator urlIterator(htmlContent.begin(), htmlContent.end(), urlRegex);
                std::sregex_iterator authorsIterator(htmlContent.begin(), htmlContent.end(), authorsRegex);
                std::sregex_iterator abstractIterator(htmlContent.begin(), htmlContent.end(), abstractRegex);
    
                // A default-constructed iterator marks the end of every match sequence
                std::sregex_iterator end;
    
                // Advance the four sequences in step until any of them runs out.
                // The sequences are independent and urlRegex matches every link on the
                // page, so the i-th URL is not guaranteed to belong to the i-th title.
                for (; titleIterator != end && urlIterator != end &&
                       authorsIterator != end && abstractIterator != end;
                     ++titleIterator, ++urlIterator, ++authorsIterator, ++abstractIterator) {
                    std::string title = titleIterator->str(1);
                    std::string link = urlIterator->str(1);  // named link to avoid shadowing the global url
                    std::string authors = authorsIterator->str(1);
                    std::string abstract = abstractIterator->str(1);
    
                    // Print the extracted information
                    std::cout << "Title: " << title << std::endl;
                    std::cout << "URL: " << link << std::endl;
                    std::cout << "Authors: " << authors << std::endl;
                    std::cout << "Abstract: " << abstract << std::endl;
                    std::cout << std::string(50, '-') << std::endl;
                }
    
                // Clean up
                tidyBufFree(&outputBuffer);
                tidyBufFree(&errBuffer);
                tidyRelease(tidyDoc);
            } else {
                std::cerr << "Failed to retrieve the page. CURL error code: " << res << std::endl;
            }
    
            // Cleanup libcurl
            curl_easy_cleanup(curl);
        } else {
            std::cerr << "Failed to initialize libcurl." << std::endl;
        }
    
        return 0;
    }
    

This is great as a learning exercise, but it is easy to see that the scraper is prone to getting blocked because it uses a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology,
  • Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API from any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language like Node or PHP, or using any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support.

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
