Web Scraping Google Scholar in C++

Google Scholar is an invaluable resource for researching academic papers and articles. However, the search interface limits you to manually looking through results. For more advanced research, it helps to access the paper metadata directly: title, URL, authors, abstract, and so on.

This is the Google Scholar result page we are talking about…

The code in this article explains how to scrape a Google Scholar search URL to extract key metadata fields that you can then programmatically analyze or export elsewhere. We'll walk through the steps for a beginner audience new to web scraping.

Installations & Imports

To get started, you'll need the following:

- libcurl
- tidy HTML parser

C++ imports:
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <regex>
#include <curl/curl.h>
#include <tidy/tidy.h>
#include <tidy/buffio.h>

Make sure to install libcurl and tidy on your system and import the necessary C++ libraries.

Walkthrough

We first define two key constants:

// Define the URL of the Google Scholar search page
const std::string url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";

// Define a User-Agent header
const std::string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";

The url contains the search URL from Google Scholar that we want to scrape. The userAgent simulates a Chrome browser request so Google thinks we're accessing from a real browser.
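If you want to search for something other than "transformers", the query text has to be percent-encoded before it goes into the q parameter. The following is a minimal sketch using only the standard library. PercentEncode and BuildScholarUrl are hypothetical helper names, not part of the article's code, and the other parameters (hl, as_sdt, btnG) are simply copied from the URL above.

```cpp
#include <cctype>
#include <string>

// Percent-encode every byte except the RFC 3986 "unreserved" characters.
std::string PercentEncode(const std::string& text) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string encoded;
    for (unsigned char c : text) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            encoded += static_cast<char>(c);
        } else {
            encoded += '%';
            encoded += kHex[c >> 4];
            encoded += kHex[c & 0x0F];
        }
    }
    return encoded;
}

// Hypothetical helper: reuses the article's URL parameters with a new query.
std::string BuildScholarUrl(const std::string& query) {
    return "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=" +
           PercentEncode(query) + "&btnG=";
}
```

Calling BuildScholarUrl("transformers") reproduces the URL used in this article, and the result can be passed to CURLOPT_URL in the same way.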

Next we define a callback function WriteCallback that libcurl will use to collect the HTML response:

// Callback function for libcurl to write response data into a string
size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
  // .. function body ..
}
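libcurl may call the write callback several times during one transfer, each time with the next piece of the response, and it expects the callback to return the number of bytes it handled. The sketch below shows that accumulation without needing a network connection. The callback body matches the one in the full code further down; FeedInChunks is a hypothetical helper that stands in for libcurl delivering the body piece by piece.

```cpp
#include <cstddef>
#include <string>

// Same logic as the article's callback: append the chunk and report how many
// bytes were consumed. Returning a different number makes libcurl abort.
std::size_t WriteCallback(void* contents, std::size_t size, std::size_t nmemb,
                          std::string* output) {
    std::size_t totalSize = size * nmemb;
    output->append(static_cast<char*>(contents), totalSize);
    return totalSize;
}

// Hypothetical helper: feeds `body` to the callback in pieces of at most
// `chunkSize` bytes and returns whatever the callback collected.
std::string FeedInChunks(const std::string& body, std::size_t chunkSize) {
    std::string collected;
    if (chunkSize == 0) {
        return collected;  // nothing sensible to do with empty chunks
    }
    for (std::size_t pos = 0; pos < body.size(); pos += chunkSize) {
        std::string chunk = body.substr(pos, chunkSize);
        WriteCallback(&chunk[0], 1, chunk.size(), &collected);
    }
    return collected;
}
```

Whatever chunk size is used, the collected string should equal the original body, which is the property the scraper relies on.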

In the main() function, we initialize libcurl and provide these two constants:

// Initialize libcurl
CURL* curl = curl_easy_init();

curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
curl_easy_setopt(curl, CURLOPT_USERAGENT, userAgent.c_str());

We define a response string to store the HTML content. The WriteCallback is set as the callback handler to write the response into this string as the data is received.

We then send a GET request and check that it succeeded:

CURLcode res = curl_easy_perform(curl);

if (res == CURLE_OK) {
  // Request succeeded
} else {
   // Request failed
}

If successful, we have the HTML content. We use the Tidy parser to clean up the HTML:

// Parse HTML using Tidy
TidyDoc tidyDoc = tidyCreate();
// .. Tidy options & setup ..

tidyParseString(tidyDoc, response.c_str());

// Clean up HTML
tidyCleanAndRepair(tidyDoc);
tidyRunDiagnostics(tidyDoc);

// Save cleaned HTML
tidySaveBuffer(tidyDoc, &outputBuffer);

The formatted HTML is now stored in outputBuffer. We convert this to a string to simplify further parsing with regexes.

// Convert to string
std::string htmlContent = reinterpret_cast<char*>(outputBuffer.bp);
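One caveat with the conversion above: constructing a std::string from a null char pointer is undefined behavior, so it is worth guarding against an empty buffer. Here is a small sketch of a safer conversion. BufferToString is a hypothetical helper, and it assumes the Tidy buffer exposes its data pointer and byte count as the bp and size fields, so that it could be called as BufferToString(outputBuffer.bp, outputBuffer.size).

```cpp
#include <cstddef>
#include <string>

// Copies `size` bytes starting at `data` into a std::string.
// Returns an empty string when the buffer is null or has no content.
std::string BufferToString(const unsigned char* data, std::size_t size) {
    if (data == nullptr || size == 0) {
        return std::string();
    }
    return std::string(reinterpret_cast<const char*>(data), size);
}
```

Passing the length explicitly also means the result does not depend on where a terminating null byte happens to sit in the buffer.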

Extracting Data with Regular Expressions

Inspecting the code

You can see that the items are enclosed in a <div> element with the class gs_ri.
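Since every result sits in its own gs_ri container, one way to keep the fields of different results from getting mixed up is to cut the page into one chunk per result before applying any field-level pattern. The sketch below does this with plain string searches. SplitResultBlocks is a hypothetical helper, and it assumes the attribute appears in the cleaned HTML exactly as class="gs_ri".

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits `html` into chunks, each starting at one class="gs_ri" marker and
// ending just before the next one (or at the end of the document).
std::vector<std::string> SplitResultBlocks(const std::string& html) {
    const std::string marker = "class=\"gs_ri\"";
    std::vector<std::string> blocks;
    std::size_t start = html.find(marker);
    while (start != std::string::npos) {
        std::size_t next = html.find(marker, start + marker.size());
        std::size_t length =
            (next == std::string::npos) ? std::string::npos : next - start;
        blocks.push_back(html.substr(start, length));
        start = next;
    }
    return blocks;
}
```

Each chunk can then be searched for a title, link, authors, and abstract on its own, so a result with a missing field does not shift the fields of the results after it.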

Here is where the real scraping takes place. We define four regex patterns to match and extract the title, URL, authors, and abstract text from the HTML:

Title:

std::regex titleRegex("<h3 class=\"gs_rt\">(.*?)<\\/h3>");

<h3 class="gs_rt"> - Matches opening tag for title text

(.*?) - Capturing group to match title characters

<\/h3> - Closing title tag

To extract the title, we read the first capture group from the current match:

std::string title = titleIterator->str(1);

This pulls just the captured group into the title string.
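The captured group holds everything between the opening and closing h3 tags, which may include markup such as the link element wrapped around the title text. As a rough sketch, the tags can be removed with a second regex. StripTags is a hypothetical helper, and it assumes the fragment contains no literal > inside attribute values.

```cpp
#include <regex>
#include <string>

// Removes anything that looks like an HTML tag and keeps the text between.
std::string StripTags(const std::string& fragment) {
    static const std::regex tagRegex("<[^>]*>");
    return std::regex_replace(fragment, tagRegex, "");
}
```

For example, StripTags applied to a captured title fragment leaves only the readable title text, which is usually what you want to print or export.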

URL:

std::regex urlRegex("<a href=\"(.*?)\"");

<a href=" - Matches opening tag for link

(.*?) - Capturing group matches URL characters

" - End quote for link URL

Note that this pattern matches every link on the page, not only the links that belong to search results.

We extract just the link URL with:

std::string url = urlIterator->str(1);
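To pick up only the link that belongs to a result, one option is to search for the href inside an already extracted fragment, such as the captured title markup, instead of across the whole page. The following is a minimal sketch of that idea. ExtractFirstHref is a hypothetical helper, and it assumes href values are wrapped in double quotes.

```cpp
#include <regex>
#include <string>

// Returns the href value of the first <a ...> tag in `fragment`,
// or an empty string if the fragment contains no such tag.
std::string ExtractFirstHref(const std::string& fragment) {
    static const std::regex hrefRegex("<a[^>]*\\shref=\"([^\"]*)\"");
    std::smatch match;
    if (std::regex_search(fragment, match, hrefRegex)) {
        return match.str(1);
    }
    return std::string();
}
```

Unlike the page-wide pattern, this version also tolerates other attributes appearing between the tag name and href.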

Authors:

std::regex authorsRegex("<div class=\"gs_a\">(.*?)<\\/div>");

<div class="gs_a"> - Opens div containing author info

(.*?) - Captures author text

<\/div> - Closing tag

The authors are extracted via:

std::string authors = authorsIterator->str(1);
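The author line often bundles several pieces of information together. Assuming it follows a layout like "authors - venue, year - source" with " - " as the separator (an assumption about the page, not something the code above guarantees), it can be split into parts with a small helper. SplitOnSeparator is a hypothetical name, and the sample string in the usage note is made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits `text` at every occurrence of `separator`. An empty separator
// returns the whole text as a single part.
std::vector<std::string> SplitOnSeparator(const std::string& text,
                                          const std::string& separator) {
    std::vector<std::string> parts;
    if (separator.empty()) {
        parts.push_back(text);
        return parts;
    }
    std::size_t start = 0;
    std::size_t pos = text.find(separator, start);
    while (pos != std::string::npos) {
        parts.push_back(text.substr(start, pos - start));
        start = pos + separator.size();
        pos = text.find(separator, start);
    }
    parts.push_back(text.substr(start));
    return parts;
}
```

With a made-up line such as "A Author, B Writer - Journal of Examples, 2020 - example.com", the first part would hold the author names and the last part the source site.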

Abstract:

std::regex abstractRegex("<div class=\"gs_rs\">(.*?)<\\/div>");

<div class="gs_rs"> - Opens div for abstract text

(.*?) - Matches and captures abstract content

<\/div> - Closing div

We grab just the captured abstract with:

std::string abstract = abstractIterator->str(1);
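Abstract snippets can come back with line breaks and runs of spaces left over from the page layout. A small, optional cleanup step is to collapse every run of whitespace into a single space and trim the ends. CollapseWhitespace below is a hypothetical helper sketched with the standard library only.

```cpp
#include <cctype>
#include <string>

// Replaces each run of whitespace with one space and drops leading and
// trailing whitespace.
std::string CollapseWhitespace(const std::string& text) {
    std::string result;
    bool pendingSpace = false;
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            // Remember the gap, but only once some text has been emitted.
            pendingSpace = !result.empty();
        } else {
            if (pendingSpace) {
                result += ' ';
            }
            result += static_cast<char>(c);
            pendingSpace = false;
        }
    }
    return result;
}
```

This keeps the printed abstract on one line, which is easier to read in a terminal and safer to write into a CSV file.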

To match multiple occurrences, we iterate through using std::sregex_iterator which finds all regex matches in the HTML. We extract and print the data from each one.
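As a minimal sketch of that iteration: a std::sregex_iterator is advanced with ++ and compared against a default-constructed iterator, which marks the end of the matches. FindAll is a hypothetical helper name. The test pattern in the usage note uses [\s\S] instead of . because, with the default ECMAScript grammar, . does not match line breaks.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns capture group 1 of every match of `pattern` in `html`.
std::vector<std::string> FindAll(const std::string& html,
                                 const std::regex& pattern) {
    std::vector<std::string> captures;
    const std::sregex_iterator end;  // default-constructed: end of matches
    for (std::sregex_iterator it(html.begin(), html.end(), pattern);
         it != end; ++it) {
        captures.push_back(it->str(1));
    }
    return captures;
}
```

For instance, calling FindAll with the cleaned HTML and a pattern such as "<h3 class=\"gs_rt\">([\\s\\S]*?)</h3>" would return one string per title, including titles that span more than one line.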

And that covers the key components for scraping the metadata!

Full Code

Here is the full code to bring the whole process together:

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <regex>
#include <curl/curl.h>
#include <tidy/tidy.h>
#include <tidy/buffio.h>

// Define the URL of the Google Scholar search page
const std::string url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";

// Define a User-Agent header
const std::string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";

// Callback function for libcurl to write response data into a string
size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
    size_t totalSize = size * nmemb;
    output->append(static_cast<char*>(contents), totalSize);
    return totalSize;
}

int main() {
    // Initialize libcurl
    CURL* curl = curl_easy_init();

    if (curl) {
        // Set the URL and User-Agent header
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_USERAGENT, userAgent.c_str());

        // Response string to store the HTML content
        std::string response;

        // Set the callback function to handle the response data
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

        // Send a GET request
        CURLcode res = curl_easy_perform(curl);

        // Check that the transfer succeeded (CURLE_OK means no transport error, not necessarily HTTP 200)
        if (res == CURLE_OK) {
            // Parse the HTML content using Tidy
            TidyDoc tidyDoc = tidyCreate();
            TidyBuffer outputBuffer = {0};
            TidyBuffer errBuffer = {0};

            tidyOptSetBool(tidyDoc, TidyXhtmlOut, yes);
            tidyOptSetInt(tidyDoc, TidyWrapLen, 4096);
            tidySetErrorBuffer(tidyDoc, &errBuffer);
            tidyParseString(tidyDoc, response.c_str());

            tidyCleanAndRepair(tidyDoc);
            tidyRunDiagnostics(tidyDoc);

            tidySaveBuffer(tidyDoc, &outputBuffer);
            
            // Convert the output to a string (bp is null if nothing was saved)
            std::string htmlContent;
            if (outputBuffer.bp) {
                htmlContent.assign(reinterpret_cast<char*>(outputBuffer.bp), outputBuffer.size);
            }

            // Use regular expressions to extract information.
            // Note: '.' does not match line breaks, so each pattern only
            // matches when the whole element sits on one line.
            std::regex titleRegex("<h3 class=\"gs_rt\">(.*?)<\\/h3>");
            std::regex urlRegex("<a href=\"(.*?)\"");
            std::regex authorsRegex("<div class=\"gs_a\">(.*?)<\\/div>");
            std::regex abstractRegex("<div class=\"gs_rs\">(.*?)<\\/div>");

            // Find all matches in the HTML content
            std::sregex_iterator titleIterator(htmlContent.begin(), htmlContent.end(), titleRegex);
            std::sregex_iterator urlIterator(htmlContent.begin(), htmlContent.end(), urlRegex);
            std::sregex_iterator authorsIterator(htmlContent.begin(), htmlContent.end(), authorsRegex);
            std::sregex_iterator abstractIterator(htmlContent.begin(), htmlContent.end(), abstractRegex);

            // A default-constructed iterator marks the end of the matches
            const std::sregex_iterator end;

            // Walk the four iterators in lockstep and stop when any of them runs out
            while (titleIterator != end && urlIterator != end &&
                   authorsIterator != end && abstractIterator != end) {
                std::string title = titleIterator->str(1);
                std::string url = urlIterator->str(1);
                std::string authors = authorsIterator->str(1);
                std::string abstract = abstractIterator->str(1);

                // Print the extracted information
                std::cout << "Title: " << title << std::endl;
                std::cout << "URL: " << url << std::endl;
                std::cout << "Authors: " << authors << std::endl;
                std::cout << "Abstract: " << abstract << std::endl;
                std::cout << std::string(50, '-') << std::endl;

                ++titleIterator;
                ++urlIterator;
                ++authorsIterator;
                ++abstractIterator;
            }

            // Clean up
            tidyBufFree(&outputBuffer);
            tidyBufFree(&errBuffer);
            tidyRelease(tidyDoc);
        } else {
            std::cerr << "Failed to retrieve the page. CURL error code: " << res << std::endl;
        }

        // Cleanup libcurl
        curl_easy_cleanup(curl);
    } else {
        std::cerr << "Failed to initialize libcurl." << std::endl;
    }

    return 0;
}

This is great as a learning exercise, but a scraper like this sends every request from a single IP address, so it is prone to getting blocked. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high-speed rotating proxies located all over the world, automatic IP rotation, automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions), and automatic CAPTCHA solving technology, hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to load Puppeteer yourself: we render JavaScript behind the scenes, so you can fetch the data and parse it in any language, such as Node or PHP, or with any framework, such as Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
