Web Scraping Google Scholar in Node.js

In this beginner web scraping tutorial, we'll walk through code that scrapes search results data from Google Scholar.

This is the Google Scholar result page we are talking about…

Overview

We'll be using Node.js for web scraping, with the following key packages:

request-promise: Sends HTTP requests to web pages and returns a Promise for handling the response. Useful for fetching web pages.

cheerio: Provides jQuery-style DOM manipulation of the page content. Allows easy data extraction from HTML.

First we require these packages:

const rp = require('request-promise');
const cheerio = require('cheerio');

Then we set up the initial scraper configuration:

// Define the URL of the Google Scholar search page
const url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";

// Define a User-Agent header
const headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
};

// Configure the request options
const options = {
  uri: url,
  headers: headers,
  transform: function (body) {
    return cheerio.load(body);
  }
};

This sets up the Google Scholar URL we want to scrape, adds a browser User-Agent string, and configures the request to use cheerio for HTML parsing.

Making the Request

With the configuration complete, we can now make the GET request:

// Send a GET request to the URL with the User-Agent header
rp(options)
  .then(($) => {

    // ... extract data here

  })
  .catch((error) => {
    console.error("Failed to retrieve the page:", error);
  });

We pass the options to request-promise and chain a .then() to receive the cheerio-loaded HTML content. This content can now be traversed like a jQuery object.

The .catch() handles any request errors.
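Because a single failed request ends the run, it can help to retry a few times before giving up. The following is a minimal sketch of that idea, not part of request-promise itself: withRetry, attempts, delayMs and flakyRequest are hypothetical names, and flakyRequest only stands in for rp(options) so the example runs without a network connection.

```javascript
// Hypothetical helper: retry a promise-returning function a few times.
// `fn`, `attempts` and `delayMs` are illustrative names, not a library API.
function withRetry(fn, attempts = 3, delayMs = 1000) {
  return fn().catch((error) => {
    if (attempts <= 1) {
      throw error; // Out of attempts: pass the last error on to the caller
    }
    // Wait, then try again with one fewer attempt remaining
    return new Promise((resolve) => setTimeout(resolve, delayMs))
      .then(() => withRetry(fn, attempts - 1, delayMs));
  });
}

// Stand-in for rp(options): fails twice, then succeeds
let calls = 0;
const flakyRequest = () => {
  calls += 1;
  return calls < 3
    ? Promise.reject(new Error("temporary failure"))
    : Promise.resolve("page body");
};

withRetry(flakyRequest, 3, 10)
  .then((body) => console.log("Succeeded after", calls, "calls:", body))
  .catch((error) => console.error("Failed to retrieve the page:", error));
```

In the scraper, the same helper could wrap the real request as withRetry(() => rp(options)), with the existing .then() and .catch() chained after it.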

Extracting Search Result Data

Inspecting the code

You can see that each search result is enclosed in a <div> element with the class gs_ri.

Inside the .then(), we can use the cheerio $ selector to find elements and extract data:

// Find all the search result blocks with class "gs_ri"
const search_results = $(".gs_ri");

// Loop through each search result block
search_results.each((index, element) => {

  // Extract data from each result...

});

We grab all the .gs_ri elements, which correspond to individual search result blocks on Google Scholar.

Then we iterate through them with .each() to extract data from each one.

Title and URL

Let's get the title and URL of the search result:

// Extract the title and URL
const title_elem = $(element).find(".gs_rt");
const title = title_elem.text() || "N/A";

const url = title_elem.find("a").attr("href") || "N/A";

We use .find() to get the title element, then .text() to return its text. For robustness, || "N/A" handles missing values.

The linked URL is extracted directly from the anchor tag's href attribute.
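Text pulled out of HTML often carries stray line breaks and repeated spaces, so the fallback can be combined with a little cleanup. This is a small sketch under that assumption; cleanText is a hypothetical helper name, not a cheerio function.

```javascript
// Hypothetical helper: collapse runs of whitespace and fall back to a default
// when nothing is left.
function cleanText(value, fallback = "N/A") {
  const text = (value || "").replace(/\s+/g, " ").trim();
  return text || fallback;
}

console.log(cleanText("  Deep   learning\n survey ")); // prints "Deep learning survey"
console.log(cleanText(""));                            // prints "N/A"
console.log(cleanText(undefined));                     // prints "N/A"
```

With this in place, the title line could be written as const title = cleanText(title_elem.text()); and the same call works for the other fields.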

Authors

Next up is author data:

// Extract the authors
const authors_elem = $(element).find(".gs_a");
const authors = authors_elem.text() || "N/A";

Simply grab the inner text of the .gs_a element, which contains the author names along with the publication details.
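If you want the authors separated from the publication details, the combined string can be split afterwards. The sketch below assumes the text follows an "authors - publication, year - source" layout, which is an assumption about the page and may not hold for every result. parseAuthorLine and the sample string are made up for illustration.

```javascript
// Hypothetical helper: split the text of a .gs_a element into parts.
// Assumes an "authors - publication, year - source" layout (not guaranteed).
function parseAuthorLine(text) {
  // Non-breaking spaces are normalized so the separator can be matched reliably
  const normalized = text.replace(/\u00a0/g, " ").trim();
  const parts = normalized.split(/\s+-\s+/);
  const yearMatch = normalized.match(/\b(?:19|20)\d{2}\b/);
  return {
    authors: parts[0]
      ? parts[0].split(",").map((name) => name.trim()).filter(Boolean)
      : [],
    publication: parts[1] || "N/A",
    source: parts[2] || "N/A",
    year: yearMatch ? Number(yearMatch[0]) : null,
  };
}

// Made-up sample string, not real scraped data
const sample = "A Author, B Writer - Journal of Examples, 2020 - example.org";
console.log(parseAuthorLine(sample));
```

Keeping the raw authors string alongside the parsed fields is a reasonable precaution, since any result that deviates from the assumed layout will parse incompletely.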

Abstract

Finally, we extract the abstract text:

// Extract the abstract or description
const abstract_elem = $(element).find(".gs_rs");
const abstract = abstract_elem.text() || "N/A";

The .gs_rs element holds the result description and abstract.

Printing the Results

To finish, we log out all the information extracted from each search result:

console.log("Title:", title);
console.log("URL:", url);
console.log("Authors:", authors);
console.log("Abstract:", abstract);

console.log("-".repeat(50)); // Separating search results

This prints the title, URL, authors and abstract for inspection. The separating line keeps each result organized in the terminal output.
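Printing is fine for inspection, but if the data is going to be reused it is often easier to collect each result as an object and serialize the whole list. Here is a minimal sketch of that variation; toRecord and results are hypothetical names, and the pushed values are placeholders standing in for scraped data.

```javascript
// Sketch: collect results as objects instead of printing them one by one.
// `toRecord` and `results` are hypothetical names used for illustration.
function toRecord(title, url, authors, abstract) {
  return { title, url, authors, abstract };
}

const results = [];

// Inside the .each() loop you would push one record per search result.
// The values below are placeholders, not real scraped data.
results.push(
  toRecord(
    "Example title",
    "https://example.com/paper",
    "A Author - Journal of Examples, 2020 - example.org",
    "N/A"
  )
);

// Serialize the whole collection as JSON with two-space indentation
const json = JSON.stringify(results, null, 2);
console.log(json);
```

The resulting string can be written to a file or passed to another program without having to parse terminal output.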

And that covers scraping key data fields from Google Scholar search results! The full code is included below to run as a complete scraper.

Running the Scraper

To run the web scraper code, you need:

Node.js installed

Run npm install request request-promise cheerio to install the packages (request-promise needs request installed alongside it as a peer dependency)

Paste the full code in a .js file

Run with node filename.js

Here is the complete Google Scholar scraping script:

const rp = require('request-promise');
const cheerio = require('cheerio');

// Define the URL of the Google Scholar search page
const url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";

// Define a User-Agent header
const headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36" // Replace with your User-Agent string
};

// Configure the request options
const options = {
    uri: url,
    headers: headers,
    transform: function (body) {
        return cheerio.load(body);
    }
};

// Send a GET request to the URL with the User-Agent header
rp(options)
    .then(($) => {
        // Find all the search result blocks with class "gs_ri"
        const search_results = $(".gs_ri");

        // Loop through each search result block and extract information
        search_results.each((index, element) => {
            // Extract the title and URL
            const title_elem = $(element).find(".gs_rt");
            const title = title_elem.text() || "N/A";
            const url = title_elem.find("a").attr("href") || "N/A";

            // Extract the authors and publication details
            const authors_elem = $(element).find(".gs_a");
            const authors = authors_elem.text() || "N/A";

            // Extract the abstract or description
            const abstract_elem = $(element).find(".gs_rs");
            const abstract = abstract_elem.text() || "N/A";

            // Print the extracted information
            console.log("Title:", title);
            console.log("URL:", url);
            console.log("Authors:", authors);
            console.log("Abstract:", abstract);
            console.log("-".repeat(50)); // Separating search results
        });
    })
    .catch((error) => {
        console.error("Failed to retrieve the page:", error);
    });

The output will show extracted data from search results for the query "transformers". Feel free to customize the search URL for other queries.
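One way to customize the query without hand-editing the encoded string is to build the URL from its parts. The sketch below reuses the parameter names and values from the URL in this tutorial and changes only q. buildScholarUrl is a hypothetical helper name.

```javascript
// Hypothetical helper: build a Google Scholar search URL for a given query.
// The parameter names and values are copied from the URL used in this tutorial.
function buildScholarUrl(query) {
  const params = new URLSearchParams({
    hl: "en",
    as_sdt: "0,5", // URLSearchParams encodes the comma as %2C
    q: query,      // spaces are encoded as "+"
    btnG: "",
  });
  return `https://scholar.google.com/scholar?${params.toString()}`;
}

console.log(buildScholarUrl("transformers"));
// prints "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
console.log(buildScholarUrl("graph neural networks"));
// prints "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=graph+neural+networks&btnG="
```

The returned string can be assigned to the url constant at the top of the script in place of the hard-coded value.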

This is great as a learning exercise, but a scraper like this sends every request from a single IP address, so it is prone to getting blocked. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high-speed rotating proxies located all over the world, automatic IP rotation, automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions), and automatic CAPTCHA solving technology, hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API, like the one below, from any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language such as Node.js or PHP, or with any framework such as Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
