Web Scraping Google Scholar in Node.js

Jan 21, 2024 · 7 min read

In this beginner web scraping tutorial, we'll walk through code that scrapes search results data from Google Scholar.

This is the Google Scholar result page we are talking about…

Overview

We'll be using Node.js for web scraping, with the following key packages:

  • request-promise: Sends HTTP requests to web pages and returns a Promise for handling the response. Useful for fetching web pages.
  • cheerio: Provides jQuery-style DOM manipulation of the page content. Allows easy data extraction from HTML.
    First we require these packages:

    const rp = require('request-promise');
    const cheerio = require('cheerio');
    

    Then we set up the initial scraper configuration:

    // Define the URL of the Google Scholar search page
    const url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
    
    // Define a User-Agent header
    const headers = {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    };
    
    // Configure the request options
    const options = {
      uri: url,
      headers: headers,
      transform: function (body) {
        return cheerio.load(body);
      }
    };
    

    This sets up the Google Scholar URL we want to scrape, adds a browser User-Agent string, and configures the request to use cheerio for HTML parsing.

    Making the Request

    With the configuration complete, we can now make the GET request:

    // Send a GET request to the URL with the User-Agent header
    rp(options)
      .then(($) => {
    
        // ... extract data here
    
      })
      .catch((error) => {
        console.error("Failed to retrieve the page:", error);
      });
    

    We pass the options to request-promise and chain a .then() to receive the cheerio-loaded HTML content. This content can now be traversed like a jQuery object.

    The .catch() handles any request errors.
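The .catch() above logs every failure the same way. As a rough sketch, the hypothetical helper below (describeRequestError is not part of request-promise) separates an HTTP error status from a network failure. It assumes that request-promise rejects non-2xx responses with an error named StatusCodeError carrying a statusCode property, and network failures with an error named RequestError; verify this against the version you install.

```javascript
// Hypothetical helper: turn a rejected request into a readable message.
// Assumes errors shaped like request-promise's StatusCodeError / RequestError.
function describeRequestError(error) {
  if (error && error.name === "StatusCodeError") {
    // The server answered, but with a non-2xx status code.
    if (error.statusCode === 429) {
      return "HTTP 429: too many requests, slow down or try again later";
    }
    return `HTTP ${error.statusCode}: the server rejected the request`;
  }
  if (error && error.name === "RequestError") {
    // The request never completed (DNS failure, timeout, refused connection).
    return `Network failure: ${error.message}`;
  }
  return `Unexpected error: ${error && error.message ? error.message : String(error)}`;
}

// Plain objects stand in for real errors here, purely for illustration.
console.log(describeRequestError({ name: "StatusCodeError", statusCode: 429 }));
console.log(describeRequestError({ name: "RequestError", message: "ETIMEDOUT" }));
```

Inside the scraper, this could be called as console.error(describeRequestError(error)) in the .catch() handler.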

    Extracting Search Result Data

    Inspecting the code

    You can see that each result item is enclosed in a <div> element with the class gs_ri.

    Inside the .then(), we can use the cheerio $ selector to find elements and extract data:

    // Find all the search result blocks with class "gs_ri"
    const search_results = $(".gs_ri");
    
    // Loop through each search result block
    search_results.each((index, element) => {
    
      // Extract data from each result...
    
    });
    

    We grab all the .gs_ri elements, which correspond to individual search result blocks on Google Scholar.

    Then we iterate through them with .each() to extract data from each one.

    Title and URL

    Let's get the title and URL of the search result:

    // Extract the title and URL
    const title_elem = $(element).find(".gs_rt");
    const title = title_elem.text() || "N/A";
    
    const url = title_elem.find("a").attr("href") || "N/A";
    

    We use .find() to get the title element, then .text() to return its text. For robustness, || "N/A" handles missing values.

    The linked URL is extracted directly from the anchor tag's href attribute.
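Note that the || "N/A" fallback only triggers on an empty string, so whitespace-only text slips through. The sketch below shows one way to tidy the raw title text before using it. cleanTitle is a hypothetical helper, and it assumes that Scholar sometimes prefixes titles with bracketed tags such as [PDF] or [BOOK]; adjust the pattern if the page you inspect looks different.

```javascript
// Hypothetical helper: tidy the raw text taken from a .gs_rt element.
// Assumption: titles may start with bracketed tags such as [PDF] or [BOOK].
function cleanTitle(rawTitle) {
  const cleaned = String(rawTitle || "")
    .replace(/^(\s*\[[A-Z]+\])+/, "") // drop leading bracketed tags
    .replace(/\s+/g, " ")             // collapse runs of whitespace
    .trim();
  return cleaned || "N/A";
}

console.log(cleanTitle("[PDF][PDF] Attention is all you need")); // Attention is all you need
console.log(cleanTitle("   "));                                  // N/A
```

In the scraper this would wrap the existing call, for example cleanTitle(title_elem.text()).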

    Authors

    Next up is author data:

    // Extract the authors
    const authors_elem = $(element).find(".gs_a");
    const authors = authors_elem.text() || "N/A";
    

    We simply grab the inner text of the .gs_a element, which holds the author names along with the publication details.
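If you want the authors and publication details as separate fields, the single .gs_a string can be split further. The following is a minimal sketch under an assumption about the text layout: authors, then publication and year, then source site, separated by " - ". parseAuthorLine is a hypothetical helper and the sample string is made up for illustration; a venue name that itself contains " - " would be split incorrectly.

```javascript
// Hypothetical helper: split a .gs_a string into structured fields.
// Assumed layout: "authors - publication, year - source".
function parseAuthorLine(line) {
  const parts = String(line || "")
    .replace(/\u00a0/g, " ") // treat non-breaking spaces as plain spaces
    .split(" - ")
    .map((part) => part.trim());

  const authors = (parts[0] || "")
    .split(",")
    .map((name) => name.replace(/\u2026$/, "").trim()) // drop a trailing ellipsis
    .filter(Boolean);

  const publication = parts[1] || "";
  const yearMatch = publication.match(/\b(19|20)\d{2}\b/);

  return {
    authors,
    publication: publication || "N/A",
    year: yearMatch ? Number(yearMatch[0]) : null,
    source: parts[2] || "N/A",
  };
}

// Made-up sample in the assumed layout, not real scraped output.
console.log(parseAuthorLine("A Author, B Writer\u2026 - Example Journal, 2020 - example.org"));
```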

    Abstract

    Finally, we extract the abstract text:

    // Extract the abstract or description
    const abstract_elem = $(element).find(".gs_rs");
    const abstract = abstract_elem.text() || "N/A";
    

    The .gs_rs element holds the result description and abstract.

    Printing the Results

    To finish, we log out all the information extracted from each search result:

    console.log("Title:", title);
    console.log("URL:", url);
    console.log("Authors:", authors);
    console.log("Abstract:", abstract);
    
    console.log("-".repeat(50)); // Separating search results
    

    This prints the title, URL, authors and abstract for inspection. The separating line keeps each result organized in the terminal output.
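Logging is handy for inspection, but structured output is easier to process later. As a sketch, assuming each result has been gathered into an object with the same four fields, the hypothetical toJsonLines helper below serialises them as one JSON object per line. The sample data is invented for illustration.

```javascript
// Hypothetical helper: serialise result objects as JSON Lines (one object per line).
function toJsonLines(results) {
  return results.map((result) => JSON.stringify(result)).join("\n");
}

// Sample data for illustration only, not real scraped output.
const sampleResults = [
  {
    title: "Example title",
    url: "https://example.com/paper",
    authors: "A Author - Example Journal, 2020 - example.org",
    abstract: "N/A",
  },
];

console.log(toJsonLines(sampleResults));
```

The output can be redirected to a file, for example with node filename.js > results.jsonl.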

    And that covers scraping key data fields from Google Scholar search results! The full code is included below to run as a complete scraper.

    Running the Scraper

    To run the web scraper code:

  • Install Node.js
  • Run npm install request-promise cheerio to install the packages
  • Paste the full code into a .js file
  • Run it with node filename.js

    Here is the complete Google Scholar scraping script:

    const rp = require('request-promise');
    const cheerio = require('cheerio');
    
    // Define the URL of the Google Scholar search page
    const url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
    
    // Define a User-Agent header
    const headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36" // Replace with your User-Agent string
    };
    
    // Configure the request options
    const options = {
        uri: url,
        headers: headers,
        transform: function (body) {
            return cheerio.load(body);
        }
    };
    
    // Send a GET request to the URL with the User-Agent header
    rp(options)
        .then(($) => {
            // Find all the search result blocks with class "gs_ri"
            const search_results = $(".gs_ri");
    
            // Loop through each search result block and extract information
            search_results.each((index, element) => {
                // Extract the title and URL
                const title_elem = $(element).find(".gs_rt");
                const title = title_elem.text() || "N/A";
                const url = title_elem.find("a").attr("href") || "N/A";
    
                // Extract the authors and publication details
                const authors_elem = $(element).find(".gs_a");
                const authors = authors_elem.text() || "N/A";
    
                // Extract the abstract or description
                const abstract_elem = $(element).find(".gs_rs");
                const abstract = abstract_elem.text() || "N/A";
    
                // Print the extracted information
                console.log("Title:", title);
                console.log("URL:", url);
                console.log("Authors:", authors);
                console.log("Abstract:", abstract);
                console.log("-".repeat(50)); // Separating search results
            });
        })
        .catch((error) => {
            console.error("Failed to retrieve the page:", error);
        });

    The output will show extracted data from search results for the query "transformers". Feel free to customize the search URL for other queries.
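To customize the query without hand-editing the URL, the search URL can be assembled from its parts. This is a small sketch using Node's built-in URL class; buildScholarUrl is a hypothetical helper, and the hl, as_sdt and btnG parameters are simply copied from the URL used in this tutorial.

```javascript
// Hypothetical helper: build a Google Scholar search URL for a given query.
// The hl, as_sdt and btnG values are copied from the URL used in this tutorial.
function buildScholarUrl(query) {
  const url = new URL("https://scholar.google.com/scholar");
  url.searchParams.set("hl", "en");
  url.searchParams.set("as_sdt", "0,5"); // serialised as 0%2C5
  url.searchParams.set("q", query);      // spaces are encoded as +
  url.searchParams.set("btnG", "");
  return url.toString();
}

console.log(buildScholarUrl("graph neural networks"));
```

Calling buildScholarUrl("transformers") reproduces the URL hard-coded in the script, so the result can be assigned to the url constant directly.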

    This is great as a learning exercise, but it is easy to see that the scraper itself is prone to getting blocked, as it uses a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP-blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high-speed rotating proxies located all over the world,
  • with our automatic IP rotation,
  • with our automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions),
  • and with our automatic CAPTCHA-solving technology,

    hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language like Node.js or PHP, or with any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
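From Node.js, the same request URL could be assembled before passing it to the scraper. This is a minimal sketch: buildProxiedUrl is a hypothetical helper, the endpoint and the key, render and url parameters are taken from the curl example, and percent-encoding the nested url value is an assumption made here (it keeps a target URL's own query string from being mistaken for API parameters).

```javascript
// Hypothetical helper: wrap a target URL in a Proxies API request URL.
// Endpoint and parameter names come from the curl example above;
// percent-encoding the nested url value is an assumption.
function buildProxiedUrl(apiKey, targetUrl, render = true) {
  const endpoint = new URL("http://api.proxiesapi.com/");
  endpoint.searchParams.set("key", apiKey);
  endpoint.searchParams.set("render", String(render));
  endpoint.searchParams.set("url", targetUrl);
  return endpoint.toString();
}

console.log(buildProxiedUrl("API_KEY", "https://example.com"));
```

The returned string could then replace the uri value in the request options, with API_KEY swapped for your own key.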
    
    

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
