Web Scraping Google Scholar in Java

Jan 21, 2024 · 7 min read

Web scraping is a technique for automatically extracting information from websites. In this comprehensive tutorial, we'll walk through an example Java program that scrapes search results data from Google Scholar.

This is the Google Scholar result page we are talking about…

Specifically, we'll learn how to use the popular Jsoup Java library to connect to Google Scholar, send search queries, and scrape key bits of data - title, URL, authors, and abstract text - from the search results pages.

Prerequisites

To follow along with the code examples below, you'll need:

  • Java and a text editor / IDE set up for writing and running Java code
  • Jsoup library added to your Java project. Jsoup handles connecting to web pages and selecting page elements. More details on getting set up with Jsoup here: https://jsoup.org/download
  • That's it! Jsoup handles most of the heavy lifting, so we can focus on the fun data extraction parts.
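
If you use a build tool, Jsoup is published on Maven Central under the org.jsoup:jsoup coordinates. A typical Maven dependency entry looks like this (check jsoup.org for the current version number; 1.17.2 is only used here as an example):

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
```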

    Walkthrough of the Web Scraper Code

    Let's break it down section by section.

    Imports

    We import Jsoup classes that allow connecting to web pages and selecting elements:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    

    Define URL and User-Agent

    Next we define the Google Scholar URL we want to scrape along with a common User-Agent header:

    // Define the URL of the Google Scholar search page
    String url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
    
    // Define a User-Agent header
    String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
    

    Quick web scraping tip - impersonating a real browser's User-Agent helps avoid bot detection.
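
One note on the URL: the query parameter q=transformers is hard-coded. If you want to search for arbitrary phrases, the query text should be URL-encoded first. Here's a minimal sketch using only the standard library (the buildSearchUrl helper is our own name, not part of Jsoup):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ScholarUrlBuilder {

    // Build a Google Scholar search URL for an arbitrary query string
    static String buildSearchUrl(String query) {
        // URLEncoder turns spaces and special characters into a safe form
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return "https://scholar.google.com/scholar?hl=en&q=" + encoded;
    }

    public static void main(String[] args) {
        // Spaces become "+" in the encoded query string
        System.out.println(buildSearchUrl("attention is all you need"));
    }
}
```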

    Connect to URL and Select Elements

    Inspecting the code

    Inspecting the page with your browser's developer tools, you can see that each result item is enclosed in a <div> element with the class gs_ri.

    The magic happens in this section where we:

    1. Use Jsoup to connect to the Google Scholar URL
    2. Select all search result elements on the page with select("div.gs_ri")
    // Send a GET request to the URL with the User-Agent header
    Document document = Jsoup.connect(url).userAgent(userAgent).get();
    
    // Find all the search result blocks with class "gs_ri"
    Elements searchResults = document.select("div.gs_ri");
    

    Let's break this down...

    The Jsoup connect() method reaches out to the web page and downloads the HTML content. We pass our URL and User-Agent to avoid bot checks.

    This HTML is stored in a Document variable that we query to extract data.
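
Jsoup.connect() actually returns a Connection object, so you can chain more settings - such as a timeout or a referrer - before get() fires the request. A small sketch (the timeout and referrer values here are arbitrary choices, not requirements):

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConfiguredConnection {

    public static void main(String[] args) {
        // Nothing is sent over the network until get() or post() is called
        Connection connection = Jsoup
                .connect("https://scholar.google.com/scholar?hl=en&q=transformers")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .referrer("https://scholar.google.com/")
                .timeout(10_000); // 10-second timeout, an arbitrary choice

        // Inspect the configured request before sending it
        System.out.println("Timeout: " + connection.request().timeout() + " ms");
        // Calling connection.get() at this point would perform the actual fetch
    }
}
```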

    The document.select() line is where we select elements from the page. Here we target the search result <div> tags with CSS class gs_ri using this selector syntax:

    div.gs_ri
    

    All matching elements get stored in an Elements collection that we can now loop through.

    Pro tip: Install browser developer tools to inspect elements and test selectors.
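
One way to follow that tip without hammering Google Scholar on every run: Jsoup can also parse an HTML string directly with Jsoup.parse(), so you can test selectors against a saved or hand-written snippet. The markup below is a simplified, hypothetical stand-in for Scholar's real result structure:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorCheck {

    // A trimmed-down, hypothetical stand-in for one Scholar result block
    static final String SAMPLE_HTML =
            "<div class=\"gs_ri\">"
            + "<h3 class=\"gs_rt\"><a href=\"https://example.org/paper\">Attention Is All You Need</a></h3>"
            + "<div class=\"gs_a\">A Vaswani, N Shazeer - NeurIPS, 2017</div>"
            + "<div class=\"gs_rs\">The dominant sequence transduction models...</div>"
            + "</div>";

    public static void main(String[] args) {
        // Parse the string in memory instead of fetching a live page
        Document document = Jsoup.parse(SAMPLE_HTML);

        // The same selector we use against the live page
        Elements results = document.select("div.gs_ri");
        System.out.println("Matched results: " + results.size());
        System.out.println("Title: " + results.first().selectFirst("h3.gs_rt").text());
    }
}
```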

    Extract Data from Search Results

    With search result elements selected, we can traverse each one and extract the inner text and attributes:

    // Loop through each search result block and extract information
    for (Element result : searchResults) {
    
      // Extract the title and URL
      Element titleElement = result.selectFirst("h3.gs_rt");
      String title = titleElement != null ? titleElement.text() : "N/A";
      // Guard the link lookup separately: citation-only results have a title but no <a>
      Element linkElement = titleElement != null ? titleElement.selectFirst("a") : null;
      String resultUrl = linkElement != null ? linkElement.attr("href") : "N/A";
    
      // Extract the authors and publication details
      Element authorsElement = result.selectFirst("div.gs_a");
      String authors = authorsElement != null ? authorsElement.text() : "N/A";
    
      // Extract the abstract or description
      Element abstractElement = result.selectFirst("div.gs_rs");
      String abstractText = abstractElement != null ? abstractElement.text() : "N/A";
    
      // Print the extracted information
      System.out.println("Title: " + title);
      System.out.println("URL: " + resultUrl);
      System.out.println("Authors: " + authors);
      System.out.println("Abstract: " + abstractText);
      System.out.println("-".repeat(50)); // Separating search results
    }
    

    We loop through each previously selected <div> element and extract data by targeting specific child tags:

  • Title - Select the <h3> tag with class gs_rt and get its .text()
  • URL - Get the anchor tag within the title element and read its href attribute
  • Authors - Select the <div> with class gs_a and get its .text()
  • Abstract - Select the <div> with class gs_rs and get its .text()

    The scraped pieces of data are printed, with each search result separated by dashes.
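
One edge case worth calling out: citation-only results (shown as "[CITATION]" on Scholar) have a title but no link, so selectFirst("a") can return null even when the title element exists, and each lookup needs its own null guard. A sketch against a hypothetical citation-only block:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class CitationOnlyExample {

    // Hypothetical citation-only result: a title, but no <a> link inside it
    static final String CITATION_ONLY =
            "<div class=\"gs_ri\">"
            + "<h3 class=\"gs_rt\">[CITATION] An older, offline-only paper</h3>"
            + "</div>";

    public static void main(String[] args) {
        Element result = Jsoup.parse(CITATION_ONLY).selectFirst("div.gs_ri");

        Element titleElement = result.selectFirst("h3.gs_rt");
        // Guard the link lookup separately: here the title exists but the <a> does not
        Element linkElement = titleElement != null ? titleElement.selectFirst("a") : null;

        String title = titleElement != null ? titleElement.text() : "N/A";
        String resultUrl = linkElement != null ? linkElement.attr("href") : "N/A";

        System.out.println("Title: " + title);
        System.out.println("URL: " + resultUrl);
    }
}
```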

    And that's it! The full code connects to Google Scholar, scrapes results, and extracts key pieces of data from each one.

    Let's quickly summarize the key concepts:

  • Use Jsoup to connect to web pages - Handling sessions, cookies, headers
  • Select elements with CSS-style selectors
  • Extract data - text, attributes, HTML
  • Loop through many elements for mass data scraping

    This core scraper recipe can be adapted to pull data from almost any site.

    Full Java Code for Scraping Google Scholar

    Here is the complete code example for scraping search results data from Google Scholar:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import java.io.IOException;
    
    public class GoogleScholarScraper {
    
        public static void main(String[] args) {
            // Define the URL of the Google Scholar search page
            String url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
    
            // Define a User-Agent header
            String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
    
            try {
                // Send a GET request to the URL with the User-Agent header
                Document document = Jsoup.connect(url).userAgent(userAgent).get();
    
                // Find all the search result blocks with class "gs_ri"
                Elements searchResults = document.select("div.gs_ri");
    
                // Loop through each search result block and extract information
                for (Element result : searchResults) {
                    // Extract the title and URL
                    Element titleElement = result.selectFirst("h3.gs_rt");
                    String title = titleElement != null ? titleElement.text() : "N/A";
                    // Guard the link lookup separately: citation-only results have a title but no <a>
                    Element linkElement = titleElement != null ? titleElement.selectFirst("a") : null;
                    String resultUrl = linkElement != null ? linkElement.attr("href") : "N/A";
    
                    // Extract the authors and publication details
                    Element authorsElement = result.selectFirst("div.gs_a");
                    String authors = authorsElement != null ? authorsElement.text() : "N/A";
    
                    // Extract the abstract or description
                    Element abstractElement = result.selectFirst("div.gs_rs");
                    String abstractText = abstractElement != null ? abstractElement.text() : "N/A";
    
                    // Print the extracted information
                    System.out.println("Title: " + title);
                    System.out.println("URL: " + resultUrl);
                    System.out.println("Authors: " + authors);
                    System.out.println("Abstract: " + abstractText);
                    System.out.println("-".repeat(50)); // Separating search results
                }
            } catch (IOException e) {
                System.err.println("Failed to retrieve the page. Error: " + e.getMessage());
            }
        }
    }

    This is great as a learning exercise, but it is easy to see that a scraper like this is prone to getting blocked, since every request comes from a single IP. In a scenario where you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP-blocked a lot by automatic location, usage, and bot-detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA-solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can fetch the data and parse it in any language like Node or PHP, or with any framework like Scrapy or Nutch - in all these cases, you simply call the URL with render support enabled.

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
