Web Scraping Google Scholar in Kotlin

Jan 21, 2024 · 7 min read

Google Scholar is an excellent resource for finding scholarly articles and studies on any topic. The search engine provides detailed information on publications, including the title, authors, abstract, citations, and more. This wealth of data also makes Google Scholar pages prime targets for web scraping.

This is the Google Scholar result page we are talking about…

In this beginner tutorial, we will walk through a full code example for scraping key details from Google Scholar search results using Jsoup in Kotlin.

Required Packages

To scrape web pages, we need a Java library that can retrieve and parse HTML content. Jsoup is a popular option that makes it easy to extract and manipulate data from HTML documents using a jQuery-style selector API.

We import the main Jsoup class along with Document, Element, and Elements, which we will use to represent the parsed page and the selected elements:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element
import org.jsoup.select.Elements

Walking Through the Code

Let's break down this full web scraping script step-by-step:

Define Target URL

We specify the root Google Scholar search URL that we want to scrape:

val url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="

This URL searches Google Scholar for the term "transformers".
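To search for other terms, the query has to be URL-encoded before it goes into the q parameter. A minimal sketch, assuming a hypothetical helper named buildScholarUrl (it is not part of Jsoup or the original script) and keeping the other parameters from the URL above unchanged:

```kotlin
import java.net.URLEncoder

// Hypothetical helper: builds a Scholar search URL for an arbitrary query.
// The hl and as_sdt values are copied from the tutorial URL.
fun buildScholarUrl(query: String): String {
    // URLEncoder produces form encoding, so spaces become '+'.
    val encoded = URLEncoder.encode(query, "UTF-8")
    return "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=$encoded&btnG="
}

fun main() {
    println(buildScholarUrl("graph neural networks"))
}
```

Encoding the query this way also keeps characters such as & or # in a search term from being read as part of the URL structure.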

Set a User-Agent Header

Many sites block requests missing a valid User-Agent string to prevent spam bots and scrapers. So we define a browser User-Agent:

val userAgent =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"

This mimics a Chrome browser on Windows.

Send GET Request

We use Jsoup to connect to the target URL and pass the User-Agent header to avoid blocks:

val document: Document = Jsoup.connect(url).userAgent(userAgent).get()

The get() method sends the request and retrieves the page HTML content as a parsed Document.
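By default, get() does not return an error page: Jsoup throws HttpStatusException for HTTP error responses (such as 403 or 429) and other IOException subclasses for network failures. A hedged sketch of wrapping the call follows; fetchOrNull and the 10-second timeout are illustrative choices, not part of the original script:

```kotlin
import org.jsoup.HttpStatusException
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import java.io.IOException

// Hypothetical wrapper: returns the parsed page, or null if the request fails.
fun fetchOrNull(url: String, userAgent: String): Document? =
    try {
        Jsoup.connect(url)
            .userAgent(userAgent)
            .timeout(10_000) // milliseconds; an arbitrary choice for this sketch
            .get()
    } catch (e: HttpStatusException) {
        // The server answered with an HTTP error status
        println("HTTP ${e.statusCode} for ${e.url}")
        null
    } catch (e: IOException) {
        // Timeouts, refused connections, unsupported content types, and so on
        println("Request failed: ${e.message}")
        null
    }

fun main() {
    val document = fetchOrNull(
        "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    )
    println(if (document != null) document.title() else "No document retrieved")
}
```

Returning null keeps the calling code simple; a larger scraper might prefer to rethrow or retry instead.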

Check Page Load Success

It's good practice to verify the page loaded properly before scraping:

if (document.title().contains("Google Scholar")) {
  // scraping code here
} else {
  println("Failed to retrieve the page.")
}

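We simply check whether the document title contains "Google Scholar". Note that this is a heuristic, not a status-code check: by default, get() throws an exception for HTTP error responses, so the title test mainly guards against receiving some other page in place of the results.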

Find Search Result Elements

Inspecting the page's HTML, you can see that each item is enclosed in a <div> element with the class gs_ri.

The key data we want lives within these <div> elements having class gs_ri, the individual search result blocks.

We use Jsoup's selector syntax to find all of them:

val searchResults: Elements = document.select("div.gs_ri")

This returns an Elements collection containing the search result elements to iterate through.

The selector div.gs_ri targets <div> tags whose class attribute includes gs_ri.

Jsoup selectors work much like jQuery or CSS by using tag names, IDs, classes, attributes, and more to target elements.
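To see how the class selector behaves without hitting the network, Jsoup.parse() can be run on an inline HTML string. The markup below is a made-up stand-in for a results page, not real Google Scholar HTML, and countResults is a hypothetical helper:

```kotlin
import org.jsoup.Jsoup

// Made-up HTML that imitates the structure described above.
val sampleHtml = """
    <div class="gs_ri"><h3 class="gs_rt"><a href="https://example.com/a">Paper A</a></h3></div>
    <div class="gs_ri extra"><h3 class="gs_rt">Paper B</h3></div>
    <div class="other"><h3 class="gs_rt">Not a result</h3></div>
""".trimIndent()

// Hypothetical helper: counts the elements matched by the tutorial's selector.
fun countResults(html: String): Int = Jsoup.parse(html).select("div.gs_ri").size

fun main() {
    // Matches the first two divs: a class selector matches any element whose
    // class list contains gs_ri, even when other classes are present.
    println(countResults(sampleHtml))
}
```

Testing selectors against a saved or hand-written snippet like this is a cheap way to iterate before sending real requests.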

Loop Through Results

With the search result elements found, we loop through each one to extract data:

for (result: Element in searchResults) {
  // extract data from each result element
}

result represents the current search result <div> as we iterate.

Extract Title

Inside the loop, we can now query within each result element to extract specific pieces of information.

To get the title, we select the <h3> tag with class gs_rt and access its .text() value:

val titleElem: Element? = result.selectFirst("h3.gs_rt")
val title: String = titleElem?.text() ?: "N/A"

selectFirst() returns the first matching Element, or null if nothing matches. We use Kotlin's null-safe operators ?. and ?: to fall back to "N/A" instead of failing when the element does not exist.

Extract URL

To get the linked URL within the h3 title element:

val url: String = titleElem?.selectFirst("a")?.attr("href") ?: "N/A"

Here we select the child link inside titleElem and retrieve its href attribute with .attr().
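Two details are worth knowing here. attr() returns an empty string, not null, when the attribute is missing, so the ?: "N/A" fallback only applies when no <a> element is found. And absUrl() resolves relative links against the document's base URI. A small sketch on made-up HTML, where linkOrNA is a hypothetical helper and the base URI passed to Jsoup.parse() is chosen for illustration:

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Element

// Hypothetical helper: absolute link from a title element, or "N/A".
fun linkOrNA(titleElem: Element?): String {
    val link = titleElem?.selectFirst("a") ?: return "N/A"
    // absUrl() returns "" if the attribute is missing or cannot be resolved
    return link.absUrl("href").ifEmpty { "N/A" }
}

fun main() {
    // Made-up markup: a relative link, a link without href, and no link at all.
    val html = """
        <h3 class="gs_rt"><a href="/scholar?cluster=123">Relative link</a></h3>
        <h3 class="gs_rt"><a>No href</a></h3>
        <h3 class="gs_rt">No link</h3>
    """.trimIndent()
    // The second argument is the base URI used to resolve relative links.
    val doc = Jsoup.parse(html, "https://scholar.google.com/")
    for (titleElem in doc.select("h3.gs_rt")) {
        println(linkOrNA(titleElem))
    }
}
```

Documents fetched with Jsoup.connect(...).get() already carry the request URL as their base URI, so absUrl() works on them without extra setup.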

Extract Authors & Details

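For author names and other metadata shown below the title, we grab the <div> with class gs_a and get its inner .text():

val authorsElem: Element? = result.selectFirst("div.gs_a")
val authors: String = authorsElem?.text() ?: "N/A"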

Extract the Abstract

Finally, to obtain the paper's abstract or excerpt text within its containing element:

val abstractElem: Element? = result.selectFirst("div.gs_rs")
val abstract: String = abstractElem?.text() ?: "N/A"

Print Scraped Information

As the last step inside the loop, we print out the scraped info neatly:

println("Title: $title")
println("URL: $url")
println("Authors: $authors")
println("Abstract: $abstract")

println("-".repeat(50)) // Separator lines between results

When finished, we'll have extracted the core metadata from Google Scholar for each search result.
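Printing is fine for a demo, but results are usually easier to reuse as data. One possible shape, assuming a hypothetical ScholarResult data class and parseResults function (neither is in the original script), using the same selectors as above:

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Hypothetical container for one search result.
data class ScholarResult(
    val title: String,
    val url: String,
    val authors: String,
    val snippet: String
)

// Applies the tutorial's selectors to every div.gs_ri block.
fun parseResults(document: Document): List<ScholarResult> =
    document.select("div.gs_ri").map { result ->
        val titleElem = result.selectFirst("h3.gs_rt")
        ScholarResult(
            title = titleElem?.text() ?: "N/A",
            url = titleElem?.selectFirst("a")?.attr("href") ?: "N/A",
            authors = result.selectFirst("div.gs_a")?.text() ?: "N/A",
            snippet = result.selectFirst("div.gs_rs")?.text() ?: "N/A"
        )
    }

fun main() {
    // Made-up markup that mimics one result block; not real Scholar HTML.
    val html = """
        <div class="gs_ri">
          <h3 class="gs_rt"><a href="https://example.com/paper">Example Paper</a></h3>
          <div class="gs_a">A Author, B Author - Example Journal, 2020</div>
          <div class="gs_rs">An example abstract snippet.</div>
        </div>
    """.trimIndent()
    parseResults(Jsoup.parse(html)).forEach(::println)
}
```

A list of data objects can then be filtered, sorted, or written to CSV or JSON without touching the parsing code.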

Summary

That covers the key steps to scrape a Google Scholar search page with Kotlin and Jsoup:

  1. Import Jsoup and models
  2. Define the target URL
  3. Set a User-Agent string
  4. Send GET request
  5. Check page load status
  6. Use selectors to extract elements
  7. Loop through elements
  8. Print scraped data

Next we'll cover the basics of getting Jsoup installed and set up.

Installation

To use Jsoup, you need:

  • Java JDK 8+
  • A build tool like Gradle or Maven
  • The Jsoup dependency

Here is a sample Gradle configuration:

    dependencies {
      implementation 'org.jsoup:jsoup:1.14.3'
    }
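
The snippet above uses Gradle's Groovy DSL. If the project uses the Kotlin DSL (build.gradle.kts) instead, an equivalent sketch could look like this. The Kotlin plugin version shown is an assumption; use whichever version your project already targets:

```kotlin
// build.gradle.kts
plugins {
    kotlin("jvm") version "1.9.22" // assumed version, adjust as needed
}

repositories {
    mavenCentral() // Jsoup is published to Maven Central
}

dependencies {
    implementation("org.jsoup:jsoup:1.14.3")
}
```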
    

Now you can import Jsoup and start loading web pages.

Full Code Example

Here again is the full Google Scholar scraping script covered in this guide:

    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    import org.jsoup.nodes.Element
    import org.jsoup.select.Elements
    
    fun main() {
        // Define the URL of the Google Scholar search page
        val url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
    
        // Define a User-Agent header
        val userAgent =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    
        // Send a GET request to the URL with the User-Agent header
        val document: Document = Jsoup.connect(url).userAgent(userAgent).get()
    
        // Check that the expected page was returned by inspecting its title
        if (document.title().contains("Google Scholar")) {
            // Find all the search result blocks with class "gs_ri"
            val searchResults: Elements = document.select("div.gs_ri")
    
            // Loop through each search result block and extract information
            for (result: Element in searchResults) {
                // Extract the title and URL
                val titleElem: Element? = result.selectFirst("h3.gs_rt")
                val title: String = titleElem?.text() ?: "N/A"
                val url: String = titleElem?.selectFirst("a")?.attr("href") ?: "N/A"
    
                // Extract the authors and publication details
                val authorsElem: Element? = result.selectFirst("div.gs_a")
                val authors: String = authorsElem?.text() ?: "N/A"
    
                // Extract the abstract or description
                val abstractElem: Element? = result.selectFirst("div.gs_rs")
                val abstract: String = abstractElem?.text() ?: "N/A"
    
                // Print the extracted information
                println("Title: $title")
                println("URL: $url")
                println("Authors: $authors")
                println("Abstract: $abstract")
                println("-".repeat(50)) // Separating search results
            }
        } else {
            println("Failed to retrieve the page.")
        }
    }

This is great as a learning exercise, but it is easy to see that the scraper itself is prone to getting blocked because it uses a single IP address. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP-blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server, Proxies API, provides a simple API that can solve all IP blocking problems instantly. It offers:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA-solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API, like the one below, from any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language, such as Node.js or PHP, or with any framework, such as Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
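
The same endpoint can be called from Kotlin. The sketch below only builds the request URL from the key, render, and url parameters shown in the curl command. buildProxiesApiUrl is a hypothetical helper, and percent-encoding the target URL is an assumption based on standard query-string rules, so that a target containing its own & characters is passed through intact:

```kotlin
import java.net.URLEncoder

// Hypothetical helper: wraps a target URL in a Proxies API request URL.
fun buildProxiesApiUrl(apiKey: String, targetUrl: String, render: Boolean = true): String {
    // Encode the target so its own '?' and '&' are not read as API parameters.
    val encodedTarget = URLEncoder.encode(targetUrl, "UTF-8")
    return "http://api.proxiesapi.com/?key=$apiKey&render=$render&url=$encodedTarget"
}

fun main() {
    val apiUrl = buildProxiesApiUrl(
        "API_KEY", // placeholder, as in the curl example
        "https://scholar.google.com/scholar?hl=en&q=transformers"
    )
    println(apiUrl)
    // The result could then be passed to Jsoup.connect(apiUrl).get() as before.
}
```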
    
    

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
