Web Scraping Google Scholar in R

Jan 21, 2024 · 6 min read

This article explains how to scrape Google Scholar search results in R by walking through a fully functional example script. We will extract the title, URL, authors, and abstract for each search result.


Prerequisites

To follow along, you'll need:

  • R installed on your machine
  • The rvest and httr packages, which we'll use for fetching and scraping. Install them by running install.packages(c("rvest", "httr"))

Sending a Request

We begin by defining the URL of a Google Scholar search page:

url <- "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
    

Next, we set a User-Agent header to make the request look like it comes from a regular browser:

user_agent <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    
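Since any single User-Agent string can itself become a blocking signal over many requests, one common refinement (a sketch, not part of the original script; the strings below are illustrative) is to rotate among several plausible ones:

```r
# A few example desktop User-Agent strings (values are illustrative)
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
)

# Pick one at random for each request
user_agent <- sample(user_agents, 1)
```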

We send a GET request to the URL using httr's GET(), passing in our user agent, and then parse the response body with read_html():

response <- GET(url, user_agent(user_agent))
page <- read_html(response)

Finally, we check if the request succeeded by verifying the status code is 200:

if (status_code(response) == 200) {
  # Scrape page
} else {
  # Handle error
}
    

So in just a few lines, we've programmatically sent a request to Google Scholar posing as a real browser!
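The request-and-check steps can be wrapped in a small reusable helper. This is a sketch, assuming the httr and rvest packages are installed; fetch_page is a hypothetical name, not part of the original script:

```r
library(httr)
library(rvest)

# Hypothetical helper: fetch a URL with a custom User-Agent and fail
# loudly on any non-200 response.
fetch_page <- function(url, ua) {
  response <- GET(url, user_agent(ua))
  if (status_code(response) != 200) {
    stop("Request failed with status ", status_code(response))
  }
  read_html(response)  # parse the response body into an HTML document
}
```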

Extracting Search Results

Inspecting the code

If you inspect the page in your browser's developer tools, you can see that each result item is enclosed in a <div> element with the class gs_ri.

Now that we have the page contents, we can parse through them to extract data.

Google Scholar conveniently puts each search result within a <div> tag having class gs_ri. We use the html_nodes() function to find all of these nodes and loop through them:

search_results <- html_nodes(page, ".gs_ri")

for (result in search_results) {
  # Extract data from each result
}
    

Title and URL

Within each search result <div>, the title is contained in an element with class gs_rt, and the link to the paper is the <a> tag nested inside it. We use html_node() to find the elements, html_text() to extract the title text, and html_attr() to get the href linking to the paper.

title_elem <- html_node(result, ".gs_rt")
title <- html_text(title_elem, trim = TRUE)
url <- html_attr(html_node(title_elem, "a"), "href")
    

Authors

The authors are stored similarly, under an element with class gs_a:

authors_elem <- html_node(result, ".gs_a")
authors <- html_text(authors_elem, trim = TRUE)
    

Abstract

Finally, the abstract lives under gs_rs:

abstract_elem <- html_node(result, ".gs_rs")
abstract <- html_text(abstract_elem, trim = TRUE)
    

By using the element classes as CSS selectors, we've cleanly extracted all the data we want!
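These selectors can be exercised offline against a mock snippet. The class names below match Google Scholar's markup as described above, but the title, authors, abstract, and URL are made up for illustration:

```r
library(rvest)

# A made-up snippet mimicking the structure of one search result
html <- minimal_html('
  <div class="gs_ri">
    <h3 class="gs_rt"><a href="https://example.com/paper">Example Paper Title</a></h3>
    <div class="gs_a">A. Author, B. Author - Example Journal, 2024</div>
    <div class="gs_rs">A short abstract snippet describing the paper.</div>
  </div>')

result   <- html_node(html, ".gs_ri")
title    <- html_text(html_node(result, ".gs_rt"), trim = TRUE)
url      <- html_attr(html_node(result, ".gs_rt a"), "href")
authors  <- html_text(html_node(result, ".gs_a"), trim = TRUE)
abstract <- html_text(html_node(result, ".gs_rs"), trim = TRUE)

cat(title, "|", url, "\n")
```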

Printing the Results

We wrap up by printing out the extracted information - title, URL, authors, and abstract - for diagnostics. The full contents are now programmatically accessible for further analysis.

cat("Title:", title, "\n")
cat("URL:", url, "\n")
cat("Authors:", authors, "\n")
cat("Abstract:", abstract, "\n")
cat(strrep("-", 50), "\n")  # Separator
    

The key in scraping is meticulously analyzing the HTML structure to locate the data you want. Tools like the browser's developer console are invaluable for this.

Once you've identified the right selectors, parsing and extraction become straightforward.

Full Code

Here is the complete script for reference:

# Load the required libraries
library(rvest)
library(httr)

# Define the URL of the Google Scholar search page
url <- "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="

# Define a User-Agent header
user_agent <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"

# Send a GET request to the URL with the User-Agent header
response <- GET(url, user_agent(user_agent))

# Check if the request was successful (HTTP status code 200)
if (status_code(response) == 200) {
  # Parse the response body into an HTML document
  page <- read_html(response)

  # Find all the search result blocks with class "gs_ri"
  search_results <- html_nodes(page, ".gs_ri")

  # Loop through each search result block and extract information
  for (result in search_results) {
    # Extract the title and URL (the link is the <a> inside .gs_rt)
    title_elem <- html_node(result, ".gs_rt")
    title <- html_text(title_elem, trim = TRUE)
    url <- html_attr(html_node(title_elem, "a"), "href")

    # Extract the authors and publication details
    authors_elem <- html_node(result, ".gs_a")
    authors <- html_text(authors_elem, trim = TRUE)

    # Extract the abstract or description
    abstract_elem <- html_node(result, ".gs_rs")
    abstract <- html_text(abstract_elem, trim = TRUE)

    # Print the extracted information
    cat("Title:", title, "\n")
    cat("URL:", url, "\n")
    cat("Authors:", authors, "\n")
    cat("Abstract:", abstract, "\n")
    cat(strrep("-", 50), "\n")  # Separating search results
  }
} else {
  cat("Failed to retrieve the page. Status code:", status_code(response), "\n")
}
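Instead of printing, you may prefer to collect the results into a data frame for further analysis. A minimal sketch of that pattern, using a made-up two-result snippet in place of the live page so it runs offline:

```r
library(rvest)

# Two made-up results using Google Scholar's class names
page <- minimal_html('
  <div class="gs_ri">
    <h3 class="gs_rt"><a href="https://example.com/a">Paper A</a></h3>
    <div class="gs_a">A. Author - Journal, 2023</div>
    <div class="gs_rs">Abstract A.</div>
  </div>
  <div class="gs_ri">
    <h3 class="gs_rt"><a href="https://example.com/b">Paper B</a></h3>
    <div class="gs_a">B. Author - Journal, 2024</div>
    <div class="gs_rs">Abstract B.</div>
  </div>')

search_results <- html_nodes(page, ".gs_ri")

# Build one row per result, then bind the rows together
results_df <- do.call(rbind, lapply(search_results, function(result) {
  data.frame(
    title    = html_text(html_node(result, ".gs_rt"), trim = TRUE),
    url      = html_attr(html_node(result, ".gs_rt a"), "href"),
    authors  = html_text(html_node(result, ".gs_a"), trim = TRUE),
    abstract = html_text(html_node(result, ".gs_rs"), trim = TRUE),
    stringsAsFactors = FALSE
  )
}))
```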

This is great as a learning exercise, but a script like this, which sends every request from a single IP, is easy to block. If you need a setup that handles thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP-blocked a lot by automatic location, usage, and bot-detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA-solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with this simple API, which can be accessed from any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes; you can just fetch the data and parse it in any language like Node, PHP, or Python, or with any framework like Scrapy or Nutch. In all these cases, you can just call the URL with render support like so:

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
    
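The same call can be built from R with httr. A sketch, where API_KEY stays a placeholder and the query parameters mirror the curl example above; the GET line is left commented out so nothing is actually sent:

```r
library(httr)

# Assemble the request URL; modify_url() URL-encodes the query values
request_url <- modify_url(
  "http://api.proxiesapi.com/",
  query = list(key = "API_KEY", render = "true", url = "https://example.com")
)

# response <- GET(request_url)  # uncomment to actually send the request
cat(request_url, "\n")
```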

We have a running offer of 1000 API calls completely free. Register and get your free API Key.

    Tired of getting blocked while scraping the web?

    ProxiesAPI handles headless browsers and rotates proxies for you.
    Get access to 1,000 free API credits, no credit card required!