Web Scraping Google Scholar in R

Jan 21, 2024 · 6 min read

This article explains how to scrape Google Scholar search results in R by walking through a fully functional example script. We will extract the title, URL, authors, and abstract for each search result.


Prerequisites

To follow along, you'll need:

  • R installed on your machine
  • The rvest and httr packages, which we'll use for fetching and scraping. Install them by running install.packages(c("rvest", "httr"))

Sending a Request

We begin by defining the URL of a Google Scholar search page:

url <- "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
    

Next, we set a User-Agent header to make the request look like it comes from a regular browser:

user_agent <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    
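Since any single User-Agent string can itself become a blocking signal over many requests, one common refinement (a sketch, not part of the original script; the strings below are illustrative) is to rotate among several plausible ones:

```r
# A few example desktop User-Agent strings (values are illustrative)
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
)

# Pick one at random for each request
user_agent <- sample(user_agents, 1)
```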

We send a GET request to the URL using httr's GET(), passing in our user agent, and then parse the response body with read_html():

response <- GET(url, user_agent(user_agent))
page <- read_html(response)

Finally, we check if the request succeeded by verifying the status code is 200:

if (status_code(response) == 200) {
  # Scrape page
} else {
  # Handle error
}
    

So in just a few lines, we've programmatically sent a request to Google Scholar posing as a real browser!
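The request-and-check steps can be wrapped in a small reusable helper. This is a sketch, assuming the httr and rvest packages are installed; fetch_page is a hypothetical name, not part of the original script:

```r
library(httr)
library(rvest)

# Hypothetical helper: fetch a URL with a custom User-Agent and fail
# loudly on any non-200 response.
fetch_page <- function(url, ua) {
  response <- GET(url, user_agent(ua))
  if (status_code(response) != 200) {
    stop("Request failed with status ", status_code(response))
  }
  read_html(response)  # parse the response body into an HTML document
}
```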

Extracting Search Results

Inspecting the code

If you inspect the page in your browser's developer tools, you can see that each result item is enclosed in a <div> element with the class gs_ri.

Now that we have the page contents, we can parse through them to extract data.

Google Scholar conveniently puts each search result within a <div> tag having class gs_ri. We use the html_nodes() function to find all of these nodes and loop through them:

search_results <- html_nodes(page, ".gs_ri")

for (result in search_results) {
  # Extract data from each result
}
    

Title and URL

Within each search result <div>, the title is contained in an element with class gs_rt, and the link to the paper is the <a> tag nested inside it. We use html_node() to find the elements, html_text() to extract the title text, and html_attr() to get the href linking to the paper.

title_elem <- html_node(result, ".gs_rt")
title <- html_text(title_elem, trim = TRUE)
url <- html_attr(html_node(title_elem, "a"), "href")
    

Authors

The authors are stored similarly, under an element with class gs_a:

authors_elem <- html_node(result, ".gs_a")
authors <- html_text(authors_elem, trim = TRUE)
    

Abstract

Finally, the abstract lives under gs_rs:

abstract_elem <- html_node(result, ".gs_rs")
abstract <- html_text(abstract_elem, trim = TRUE)
    

By using the element classes as CSS selectors, we've cleanly extracted all the data we want!
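These selectors can be exercised offline against a mock snippet. The class names below match Google Scholar's markup as described above, but the title, authors, abstract, and URL are made up for illustration:

```r
library(rvest)

# A made-up snippet mimicking the structure of one search result
html <- minimal_html('
  <div class="gs_ri">
    <h3 class="gs_rt"><a href="https://example.com/paper">Example Paper Title</a></h3>
    <div class="gs_a">A. Author, B. Author - Example Journal, 2024</div>
    <div class="gs_rs">A short abstract snippet describing the paper.</div>
  </div>')

result   <- html_node(html, ".gs_ri")
title    <- html_text(html_node(result, ".gs_rt"), trim = TRUE)
url      <- html_attr(html_node(result, ".gs_rt a"), "href")
authors  <- html_text(html_node(result, ".gs_a"), trim = TRUE)
abstract <- html_text(html_node(result, ".gs_rs"), trim = TRUE)

cat(title, "|", url, "\n")
```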

Printing the Results

We wrap up by printing out the extracted information - title, URL, authors, and abstract - for diagnostics. The full contents are now programmatically accessible for further analysis.

cat("Title:", title, "\n")
cat("URL:", url, "\n")
cat("Authors:", authors, "\n")
cat("Abstract:", abstract, "\n")
cat(strrep("-", 50), "\n")  # Separator
    

The key in scraping is meticulously analyzing the HTML structure to locate the data you want. Tools like the browser's developer console are invaluable for this.

Once you've identified the right selectors, parsing and extraction become straightforward.

Full Code

Here is the complete script for reference:

# Load the required libraries
library(rvest)
library(httr)

# Define the URL of the Google Scholar search page
url <- "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="

# Define a User-Agent header
user_agent <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"

# Send a GET request to the URL with the User-Agent header
response <- GET(url, user_agent(user_agent))

# Check if the request was successful (HTTP status code 200)
if (status_code(response) == 200) {
  # Parse the response body into an HTML document
  page <- read_html(response)

  # Find all the search result blocks with class "gs_ri"
  search_results <- html_nodes(page, ".gs_ri")

  # Loop through each search result block and extract information
  for (result in search_results) {
    # Extract the title and URL (the link is the <a> inside .gs_rt)
    title_elem <- html_node(result, ".gs_rt")
    title <- html_text(title_elem, trim = TRUE)
    url <- html_attr(html_node(title_elem, "a"), "href")

    # Extract the authors and publication details
    authors_elem <- html_node(result, ".gs_a")
    authors <- html_text(authors_elem, trim = TRUE)

    # Extract the abstract or description
    abstract_elem <- html_node(result, ".gs_rs")
    abstract <- html_text(abstract_elem, trim = TRUE)

    # Print the extracted information
    cat("Title:", title, "\n")
    cat("URL:", url, "\n")
    cat("Authors:", authors, "\n")
    cat("Abstract:", abstract, "\n")
    cat(strrep("-", 50), "\n")  # Separating search results
  }
} else {
  cat("Failed to retrieve the page. Status code:", status_code(response), "\n")
}
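Instead of printing, you may prefer to collect the results into a data frame for further analysis. A minimal sketch of that pattern, using a made-up two-result snippet in place of the live page so it runs offline:

```r
library(rvest)

# Two made-up results using Google Scholar's class names
page <- minimal_html('
  <div class="gs_ri">
    <h3 class="gs_rt"><a href="https://example.com/a">Paper A</a></h3>
    <div class="gs_a">A. Author - Journal, 2023</div>
    <div class="gs_rs">Abstract A.</div>
  </div>
  <div class="gs_ri">
    <h3 class="gs_rt"><a href="https://example.com/b">Paper B</a></h3>
    <div class="gs_a">B. Author - Journal, 2024</div>
    <div class="gs_rs">Abstract B.</div>
  </div>')

search_results <- html_nodes(page, ".gs_ri")

# Build one row per result, then bind the rows together
results_df <- do.call(rbind, lapply(search_results, function(result) {
  data.frame(
    title    = html_text(html_node(result, ".gs_rt"), trim = TRUE),
    url      = html_attr(html_node(result, ".gs_rt a"), "href"),
    authors  = html_text(html_node(result, ".gs_a"), trim = TRUE),
    abstract = html_text(html_node(result, ".gs_rs"), trim = TRUE),
    stringsAsFactors = FALSE
  )
}))
```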

This is great as a learning exercise, but a script like this, which sends every request from a single IP, is easy to block. If you need a setup that handles thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP-blocked a lot by automatic location, usage, and bot-detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA-solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with this simple API, which can be accessed from any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes; you can just fetch the data and parse it in any language like Node, PHP, or Python, or with any framework like Scrapy or Nutch. In all these cases, you can just call the URL with render support like so:

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
    
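The same call can be built from R with httr. A sketch, where API_KEY stays a placeholder and the query parameters mirror the curl example above; the GET line is left commented out so nothing is actually sent:

```r
library(httr)

# Assemble the request URL; modify_url() URL-encodes the query values
request_url <- modify_url(
  "http://api.proxiesapi.com/",
  query = list(key = "API_KEY", render = "true", url = "https://example.com")
)

# response <- GET(request_url)  # uncomment to actually send the request
cat(request_url, "\n")
```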

We have a running offer of 1000 API calls completely free. Register and get your free API Key.

    Tired of getting blocked while scraping the web?

    ProxiesAPI handles headless browsers and rotates proxies for you.
    Get access to 1,000 free API credits, no credit card required!