Web Scraping Google Scholar in Ruby

Jan 21, 2024 · 6 min read

In this post, we'll walk through a real-world Ruby script that scrapes search result data from Google Scholar. We'll go step-by-step to understand exactly how it works.

This is the Google Scholar result page we are talking about…

Overview

The goal of this script is straightforward: retrieve search result data from a Google Scholar query. This includes:

  • Title
  • URL
  • Authors
  • Abstract snippet

Rather than relying on an API (Google Scholar doesn't offer an official one), we'll request the result page's HTML directly and parse it.

Let's dive into the code!

Setup

First we require the libraries we need for making HTTP requests and parsing HTML:

    require 'nokogiri'
    require 'open-uri'
    

Nokogiri lets us extract data from HTML and XML in Ruby. We'll use it to parse Google's response.

OpenURI makes sending HTTP requests easy from Ruby.
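Before we touch Google's page, here is a tiny, self-contained taste of what Nokogiri does. The HTML fragment is made up for illustration, but it's shaped like the Scholar result blocks we'll meet later:

    require 'nokogiri'

    # A made-up fragment shaped like a Scholar result block
    snippet = '<div class="gs_ri"><h3 class="gs_rt"><a href="https://example.com">Hello</a></h3></div>'

    doc = Nokogiri::HTML(snippet)
    puts doc.css("h3.gs_rt").first.text    # => Hello
    puts doc.at("h3.gs_rt a")["href"]      # => https://example.com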

Defining the Request

Next we set up the URL and headers for our request:

    url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
    
    headers = {
      "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    }
    

The URL performs a Google Scholar search for "transformers".

The headers hash sets the User-Agent string - sending a browser-like User-Agent helps avoid getting the request blocked outright.
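If you want to search for something other than "transformers", one way to build the query string safely is URI.encode_www_form from Ruby's standard library. This is a small sketch; the parameter names mirror the URL above:

    require 'uri'

    query = "attention is all you need"   # any search phrase

    # hl = interface language, as_sdt = search type, q = the actual query
    params = URI.encode_www_form(hl: "en", as_sdt: "0,5", q: query)
    url = "https://scholar.google.com/scholar?#{params}"
    # => "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=attention+is+all+you+need"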

Making the Request

With the URL and headers ready, we use OpenURI to send the GET request:

    response = URI.open(url, headers)
    

We pass the URL along with our headers hash; OpenURI treats its string keys (here, User-Agent) as request headers.

This gives us back a response containing the raw HTML result of the Google Scholar search.
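The object OpenURI returns is an IO-like object (a StringIO or Tempfile) extended with OpenURI::Meta, so you can peek at a few details before parsing. A quick sketch:

    puts response.content_type       # e.g. "text/html"
    puts response.charset            # e.g. "utf-8"
    puts response.status.inspect     # e.g. ["200", "OK"]

    # Reading gives you the raw HTML as a String. Note that this consumes the IO,
    # so if you read here, pass `html` to Nokogiri later (or call response.rewind).
    html = response.read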

Checking the Response

Before parsing, we check that the request succeeded:

    if response.status == ["200", "OK"]
      # Parse HTML
    else
      puts "Failed to retrieve the page. Status code: #{response.status[0]}"
    end
    

A status code of 200 means success. Any other code likely means an error or blocked request.

We print a failure message in that case.
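One caveat: for non-2xx responses OpenURI doesn't return an object at all - it raises OpenURI::HTTPError - so the else branch above acts mostly as a safety net. To report blocked or failed requests gracefully, you can rescue the exception. A small sketch:

    begin
      response = URI.open(url, headers)
    rescue OpenURI::HTTPError => e
      # e.io is the error response; its status is an array such as ["403", "Forbidden"]
      puts "Failed to retrieve the page. Status code: #{e.io.status[0]}"
      exit
    end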

Parsing the HTML

Inspecting the page source, you can see that each result item is enclosed in a <div> element with the class gs_ri.

Now we can parse the HTML search results with Nokogiri:

    doc = Nokogiri::HTML(response)
    
    search_results = doc.css("div.gs_ri")
    

We initialize a Nokogiri doc from the HTML response.

The doc.css() method lets us use CSS selectors to extract data. Here we grab all <div> elements with class gs_ri, which contain the individual search result blocks.
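It's worth sanity-checking that the selector actually matched something. If Google served a CAPTCHA or block page instead of results, the NodeSet will simply be empty:

    puts "Found #{search_results.size} result blocks"   # typically 10 per results page

    if search_results.empty?
      puts "No div.gs_ri elements found - the page may be a CAPTCHA or block page"
    end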

Extracting Search Result Data

With the search result elements selected, we can extract fields:

    search_results.each do |result|
    
      title_elem = result.css("h3.gs_rt").first
      title = title_elem&.text || "N/A"
    
      url = title_elem&.at("a")&.attr("href") || "N/A"
    
      authors_elem = result.css("div.gs_a").first
      authors = authors_elem&.text || "N/A"
    
      abstract_elem = result.css("div.gs_rs").first
      abstract = abstract_elem&.text || "N/A"
    
      # Print output
    end
    

We loop through each result block.

The key part is using CSS selectors to extract elements, then getting text or attributes from those.

For example:

  • result.css("h3.gs_rt") selects the title element
  • We take its .text content and read the href attribute from the <a> tag nested inside it
  • Fallbacks like || "N/A" handle missing data

This may look confusing at first! But when you break it down selector-by-selector, you can understand exactly how we extract each data field.
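The safe-navigation operator (&.) is what keeps the script from crashing when a field is missing: it returns nil instead of raising NoMethodError, and the || fallback then substitutes "N/A". A standalone illustration:

    missing = result.css("div.does_not_exist").first   # => nil, no such element

    missing&.text            # => nil, instead of a NoMethodError
    missing&.text || "N/A"   # => "N/A"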

Printing Output

Finally, we print the extracted data:

    puts "Title: #{title}"
    puts "URL: #{url}"
    puts "Authors: #{authors}"
    puts "Abstract: #{abstract}"
    puts "-" * 50 # Separator
    

This outputs each search result's data to the console.
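If you'd rather keep the data than just print it, one variation is to map each block into a hash and dump the whole array as JSON. This is a sketch using the same selectors as above; the output filename is arbitrary:

    require 'json'

    results = search_results.map do |result|
      title_elem = result.css("h3.gs_rt").first
      {
        title:    title_elem&.text || "N/A",
        url:      title_elem&.at("a")&.attr("href") || "N/A",
        authors:  result.css("div.gs_a").first&.text || "N/A",
        abstract: result.css("div.gs_rs").first&.text || "N/A"
      }
    end

    File.write("scholar_results.json", JSON.pretty_generate(results))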

Full Code

For easy reference, here is the complete script:

    require 'nokogiri'
    require 'open-uri'
    
    # Define the URL of the Google Scholar search page
    url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
    
    # Define a User-Agent header
    headers = {
      "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"  # Replace with your User-Agent string
    }
    
    # Send a GET request to the URL with the User-Agent header
    response = URI.open(url, headers)
    
    # Check if the request was successful (status code 200)
    if response.status == ["200", "OK"]
      # Parse the HTML content of the page using Nokogiri
      doc = Nokogiri::HTML(response)
    
      # Find all the search result blocks with class "gs_ri"
      search_results = doc.css("div.gs_ri")
    
      # Loop through each search result block and extract information
      search_results.each do |result|
        # Extract the title and URL
        title_elem = result.css("h3.gs_rt").first
        title = title_elem&.text || "N/A"
        url = title_elem&.at("a")&.attr("href") || "N/A"
    
        # Extract the authors and publication details
        authors_elem = result.css("div.gs_a").first
        authors = authors_elem&.text || "N/A"
    
        # Extract the abstract or description
        abstract_elem = result.css("div.gs_rs").first
        abstract = abstract_elem&.text || "N/A"
    
        # Print the extracted information
        puts "Title: #{title}"
        puts "URL: #{url}"
        puts "Authors: #{authors}"
        puts "Abstract: #{abstract}"
        puts "-" * 50  # Separating search results
      end
    else
      puts "Failed to retrieve the page. Status code: #{response.status[0]}"
    end

This is great as a learning exercise, but in practice even a single proxy server is prone to getting blocked, because it still fetches from one IP. If you need a setup that handles thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP-blocked a lot by automatic location, usage, and bot-detection algorithms.
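For what it's worth, OpenURI can route a request through a single proxy via its :proxy option - a minimal sketch, with a placeholder proxy address:

    # proxy.example.com:8000 is a placeholder - substitute your own proxy endpoint
    response = URI.open(url, headers.merge(proxy: "http://proxy.example.com:8000"))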

Our rotating proxy server Proxies API provides a simple API that can solve all IP-blocking problems instantly.

  • With millions of high-speed rotating proxies located all over the world,
  • With our automatic IP rotation,
  • With our automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions),
  • With our automatic CAPTCHA solving technology,

hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, since we render JavaScript behind the scenes. You can just get the data and parse it in any language like Node or PHP, or using any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
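In Ruby, that call is just another OpenURI fetch. A sketch - API_KEY is a placeholder for your own key, and the target URL is percent-encoded with CGI.escape:

    require 'open-uri'
    require 'cgi'

    api_key = "API_KEY"   # placeholder
    target  = "https://scholar.google.com/scholar?hl=en&q=transformers"

    html = URI.open(
      "http://api.proxiesapi.com/?key=#{api_key}&render=true&url=#{CGI.escape(target)}"
    ).read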
    
    

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
