Web Scraping Google Scholar in Ruby

Jan 21, 2024 · 6 min read

In this post, we'll walk through a real-world Ruby script that scrapes search result data from Google Scholar. We'll go step-by-step to understand exactly how it works.

This is the Google Scholar result page we are talking about…

Overview

The goal of this script is straightforward: retrieve search result data from a Google Scholar query. This includes:

  • Title
  • URL
  • Authors
  • Abstract snippet

Rather than relying on an API (Google Scholar doesn't offer an official one), we'll request the result page's HTML directly and parse it.

Let's dive into the code!

Setup

First we require the libraries we need for making HTTP requests and parsing HTML:

    require 'nokogiri'
    require 'open-uri'
    

Nokogiri lets us extract data from HTML and XML in Ruby. We'll use it to parse Google's response.

OpenURI makes sending HTTP requests easy from Ruby.
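Before we touch Google's page, here is a tiny, self-contained taste of what Nokogiri does. The HTML fragment is made up for illustration, but it's shaped like the Scholar result blocks we'll meet later:

    require 'nokogiri'

    # A made-up fragment shaped like a Scholar result block
    snippet = '<div class="gs_ri"><h3 class="gs_rt"><a href="https://example.com">Hello</a></h3></div>'

    doc = Nokogiri::HTML(snippet)
    puts doc.css("h3.gs_rt").first.text    # => Hello
    puts doc.at("h3.gs_rt a")["href"]      # => https://example.com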

Defining the Request

Next we set up the URL and headers for our request:

    url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
    
    headers = {
      "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    }
    

The URL performs a Google Scholar search for "transformers".

The headers hash sets the User-Agent string - sending a browser-like User-Agent helps avoid getting the request blocked outright.
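If you want to search for something other than "transformers", one way to build the query string safely is URI.encode_www_form from Ruby's standard library. This is a small sketch; the parameter names mirror the URL above:

    require 'uri'

    query = "attention is all you need"   # any search phrase

    # hl = interface language, as_sdt = search type, q = the actual query
    params = URI.encode_www_form(hl: "en", as_sdt: "0,5", q: query)
    url = "https://scholar.google.com/scholar?#{params}"
    # => "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=attention+is+all+you+need"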

Making the Request

With the URL and headers ready, we use OpenURI to send the GET request:

    response = URI.open(url, headers)
    

We pass the URL along with our headers hash; OpenURI treats its string keys (here, User-Agent) as request headers.

This gives us back a response containing the raw HTML result of the Google Scholar search.
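The object OpenURI returns is an IO-like object (a StringIO or Tempfile) extended with OpenURI::Meta, so you can peek at a few details before parsing. A quick sketch:

    puts response.content_type       # e.g. "text/html"
    puts response.charset            # e.g. "utf-8"
    puts response.status.inspect     # e.g. ["200", "OK"]

    # Reading gives you the raw HTML as a String. Note that this consumes the IO,
    # so if you read here, pass `html` to Nokogiri later (or call response.rewind).
    html = response.read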

Checking the Response

Before parsing, we check that the request succeeded:

    if response.status == ["200", "OK"]
      # Parse HTML
    else
      puts "Failed to retrieve the page. Status code: #{response.status[0]}"
    end
    

A status code of 200 means success. Any other code likely means an error or blocked request.

We print a failure message in that case.
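One caveat: for non-2xx responses OpenURI doesn't return an object at all - it raises OpenURI::HTTPError - so the else branch above acts mostly as a safety net. To report blocked or failed requests gracefully, you can rescue the exception. A small sketch:

    begin
      response = URI.open(url, headers)
    rescue OpenURI::HTTPError => e
      # e.io is the error response; its status is an array such as ["403", "Forbidden"]
      puts "Failed to retrieve the page. Status code: #{e.io.status[0]}"
      exit
    end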

Parsing the HTML

Inspecting the page source, you can see that each result item is enclosed in a <div> element with the class gs_ri.

Now we can parse the HTML search results with Nokogiri:

    doc = Nokogiri::HTML(response)
    
    search_results = doc.css("div.gs_ri")
    

We initialize a Nokogiri doc from the HTML response.

The doc.css() method lets us use CSS selectors to extract data. Here we grab all <div> elements with class gs_ri, which contain the individual search result blocks.
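It's worth sanity-checking that the selector actually matched something. If Google served a CAPTCHA or block page instead of results, the NodeSet will simply be empty:

    puts "Found #{search_results.size} result blocks"   # typically 10 per results page

    if search_results.empty?
      puts "No div.gs_ri elements found - the page may be a CAPTCHA or block page"
    end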

Extracting Search Result Data

With the search result elements selected, we can extract fields:

    search_results.each do |result|
    
      title_elem = result.css("h3.gs_rt").first
      title = title_elem&.text || "N/A"
    
      url = title_elem&.at("a")&.attr("href") || "N/A"
    
      authors_elem = result.css("div.gs_a").first
      authors = authors_elem&.text || "N/A"
    
      abstract_elem = result.css("div.gs_rs").first
      abstract = abstract_elem&.text || "N/A"
    
      # Print output
    end
    

We loop through each result block.

The key part is using CSS selectors to extract elements, then getting text or attributes from those.

For example:

  • result.css("h3.gs_rt") selects the title element
  • We take its .text content and read the href attribute from the <a> tag nested inside it
  • Fallbacks like || "N/A" handle missing data

This may look confusing at first! But when you break it down selector-by-selector, you can understand exactly how we extract each data field.
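The safe-navigation operator (&.) is what keeps the script from crashing when a field is missing: it returns nil instead of raising NoMethodError, and the || fallback then substitutes "N/A". A standalone illustration:

    missing = result.css("div.does_not_exist").first   # => nil, no such element

    missing&.text            # => nil, instead of a NoMethodError
    missing&.text || "N/A"   # => "N/A"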

Printing Output

Finally, we print the extracted data:

    puts "Title: #{title}"
    puts "URL: #{url}"
    puts "Authors: #{authors}"
    puts "Abstract: #{abstract}"
    puts "-" * 50 # Separator
    

This outputs each search result's data to the console.
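If you'd rather keep the data than just print it, one variation is to map each block into a hash and dump the whole array as JSON. This is a sketch using the same selectors as above; the output filename is arbitrary:

    require 'json'

    results = search_results.map do |result|
      title_elem = result.css("h3.gs_rt").first
      {
        title:    title_elem&.text || "N/A",
        url:      title_elem&.at("a")&.attr("href") || "N/A",
        authors:  result.css("div.gs_a").first&.text || "N/A",
        abstract: result.css("div.gs_rs").first&.text || "N/A"
      }
    end

    File.write("scholar_results.json", JSON.pretty_generate(results))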

Full Code

For easy reference, here is the complete script:

    require 'nokogiri'
    require 'open-uri'
    
    # Define the URL of the Google Scholar search page
    url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
    
    # Define a User-Agent header
    headers = {
      "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"  # Replace with your User-Agent string
    }
    
    # Send a GET request to the URL with the User-Agent header
    response = URI.open(url, headers)
    
    # Check if the request was successful (status code 200)
    if response.status == ["200", "OK"]
      # Parse the HTML content of the page using Nokogiri
      doc = Nokogiri::HTML(response)
    
      # Find all the search result blocks with class "gs_ri"
      search_results = doc.css("div.gs_ri")
    
      # Loop through each search result block and extract information
      search_results.each do |result|
        # Extract the title and URL
        title_elem = result.css("h3.gs_rt").first
        title = title_elem&.text || "N/A"
        url = title_elem&.at("a")&.attr("href") || "N/A"
    
        # Extract the authors and publication details
        authors_elem = result.css("div.gs_a").first
        authors = authors_elem&.text || "N/A"
    
        # Extract the abstract or description
        abstract_elem = result.css("div.gs_rs").first
        abstract = abstract_elem&.text || "N/A"
    
        # Print the extracted information
        puts "Title: #{title}"
        puts "URL: #{url}"
        puts "Authors: #{authors}"
        puts "Abstract: #{abstract}"
        puts "-" * 50  # Separating search results
      end
    else
      puts "Failed to retrieve the page. Status code: #{response.status[0]}"
    end

This is great as a learning exercise, but in practice even a single proxy server is prone to getting blocked, because it still fetches from one IP. If you need a setup that handles thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP-blocked a lot by automatic location, usage, and bot-detection algorithms.
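For what it's worth, OpenURI can route a request through a single proxy via its :proxy option - a minimal sketch, with a placeholder proxy address:

    # proxy.example.com:8000 is a placeholder - substitute your own proxy endpoint
    response = URI.open(url, headers.merge(proxy: "http://proxy.example.com:8000"))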

Our rotating proxy server Proxies API provides a simple API that can solve all IP-blocking problems instantly.

  • With millions of high-speed rotating proxies located all over the world,
  • With our automatic IP rotation,
  • With our automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions),
  • With our automatic CAPTCHA solving technology,

hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, since we render JavaScript behind the scenes. You can just get the data and parse it in any language like Node or PHP, or using any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
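In Ruby, that call is just another OpenURI fetch. A sketch - API_KEY is a placeholder for your own key, and the target URL is percent-encoded with CGI.escape:

    require 'open-uri'
    require 'cgi'

    api_key = "API_KEY"   # placeholder
    target  = "https://scholar.google.com/scholar?hl=en&q=transformers"

    html = URI.open(
      "http://api.proxiesapi.com/?key=#{api_key}&render=true&url=#{CGI.escape(target)}"
    ).read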
    
    

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
