Web Scraping Google Scholar in Elixir

Jan 21, 2024 · 6 min read

Google Scholar provides access to extensive academic literature across disciplines. Fortunately, they provide a straightforward web interface we can scrape to leverage their search capabilities from within our own Elixir applications. In this guide, we'll walk through a complete example of scraping Google Scholar search results.

This is the Google Scholar result page we are talking about…

Prerequisites

To follow along, you'll need:

  • Elixir 1.9+
  • Erlang/OTP 22+
  • An editor like VS Code
  • The HTTPoison (HTTP client) and Floki (HTML parser) libraries

    Add httpoison and floki to the deps in your mix.exs, then fetch them with:

    mix deps.get
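If you haven't declared the dependencies yet, they go in mix.exs. A minimal sketch — the version requirements here are approximate, so check hex.pm for the current releases:

```elixir
# In mix.exs — declare the two scraping dependencies.
defp deps do
  [
    {:httpoison, "~> 1.8"},  # HTTP client
    {:floki, "~> 0.32"}      # HTML parser
  ]
end
```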

    Scraping Overview

    Here's a high-level overview of what our scraper will do:

    1. Send an HTTP request to the Google Scholar search URL
    2. Parse the HTML content of the result
    3. Find result item elements in the DOM
    4. Extract fields like title, URL, authors, abstract
    5. Print out extracted fields

    So essentially we fetch the initial data, then parse and extract the specific pieces we want.
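The steps above assume a ready-made search URL. If you later want to vary the search term, the query string can be built with URI.encode_query from the standard library — a sketch, where build_url is a hypothetical helper and not part of the final scraper:

```elixir
# Build a Google Scholar search URL for an arbitrary query term.
# URI.encode_query/1 handles percent-encoding (e.g. "0,5" becomes "0%2C5").
defmodule UrlHelper do
  def build_url(query) do
    params = URI.encode_query(%{"hl" => "en", "as_sdt" => "0,5", "q" => query})
    "https://scholar.google.com/scholar?" <> params
  end
end

UrlHelper.build_url("transformers")
# contains "q=transformers" and "as_sdt=0%2C5"
```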

    Setting Up the HTTP Request

    Let's walk through the code one section at a time. We start with some standard Elixir module declarations:

    defmodule ScholarScraper do
      use HTTPoison.Base
    
      @url "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
    

    We use HTTPoison.Base to bring HTTP request functions such as get into our module, and we define the Google Scholar search URL with our search term "transformers".

    Next we'll define the HTTP headers to spoof a browser visit:

      defp headers do
        %{
          "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
        }
      end
    

    This headers function sets the User-Agent header so the request masquerades as a Chrome browser on Windows, which helps avoid being blocked as an obvious bot.
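A single hard-coded User-Agent is easy to fingerprint. One common refinement, not part of the code above, is to keep a small pool of strings and pick one per request. A sketch — the pool contents are examples only:

```elixir
# Pick a random User-Agent per request so traffic looks less uniform.
defmodule UserAgents do
  @pool [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15"
  ]

  def random_headers do
    %{"User-Agent" => Enum.random(@pool)}
  end
end
```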

    Making the Request and Parsing

    Inspecting the page

    If you inspect the results page in your browser's developer tools, you can see that each result item is enclosed in a div element with the class gs_ri.

    Now let's implement the function that will fetch the search results:

    def fetch_search_results do
      # get/2 comes from HTTPoison.Base; the second argument is the headers.
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} = get(@url, headers())

      case Floki.parse_document(body) do
        {:ok, document} ->
          search_results = Floki.find(document, "div.gs_ri")

          Enum.each(search_results, &extract_and_print/1)

        _ ->
          IO.puts("Failed to parse HTML content.")
      end
    end
    

    Breaking this down:

  • We make a GET request to the @url, passing our headers
  • Pattern match on a successful 200 response to extract the body
  • Use Floki.parse_document to try parsing the HTML body
  • On success, we find all div.gs_ri elements (the individual search results)
  • Iterate those results and pass each to our extract_and_print function

    So at this point, if all goes well, we have the DOM nodes representing the individual search result items.
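Note that the bare pattern match above crashes on anything other than a 200 response. If you want graceful handling, the result tuple can be classified explicitly. A sketch — the tuple shapes follow HTTPoison's {:ok, %HTTPoison.Response{}} / {:error, reason} convention, demonstrated here with plain maps:

```elixir
# Classify an HTTPoison-style result tuple instead of crashing on non-200s.
classify = fn
  {:ok, %{status_code: 200, body: body}} -> {:ok, body}
  {:ok, %{status_code: code}} -> {:error, {:http_status, code}}
  {:error, reason} -> {:error, reason}
end

classify.({:ok, %{status_code: 200, body: "<html></html>"}})
# {:ok, "<html></html>"}
classify.({:ok, %{status_code: 429, body: ""}})
# {:error, {:http_status, 429}}
```

A 429 here usually means Google Scholar is rate-limiting you, so it is worth distinguishing from a parse failure.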

    Extracting Search Result Fields

    Let's take a closer look at how data extraction works in extract_and_print:

    defp extract_and_print(result) do
      title = Floki.find(result, "h3.gs_rt") |> Floki.text() |> String.trim()

      # Floki.attribute/2 returns a list, so take the first href (or "" if none).
      url = Floki.find(result, "h3.gs_rt a") |> Floki.attribute("href") |> Enum.at(0, "")

      authors = Floki.find(result, "div.gs_a") |> Floki.text() |> String.trim()

      abstract = Floki.find(result, "div.gs_rs") |> Floki.text() |> String.trim()

      IO.puts("Title: #{title}")
      IO.puts("URL: #{url}")
      IO.puts("Authors: #{authors}")
      IO.puts("Abstract: #{abstract}")

      IO.puts(String.duplicate("-", 50))
    end
    

    The key things to understand here:

  • We use specific CSS selectors (h3.gs_rt, div.gs_a, div.gs_rs) to match elements within each result node
  • Floki.find returns the matching elements, and Floki.text extracts their text content
  • Floki.attribute extracts attribute values like href (as a list, so we take the first entry)
  • The |> operator chains these operations together
  • Each extracted value is trimmed of surrounding whitespace

    Printing this out gives us clean extracted fields for each search result item.
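The gs_a byline typically packs authors, venue, and year into one dash-separated string. If you want those pieces separately, a rough split works — a sketch, noting that this field layout is an assumption about Scholar's markup and may vary between results:

```elixir
# Split a Scholar "gs_a" byline into its dash-separated parts.
split_byline = fn byline ->
  byline
  |> String.split(" - ", parts: 3)
  |> Enum.map(&String.trim/1)
end

split_byline.("A Vaswani, N Shazeer - NeurIPS, 2017 - proceedings.neurips.cc")
# ["A Vaswani, N Shazeer", "NeurIPS, 2017", "proceedings.neurips.cc"]
```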

    Running the Scraper

    Finally, to execute everything, simply call the public function:

    ScholarScraper.fetch_search_results()
    

    The full code at this point:

    defmodule ScholarScraper do
      use HTTPoison.Base

      @url "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="

      defp headers do
        %{
          "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
        }
      end

      def fetch_search_results do
        {:ok, %HTTPoison.Response{status_code: 200, body: body}} = get(@url, headers())

        case Floki.parse_document(body) do
          {:ok, document} ->
            search_results = Floki.find(document, "div.gs_ri")
            Enum.each(search_results, &extract_and_print/1)
          _ ->
            IO.puts("Failed to parse HTML content.")
        end
      end

      defp extract_and_print(result) do
        title = Floki.find(result, "h3.gs_rt") |> Floki.text() |> String.trim()
        url = Floki.find(result, "h3.gs_rt a") |> Floki.attribute("href") |> Enum.at(0, "")
        authors = Floki.find(result, "div.gs_a") |> Floki.text() |> String.trim()
        abstract = Floki.find(result, "div.gs_rs") |> Floki.text() |> String.trim()

        IO.puts("Title: #{title}")
        IO.puts("URL: #{url}")
        IO.puts("Authors: #{authors}")
        IO.puts("Abstract: #{abstract}")
        IO.puts(String.duplicate("-", 50))
      end
    end
    
    # To run the scraper:
    ScholarScraper.fetch_search_results()

    This is great as a learning exercise, but any scraper that makes all of its requests from a single IP address is prone to getting blocked. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology,
  • Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes, and you can just get the data and parse it in any language like Node or PHP, or using any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
    
    

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
