Web Scraping Google Scholar in Elixir

Jan 21, 2024 · 6 min read

Google Scholar provides access to extensive academic literature across disciplines. Fortunately, they provide a straightforward web interface we can scrape to leverage their search capabilities from within our own Elixir applications. In this guide, we'll walk through a complete example of scraping Google Scholar search results.

This is the Google Scholar result page we are talking about…

Prerequisites

To follow along, you'll need:

  • Elixir 1.9+
  • Erlang/OTP 22+
  • An editor like VS Code
  • The HTTPoison (HTTP client) and Floki (HTML parser) libraries

    Add httpoison and floki to the deps in your mix.exs, then fetch them with:

    mix deps.get
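If you haven't declared the dependencies yet, they go in mix.exs. A minimal sketch — the version requirements here are approximate, so check hex.pm for the current releases:

```elixir
# In mix.exs — declare the two scraping dependencies.
defp deps do
  [
    {:httpoison, "~> 1.8"},  # HTTP client
    {:floki, "~> 0.32"}      # HTML parser
  ]
end
```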

    Scraping Overview

    Here's a high-level overview of what our scraper will do:

    1. Send an HTTP request to the Google Scholar search URL
    2. Parse the HTML content of the result
    3. Find result item elements in the DOM
    4. Extract fields like title, URL, authors, abstract
    5. Print out extracted fields

    So essentially we fetch the initial data, then parse and extract the specific pieces we want.
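The steps above assume a ready-made search URL. If you later want to vary the search term, the query string can be built with URI.encode_query from the standard library — a sketch, where build_url is a hypothetical helper and not part of the final scraper:

```elixir
# Build a Google Scholar search URL for an arbitrary query term.
# URI.encode_query/1 handles percent-encoding (e.g. "0,5" becomes "0%2C5").
defmodule UrlHelper do
  def build_url(query) do
    params = URI.encode_query(%{"hl" => "en", "as_sdt" => "0,5", "q" => query})
    "https://scholar.google.com/scholar?" <> params
  end
end

UrlHelper.build_url("transformers")
# contains "q=transformers" and "as_sdt=0%2C5"
```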

    Setting Up the HTTP Request

    Let's walk through the code one section at a time. We start with some standard Elixir module declarations:

    defmodule ScholarScraper do
      use HTTPoison.Base
    
      @url "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
    

    We use HTTPoison.Base to bring HTTP request functions such as get into our module, and we define the Google Scholar search URL with our search term "transformers".

    Next we'll define the HTTP headers to spoof a browser visit:

      defp headers do
        %{
          "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
        }
      end
    

    This headers function sets the User-Agent header so the request masquerades as a Chrome browser on Windows, which helps avoid being blocked as an obvious bot.
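A single hard-coded User-Agent is easy to fingerprint. One common refinement, not part of the code above, is to keep a small pool of strings and pick one per request. A sketch — the pool contents are examples only:

```elixir
# Pick a random User-Agent per request so traffic looks less uniform.
defmodule UserAgents do
  @pool [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15"
  ]

  def random_headers do
    %{"User-Agent" => Enum.random(@pool)}
  end
end
```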

    Making the Request and Parsing

    Inspecting the page

    If you inspect the results page in your browser's developer tools, you can see that each result item is enclosed in a div element with the class gs_ri.

    Now let's implement the function that will fetch the search results:

    def fetch_search_results do
      # get/2 comes from HTTPoison.Base; the second argument is the headers.
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} = get(@url, headers())

      case Floki.parse_document(body) do
        {:ok, document} ->
          search_results = Floki.find(document, "div.gs_ri")

          Enum.each(search_results, &extract_and_print/1)

        _ ->
          IO.puts("Failed to parse HTML content.")
      end
    end
    

    Breaking this down:

  • We make a GET request to the @url, passing our headers
  • Pattern match on a successful 200 response to extract the body
  • Use Floki.parse_document to try parsing the HTML body
  • On success, we find all div.gs_ri elements (the individual search results)
  • Iterate those results and pass each to our extract_and_print function

    So at this point, if all goes well, we have the DOM nodes representing the individual search result items.
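Note that the bare pattern match above crashes on anything other than a 200 response. If you want graceful handling, the result tuple can be classified explicitly. A sketch — the tuple shapes follow HTTPoison's {:ok, %HTTPoison.Response{}} / {:error, reason} convention, demonstrated here with plain maps:

```elixir
# Classify an HTTPoison-style result tuple instead of crashing on non-200s.
classify = fn
  {:ok, %{status_code: 200, body: body}} -> {:ok, body}
  {:ok, %{status_code: code}} -> {:error, {:http_status, code}}
  {:error, reason} -> {:error, reason}
end

classify.({:ok, %{status_code: 200, body: "<html></html>"}})
# {:ok, "<html></html>"}
classify.({:ok, %{status_code: 429, body: ""}})
# {:error, {:http_status, 429}}
```

A 429 here usually means Google Scholar is rate-limiting you, so it is worth distinguishing from a parse failure.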

    Extracting Search Result Fields

    Let's take a closer look at how data extraction works in extract_and_print:

    defp extract_and_print(result) do
      title = Floki.find(result, "h3.gs_rt") |> Floki.text() |> String.trim()

      # Floki.attribute/2 returns a list, so take the first href (or "" if none).
      url = Floki.find(result, "h3.gs_rt a") |> Floki.attribute("href") |> Enum.at(0, "")

      authors = Floki.find(result, "div.gs_a") |> Floki.text() |> String.trim()

      abstract = Floki.find(result, "div.gs_rs") |> Floki.text() |> String.trim()

      IO.puts("Title: #{title}")
      IO.puts("URL: #{url}")
      IO.puts("Authors: #{authors}")
      IO.puts("Abstract: #{abstract}")

      IO.puts(String.duplicate("-", 50))
    end
    

    The key things to understand here:

  • We use specific CSS selectors (h3.gs_rt, div.gs_a, div.gs_rs) to match elements within each result node
  • Floki.find returns the matching elements, and Floki.text extracts their text content
  • Floki.attribute extracts attribute values like href (as a list, so we take the first entry)
  • The |> operator chains these operations together
  • Each extracted value is trimmed of surrounding whitespace

    Printing this out gives us clean extracted fields for each search result item.
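The gs_a byline typically packs authors, venue, and year into one dash-separated string. If you want those pieces separately, a rough split works — a sketch, noting that this field layout is an assumption about Scholar's markup and may vary between results:

```elixir
# Split a Scholar "gs_a" byline into its dash-separated parts.
split_byline = fn byline ->
  byline
  |> String.split(" - ", parts: 3)
  |> Enum.map(&String.trim/1)
end

split_byline.("A Vaswani, N Shazeer - NeurIPS, 2017 - proceedings.neurips.cc")
# ["A Vaswani, N Shazeer", "NeurIPS, 2017", "proceedings.neurips.cc"]
```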

    Running the Scraper

    Finally, to execute everything, simply call the public function:

    ScholarScraper.fetch_search_results()
    

    The full code at this point:

    defmodule ScholarScraper do
      use HTTPoison.Base

      @url "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="

      defp headers do
        %{
          "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
        }
      end

      def fetch_search_results do
        {:ok, %HTTPoison.Response{status_code: 200, body: body}} = get(@url, headers())

        case Floki.parse_document(body) do
          {:ok, document} ->
            search_results = Floki.find(document, "div.gs_ri")
            Enum.each(search_results, &extract_and_print/1)
          _ ->
            IO.puts("Failed to parse HTML content.")
        end
      end

      defp extract_and_print(result) do
        title = Floki.find(result, "h3.gs_rt") |> Floki.text() |> String.trim()
        url = Floki.find(result, "h3.gs_rt a") |> Floki.attribute("href") |> Enum.at(0, "")
        authors = Floki.find(result, "div.gs_a") |> Floki.text() |> String.trim()
        abstract = Floki.find(result, "div.gs_rs") |> Floki.text() |> String.trim()

        IO.puts("Title: #{title}")
        IO.puts("URL: #{url}")
        IO.puts("Authors: #{authors}")
        IO.puts("Abstract: #{abstract}")
        IO.puts(String.duplicate("-", 50))
      end
    end
    
    # To run the scraper:
    ScholarScraper.fetch_search_results()

    This is great as a learning exercise, but any scraper that makes all of its requests from a single IP address is prone to getting blocked. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology,
  • Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes, and you can just get the data and parse it in any language like Node or PHP, or using any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
    
    

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
