Web Scraping Google Scholar in Rust

Jan 21, 2024 · 7 min read

Google Scholar is an invaluable resource for researching academic papers and articles across disciplines. The search engine provides detailed information on citations, related works, excerpts, and more for each result. However, there is no official API for programmatically accessing this data.

In this article, we'll use Rust to scrape and extract key fields from Google Scholar search result pages. Specifically, we'll cover:

  • Sending requests and processing responses
  • Parsing HTML content with selectors
  • Extracting titles, URLs, authors, and abstracts
  • Bringing the scraped data together

    We assume some prior experience with Rust.

    Let's get started!

    Installation

    First, make sure Rust is installed on your system:

    rustup update
    

    Next, we need the reqwest and select crates for sending HTTP requests and parsing HTML:

    cargo add reqwest --features blocking
    cargo add select
    

    In your Rust code, bring the types we need into scope (extern crate declarations are unnecessary in the 2018 edition and later):

    use select::document::Document;
    use select::predicate::{Class, Name};
    

    Now we're ready to scrape!

    Sending the Request

    We first define the URL of a Google Scholar search results page to scrape:

    let url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
    

    This URL encodes the query (q=transformers) along with Scholar's standard search parameters; keep the string exactly as shown.

    Next we create a reqwest client and send a GET request:

    let client = reqwest::blocking::Client::new();
    
    let res = client.get(url)
        .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36")
        .send();
    

    We spoof a Chrome User-Agent string to avoid bot detection. Now we can check the response status:

    match res {
    
        Ok(response) => {
    
            if response.status().is_success() {
                 // Parse page
            }
    
            else {
                // Handle error
            }
        }
    
        Err(err) => {
            // Network error
        }
    }
    

    A 2xx status like 200 means success. Anything else (for example a 403 or 429 triggered by bot detection) indicates a failure to retrieve the page content that needs handling.

    Parsing the Page

    With a successful response, we access the HTML body:

    let body = response.text().unwrap();
    

    We pass this to the select crate's Document to parse. Note that Document::from takes a &str, so we borrow the body:

    let document = Document::from(body.as_str());
    

    Now document contains a structured representation of elements, attributes, and text on the page. We can query this representation to extract data.

    Extracting Title and URL

    Inspecting the page source, you can see that each result item is enclosed in a div element with the class gs_ri.

    Let's start with the title and URL using this selector:

    for node in document.find(Class("gs_ri")) {
    
    }
    

    This finds all elements with CSS class gs_ri - these contain individual search result blocks.

    Inside the loop, we extract the h3 heading tag within:

    let title_elem = node.find(Name("h3")).next();
    

    We map this to the text content:

    let title = title_elem.map(|elem| elem.text())
        .unwrap_or("N/A".to_string());
    

    If no h3 tag matched, default to "N/A". Note that attr returns an Option<&str> and the href actually lives on the anchor tag inside the h3, so we drill down to it for the URL:

    let url = title_elem
        .and_then(|elem| elem.find(Name("a")).next())
        .and_then(|a| a.attr("href"))
        .unwrap_or("N/A")
        .to_string();
    
    

    And we have title and URL! We would print these fields out.
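    One thing to watch: Scholar sometimes emits scheme-relative or site-relative hrefs rather than full URLs. A small std-only helper can make them absolute; the base host used here is an assumption for illustration, not something the page guarantees:

```rust
// Make a possibly-relative href absolute. The scholar.google.com base
// is an assumption for illustration; adjust it to the page you scraped.
fn absolutize(href: &str) -> String {
    if href.starts_with("http://") || href.starts_with("https://") {
        href.to_string()
    } else if href.starts_with("//") {
        // Scheme-relative: //host/path
        format!("https:{}", href)
    } else if href.starts_with('/') {
        // Site-relative: /path
        format!("https://scholar.google.com{}", href)
    } else {
        href.to_string()
    }
}

fn main() {
    println!("{}", absolutize("/citations?user=abc"));
    println!("{}", absolutize("https://example.com/paper.pdf"));
}
```

    You would run the extracted url through such a helper before storing or printing it.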

    Extracting Authors

    For authors, we select the gs_a element inside each result:

    let authors_elem = node.find(Class("gs_a")).next();
    

    Then map to text as before:

    let authors = authors_elem.map(|elem| elem.text())
        .unwrap_or("N/A".to_string());
    

    If no author, again default to "N/A".
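    The gs_a text usually packs authors, venue, and year into one line separated by " - " (for example "A Vaswani, N Shazeer - NeurIPS, 2017 - proceedings.neurips.cc"); that separator is an observation from inspected pages, not a guarantee. A std-only sketch that splits the authors off the front:

```rust
// Split a Scholar gs_a byline of the rough form
// "authors - venue, year - publisher" into authors and the rest.
// The " - " separator is an assumption based on inspected pages.
fn split_byline(byline: &str) -> (String, String) {
    let mut parts = byline.splitn(2, " - ");
    let authors = parts.next().unwrap_or("N/A").trim().to_string();
    let rest = parts.next().unwrap_or("N/A").trim().to_string();
    (authors, rest)
}

fn main() {
    let (authors, rest) =
        split_byline("A Vaswani, N Shazeer - NeurIPS, 2017 - proceedings.neurips.cc");
    println!("authors: {}", authors);
    println!("venue/publisher: {}", rest);
}
```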

    Getting the Abstract

    Finally, for the excerpt/abstract (note that abstract is a reserved keyword in Rust, so we name the variable abstract_text):

    let abstract_elem = node.find(Class("gs_rs")).next();
    
    let abstract_text = abstract_elem.map(|elem| elem.text())
        .unwrap_or("N/A".to_string());
    
    

    We target the gs_rs element and extract text similarly.
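    The extracted excerpt text can contain stray newlines and doubled spaces from the HTML. A std-only cleanup pass (the helper name is ours) keeps it on one line:

```rust
// Collapse runs of whitespace (newlines, tabs, doubled spaces)
// in scraped text into single spaces.
fn normalize_ws(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn main() {
    let raw = "Attention is\nall   you\tneed";
    println!("{}", normalize_ws(raw));
}
```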

    And that covers scraping all the key fields from a Google Scholar search result!

    Bringing the Data Together

    Below the extraction logic, we print out each field value for convenience:

    println!("Title: {}", title);
    println!("URL: {}", url);
    println!("Authors: {}", authors);
    println!("Abstract: {}", abstract_text);
    

    The full sequence extracts, processes, and prints the title, URL, authors list, and abstract snippet for each search result block on the page (typically 10 per page) from Google Scholar with Rust!
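    Rather than only printing, you could gather each result into a struct and collect them into a Vec for later processing. A minimal std-only sketch (the ScholarResult name is ours, not part of any library):

```rust
// A simple container for one scraped Scholar result.
#[derive(Debug)]
struct ScholarResult {
    title: String,
    url: String,
    authors: String,
    abstract_text: String, // `abstract` is a reserved word in Rust
}

fn main() {
    let mut results: Vec<ScholarResult> = Vec::new();
    // Inside the scraping loop you would push one entry per gs_ri block;
    // here we push a hand-made example.
    results.push(ScholarResult {
        title: "Attention is all you need".to_string(),
        url: "N/A".to_string(),
        authors: "A Vaswani, N Shazeer".to_string(),
        abstract_text: "N/A".to_string(),
    });
    println!("scraped {} result(s)", results.len());
}
```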

    While basic, this demonstrates core techniques for practical web scraping like:

  • Sending GET requests
  • Handling responses
  • Parsing HTML
  • Using precise selectors and data mappings
  • Defaulting missing values
  • Bringing scraped data together

    There is certainly more to cover, but these foundations should enable you to get started with scraping projects of your own!

    Full Code

    For easy reference, here is the complete Rust code covered in this article:

    
    use select::document::Document;
    use select::predicate::{Class, Name};
    
    fn main() {
        // Define the URL of the Google Scholar search page
        let url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
    
        // Send a GET request to the URL
        let client = reqwest::blocking::Client::new();
        let res = client.get(url)
            .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36")
            .send();
    
        // Check if the request was successful
        match res {
            Ok(response) => {
                if response.status().is_success() {
                    // Parse the HTML content of the page
                    let body = response.text().unwrap();
                    let document = Document::from(body.as_str());
    
                    // Find all the search result blocks with class "gs_ri"
                    for node in document.find(Class("gs_ri")) {
                        // Extract the title and URL
                        let title_elem = node.find(Name("h3")).next();
                        let title = title_elem.map(|elem| elem.text()).unwrap_or("N/A".to_string());
                        let url = title_elem
                            .and_then(|elem| elem.find(Name("a")).next())
                            .and_then(|a| a.attr("href"))
                            .unwrap_or("N/A")
                            .to_string();
    
                        // Extract the authors and publication details
                        let authors_elem = node.find(Class("gs_a")).next();
                        let authors = authors_elem.map(|elem| elem.text()).unwrap_or("N/A".to_string());
    
                        // Extract the abstract or description
                        let abstract_elem = node.find(Class("gs_rs")).next();
                        let abstract_text = abstract_elem.map(|elem| elem.text()).unwrap_or("N/A".to_string());
    
                        // Print the extracted information
                        println!("Title: {}", title);
                        println!("URL: {}", url);
                        println!("Authors: {}", authors);
                        println!("Abstract: {}", abstract_text);
                        println!("{}", "-".repeat(50)); // Separating search results
                    }
                } else {
                    println!("Failed to retrieve the page. Status code: {}", response.status());
                }
            }
            Err(err) => {
                println!("Error: {}", err);
            }
        }
    }

    This is great as a learning exercise, but even a single proxy server is prone to getting blocked, since it uses one IP. If you need to handle thousands of fetches every day, a professional rotating proxy service that rotates IPs for you is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology,

    hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes; you can just get the data and parse it in any language like Node or PHP, or using any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
    
    

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
