Web Scraping Google Scholar in Rust

Jan 21, 2024 · 7 min read

Google Scholar is an invaluable resource for researching academic papers and articles across disciplines. The search engine provides detailed information on citations, related works, excerpts, and more for each result. However, there is no official API for programmatically accessing this data.

In this article, we'll use Rust to scrape and extract key fields from Google Scholar search result pages. Specifically, we'll cover:

  • Sending requests and processing responses
  • Parsing HTML content with selectors
  • Extracting titles, URLs, authors, and abstracts
  • Bringing the scraped data together

    We assume some prior experience with Rust.

    This is the Google Scholar result page we are talking about…

    Let's get started!

    Installation

    First, make sure your Rust toolchain is installed and up to date (rustup update upgrades an existing installation):

    rustup update
    

    Next, we need the reqwest and select crates for sending HTTP requests and parsing HTML:

    cargo add reqwest --features blocking
    cargo add select
    

    Declare these crates in your Rust code and bring the select types we'll use into scope:

    extern crate reqwest;
    extern crate select;

    use select::document::Document;
    use select::predicate::{Class, Name};


    Now we're ready to scrape!

    Sending the Request

    We first define the URL of a Google Scholar search results page to scrape:

    let url = "<https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=>";
    

    Note the query parameters: q= holds the search term (here transformers) and hl=en sets the interface language. Keep the string exactly as it appears in Scholar's own address bar.
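
    If you later want to search for something else, you don't have to hand-edit the literal: the url crate (re-exported by reqwest) can build and percent-encode the query string for you. A small sketch, where the search term is just an example:

    // Build the same kind of Scholar URL for a different query,
    // letting the url crate handle percent-encoding.
    let search_term = "attention is all you need"; // example query
    let url = reqwest::Url::parse_with_params(
        "https://scholar.google.com/scholar",
        &[("hl", "en"), ("as_sdt", "0,5"), ("q", search_term)],
    ).expect("valid URL");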

    Next we create a reqwest client and send a GET request:

    let client = reqwest::blocking::Client::new();
    
    let res = client.get(url)
        .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36")
        .send();
    

    We spoof a Chrome User-Agent string to avoid bot detection. Now we can check the response status:

    match res {
    
        Ok(response) => {
    
            if response.status().is_success() {
                 // Parse page
            }
    
            else {
                // Handle error
            }
        }
    
        Err(err) => {
            // Network error
        }
    }
    

    A 2xx status such as 200 means success; anything else indicates the page could not be retrieved and needs to be handled.
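
    A minimal way to fill in those branches is to log the status code or the network error, which is what the full program at the end of this article does:

    Ok(response) => {
        if response.status().is_success() {
            // Parse page (next section)
        } else {
            println!("Failed to retrieve the page. Status code: {}", response.status());
        }
    }
    Err(err) => {
        println!("Error: {}", err);
    }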

    Parsing the Page

    With a successful response, we access the HTML body:

    let body = response.text().unwrap();
    

    We pass this to the select crate's Document to parse:

    let document = Document::from(body.as_str());
    

    Now document contains a structured representation of elements, attributes, and text on the page. We can query this representation to extract data.
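
    As a quick sanity check, you can query for the page's <title> element before touching the result blocks (this uses the Document and Name imports shown earlier):

    // The <title> of the results page should mention the search query.
    if let Some(title_node) = document.find(Name("title")).next() {
        println!("Page title: {}", title_node.text());
    }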

    Extracting Title and URL

    Inspecting the page source, you can see that each result item is enclosed in a <div> element with the class gs_ri.

    Let's start with the title and URL using this selector:

    for node in document.find(Class("gs_ri")) {
    
    }
    

    This finds all elements with CSS class gs_ri - these contain individual search result blocks.

    Inside the loop, we extract the <h3> tag within:

    let title_elem = node.find(Name("h3")).next();
    

    We map this to the text content:

    let title = title_elem.map(|elem| elem.text())
        .unwrap_or("N/A".to_string());
    

    If no h3 tag matched, we default to "N/A". Finally, extract the href attribute for the URL:

    let url = title_elem.and_then(|elem| elem.attr("href"))
        .unwrap_or("N/A")
        .to_string();
    
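
    On Scholar's current markup the clickable link usually sits on an <a> element nested inside the h3 rather than on the h3 itself, so if attr("href") keeps coming back as "N/A" you can descend one level to the anchor. A sketch:

    // The href usually lives on the <a> inside the result's <h3>.
    let url = title_elem
        .and_then(|h3| h3.find(Name("a")).next())
        .and_then(|a| a.attr("href"))
        .unwrap_or("N/A")
        .to_string();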

    And we have the title and URL. We'll print these fields out at the end.

    Extracting Authors

    For authors, we select the gs_a element inside each result:

    let authors_elem = node.find(Class("gs_a")).next();
    

    Then map to text as before:

    let authors = authors_elem.map(|elem| elem.text())
        .unwrap_or("N/A".to_string());
    

    If no authors element is found, we again default to "N/A".
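
    The gs_a line typically packs author names, venue, and year into a single string separated by " - ", so if you only need the names you can split on that separator. A sketch based on the usual format (Scholar sometimes varies it):

    // gs_a often looks like "A Vaswani, N Shazeer… - Advances in neural…, 2017 - nips.cc"
    let author_names = authors.split(" - ").next().unwrap_or("N/A").trim().to_string();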

    Getting the Abstract

    Finally, for the excerpt/abstract (note that abstract is a reserved keyword in Rust, so we name the variable abstract_text):

    let abstract_elem = node.find(Class("gs_rs")).next();
    
    let abstract_text = abstract_elem.map(|elem| elem.text())
        .unwrap_or("N/A".to_string());
    

    We target the gs_rs element and extract text similarly.

    And that covers scraping all the key fields from a Google Scholar search result!

    Bringing the Data Together

    Below the extraction logic, we print out each field value for convenience:

    println!("Title: {}", title);
    println!("URL: {}", url);
    println!("Authors: {}", authors);
    println!("Abstract: {}", abstract);
    

    The full sequence extracts, processes, and prints the title, URL, authors line, and abstract snippet for each search result block on the first page (typically ten) of Google Scholar results, all in Rust!
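
    If you want to do more than print, you can collect the fields into a struct and push one instance per result block. The ScholarResult type below is just a sketch of ours, not part of any crate:

    // Hypothetical container for one search result.
    struct ScholarResult {
        title: String,
        url: String,
        authors: String,
        abstract_text: String,
    }

    let mut results: Vec<ScholarResult> = Vec::new();

    // Inside the for-loop over gs_ri blocks, instead of printing:
    // results.push(ScholarResult { title, url, authors, abstract_text });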

    While basic, this demonstrates core techniques for practical web scraping like:

  • Sending GET requests
  • Handling responses
  • Parsing HTML
  • Using precise selectors and data mappings
  • Defaulting missing values
  • Bringing scraped data together

    There is certainly more to cover, but these foundations should enable you to get started with scraping projects of your own!

    Full Code

    For easy reference, here is the complete Rust code covered in this article:

    
    extern crate reqwest;
    extern crate select;
    
    use select::document::Document;
    use select::predicate::{Class, Name};
    
    fn main() {
        // Define the URL of the Google Scholar search page
        let url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
    
        // Send a GET request to the URL
        let client = reqwest::blocking::Client::new();
        let res = client.get(url)
            .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36")
            .send();
    
        // Check if the request was successful
        match res {
            Ok(response) => {
                if response.status().is_success() {
                    // Parse the HTML content of the page
                    let body = response.text().unwrap();
                    let document = Document::from(body.as_str());
    
                    // Find all the search result blocks with class "gs_ri"
                    for node in document.find(Class("gs_ri")) {
                        // Extract the title and URL
                        let title_elem = node.find(Name("h3")).next();
                        let title = title_elem.map(|elem| elem.text()).unwrap_or("N/A".to_string());
                        let url = title_elem.and_then(|elem| elem.attr("href")).unwrap_or("N/A").to_string();
    
                        // Extract the authors and publication details
                        let authors_elem = node.find(Class("gs_a")).next();
                        let authors = authors_elem.map(|elem| elem.text()).unwrap_or("N/A".to_string());
    
                        // Extract the abstract or description
                        let abstract_elem = node.find(Class("gs_rs")).next();
                        let abstract_text = abstract_elem.map(|elem| elem.text()).unwrap_or("N/A".to_string());
    
                        // Print the extracted information
                        println!("Title: {}", title);
                        println!("URL: {}", url);
                        println!("Authors: {}", authors);
                        println!("Abstract: {}", abstract);
                        println!("{}", "-".repeat(50)); // Separating search results
                    }
                } else {
                    println!("Failed to retrieve the page. Status code: {}", response.status());
                }
            }
            Err(err) => {
                println!("Error: {}", err);
            }
        }
    }

    This is great as a learning exercise, but it is easy to see that a scraper like this is prone to getting blocked because it always uses a single IP. In a scenario where you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology,

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes; you can just get the data and parse it in any language or tool like Node, Puppeteer, or PHP, or with any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

    curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"
    
    
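    From a Rust program like the one above, the same call is just another GET request. A sketch assuming the endpoint format from the curl example, with API_KEY as a placeholder for your own key:

    // Hypothetical: fetch the Scholar results page through the rotating proxy API.
    let api_key = "API_KEY"; // placeholder
    let target = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
    let proxy_url = reqwest::Url::parse_with_params(
        "http://api.proxiesapi.com/",
        &[("key", api_key), ("url", target)],
    ).expect("valid URL");

    let body = reqwest::blocking::get(proxy_url)
        .and_then(|resp| resp.text())
        .expect("request failed");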

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
