Web Scraping Google Scholar in C#

Jan 21, 2024 · 9 min read

In this tutorial, we'll walk step-by-step through a C# program that scrapes search results data from Google Scholar.

This is the Google Scholar result page we are talking about…

Background on Web Scraping

Before we dive into the code, let's briefly discuss web scraping...

Web scraping is the automated extraction of data from websites. Rather than copying information by hand, a program sends HTTP requests to a site, downloads the returned HTML, and parses out the specific fields it needs. It is widely used for tasks like aggregating research results, monitoring prices, and collecting data that is not exposed through an official API.

Now that we have some background on web scraping, let's get to the code!

Importing Required Namespaces

We first import the following .NET namespaces that we'll need for making HTTP requests and parsing HTML:

using System;
using System.Net.Http;
using HtmlAgilityPack;

For beginners, these using statements allow us to access classes from the specified namespaces in our code without needing to fully qualify class names.

Defining the C# Program

Next, we define a class to contain our scraper code:

class Program
{

}

And inside that class is our Main method, which is the entry point when executing the program:

static async System.Threading.Tasks.Task Main(string[] args)
{

}

The async keyword here allows us to use await in our Main method to call asynchronous code.

Defining the Target URL

To scrape a website, we first need to define the URL of the page we want to extract data from.

In this case, we'll set the starting URL to a Google Scholar search results page:

// Define the URL of the Google Scholar search page
string url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";

Hard-coding the URL provides us a consistent starting point to extract data from on each run.

Setting the User Agent

Many websites try to detect and block scrapers by looking at the User-Agent header. So it's useful to spoof a real browser's user agent string:

// Define a User-Agent header
string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";

This helps us bypass blocks and access the site's data like a normal browser would.

Creating the HTTP Client

With the URL defined, we can now create an HTTP client to send requests:

// Create an HttpClient with the User-Agent header
HttpClient httpClient = new HttpClient();
httpClient.DefaultRequestHeaders.Add("User-Agent", userAgent);

This HttpClient will allow us to GET the Google Scholar page while spoofing a Chrome browser user agent.

Sending the Request

We use the HttpClient to send a GET request to the target search URL:

// Send a GET request to the URL
HttpResponseMessage response = await httpClient.GetAsync(url);

The await keyword pauses execution until the asynchronous request completes and returns the response.

Checking for Success

Before scraping the page, we should verify the request succeeded by checking the status code:

// Check if the request was successful (status code 200)
if (response.IsSuccessStatusCode)
{

}

IsSuccessStatusCode is true for any 2xx status code, such as 200, meaning our request completed successfully. We can now scrape the page's content inside this if block.
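
To make the check concrete, here is a minimal, network-free sketch using synthetic HttpResponseMessage objects (the 429 status is just an illustrative example of what a blocked scraper often receives):

```csharp
using System;
using System.Net;
using System.Net.Http;

class StatusCheckDemo
{
    static void Main()
    {
        // IsSuccessStatusCode is true for any status in the 2xx range
        var ok = new HttpResponseMessage(HttpStatusCode.OK);            // 200
        Console.WriteLine(ok.IsSuccessStatusCode);                      // True

        // 429 is what a rate-limited scraper often receives
        var blocked = new HttpResponseMessage(HttpStatusCode.TooManyRequests);
        Console.WriteLine(blocked.IsSuccessStatusCode);                 // False
    }
}
```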

Parsing the Page with HtmlAgilityPack

To extract information from the page HTML, we use the HtmlAgilityPack library. First we get the raw HTML content:

// Parse the HTML content of the page using HtmlAgilityPack
string htmlContent = await response.Content.ReadAsStringAsync();

And then load it into an HtmlDocument which allows XPath queries to find elements:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent);

HtmlAgilityPack parses the HTML into a traversable DOM document model.

Locating Search Result Blocks

Inspecting the page code, you can see that the items are enclosed in a div element with the class gs_ri.

With the HTML document loaded, we can now query for elements using XPath syntax.

To find all search result blocks, we locate div tags with the "gs_ri" CSS class:

// Find all the search result blocks with class "gs_ri"
var searchResults = doc.DocumentNode.SelectNodes("//div[contains(@class, 'gs_ri')]");

This SelectNodes call returns a collection of nodes matching those search result blocks.
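
One HtmlAgilityPack behavior worth guarding against, not shown in the snippet above: SelectNodes returns null rather than an empty collection when nothing matches, which is exactly what happens if Google Scholar serves a CAPTCHA or block page. A minimal offline sketch (the sample HTML is made up):

```csharp
using System;
using HtmlAgilityPack;

class SelectNodesGuard
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Deliberately load HTML with no gs_ri blocks to simulate a blocked page
        doc.LoadHtml("<html><body><p>Please verify you are human</p></body></html>");

        var searchResults = doc.DocumentNode.SelectNodes("//div[contains(@class, 'gs_ri')]");

        // SelectNodes returns null (not an empty list) when no nodes match
        if (searchResults == null)
        {
            Console.WriteLine("No result blocks found - possibly a block or CAPTCHA page.");
            return;
        }

        foreach (var result in searchResults)
            Console.WriteLine(result.InnerText);
    }
}
```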

Extracting Result Data

Inside a foreach loop, we iterate through each search result div.

Then, we use XPath queries to extract specific fields from within that block:

// Loop through each search result block
foreach (var result in searchResults)
{

  // Extract the title and URL
  var titleElem = result.SelectSingleNode(".//h3[@class='gs_rt']");

  // Extract authors
  var authorsElem = result.SelectSingleNode(".//div[@class='gs_a']");

  // Extract abstract
  var abstractElem = result.SelectSingleNode(".//div[@class='gs_rs']");

}

Let's look at each extraction selector one-by-one...

Extracting the Title and URL

To get the title link, we locate h3 elements with class gs_rt under the search result node:

var titleElem = result.SelectSingleNode(".//h3[@class='gs_rt']");

From there, we can read .InnerText for the title itself and call .GetAttributeValue() on the nested link to get its href attribute.

Since title elements can be missing, we fall back to "N/A" when the node is null:

string title = titleElem?.InnerText ?? "N/A";
string url = titleElem?.SelectSingleNode(".//a")?.GetAttributeValue("href", "N/A") ?? "N/A";

The null-conditional (?.) and null-coalescing (??) operators prevent a NullReferenceException when an element is missing.

Authors Extraction

For authors data, we query div tags with gs_a class:

var authorsElem = result.SelectSingleNode(".//div[@class='gs_a']");

Then we can simply get the .InnerText of that div as the author(s) info:

string authors = authorsElem?.InnerText ?? "N/A";

Again using the null-coalescing operator for safety.

Getting the Abstract

Abstracts are stored in div elements with a class of gs_rs:

var abstractElem = result.SelectSingleNode(".//div[@class='gs_rs']");

We grab the .InnerText just like with authors:

string abstractText = abstractElem?.InnerText ?? "N/A";

And that gives us the text summary of the paper!
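
One caveat not covered above: InnerText leaves HTML entities such as &amp; encoded. If you want decoded text, HtmlAgilityPack provides the HtmlEntity.DeEntitize helper. A small sketch with made-up HTML:

```csharp
using System;
using HtmlAgilityPack;

class EntityDecodeDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div class=\"gs_rs\">Attention is all you need &amp; more</div>");

        var abstractElem = doc.DocumentNode.SelectSingleNode("//div[@class='gs_rs']");

        string raw = abstractElem.InnerText;        // entities still encoded
        string clean = HtmlEntity.DeEntitize(raw);  // "&amp;" becomes "&"

        Console.WriteLine(raw);
        Console.WriteLine(clean);
    }
}
```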

Below this, we print out all the extracted fields. The key learning, though, is how we selected page elements to scrape each data point.

Printing Results & Error Handling

To finish up the web scraper code, we:

  1. Print out each extracted result field:
Console.WriteLine("Title: " + title);
Console.WriteLine("URL: " + url);
// etc
  2. Add error handling in case the request failed:
else
{
  Console.WriteLine("Failed to retrieve the page. Status code: " + response.StatusCode);
}

This handles cases like the site blocking the request or not responding.

  3. Close the Main method and class:
} // end Main method

} // end Program class

And that concludes the key parts of our Google Scholar web scraping program!

Full Code Listing

For reference, here is the complete code listing:

using System;
using System.Net.Http;
using HtmlAgilityPack;

class Program
{
    static async System.Threading.Tasks.Task Main(string[] args)
    {
        // Define the URL of the Google Scholar search page
        string url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";

        // Define a User-Agent header
        string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";

        // Create an HttpClient with the User-Agent header
        HttpClient httpClient = new HttpClient();
        httpClient.DefaultRequestHeaders.Add("User-Agent", userAgent);

        // Send a GET request to the URL
        HttpResponseMessage response = await httpClient.GetAsync(url);

        // Check if the request was successful (status code 200)
        if (response.IsSuccessStatusCode)
        {
            // Parse the HTML content of the page using HtmlAgilityPack
            string htmlContent = await response.Content.ReadAsStringAsync();
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(htmlContent);

            // Find all the search result blocks with class "gs_ri"
            var searchResults = doc.DocumentNode.SelectNodes("//div[contains(@class, 'gs_ri')]");

            // SelectNodes returns null when nothing matches, so guard before looping
            if (searchResults == null)
            {
                Console.WriteLine("No search results found on the page.");
                return;
            }

            // Loop through each search result block and extract information
            foreach (var result in searchResults)
            {
                // Extract the title and URL
                var titleElem = result.SelectSingleNode(".//h3[@class='gs_rt']");
                string title = titleElem?.InnerText ?? "N/A";
                string url = titleElem?.SelectSingleNode(".//a")?.GetAttributeValue("href", "N/A") ?? "N/A";

                // Extract the authors and publication details
                var authorsElem = result.SelectSingleNode(".//div[@class='gs_a']");
                string authors = authorsElem?.InnerText ?? "N/A";

                // Extract the abstract or description
                var abstractElem = result.SelectSingleNode(".//div[@class='gs_rs']");
                string abstractText = abstractElem?.InnerText ?? "N/A";

                // Print the extracted information
                Console.WriteLine("Title: " + title);
                Console.WriteLine("URL: " + url);
                Console.WriteLine("Authors: " + authors);
                Console.WriteLine("Abstract: " + abstractText);
                Console.WriteLine(new string('-', 50));  // Separating search results
            }
        }
        else
        {
            Console.WriteLine("Failed to retrieve the page. Status code: " + response.StatusCode);
        }
    }
}

This scraper extracts the title, URL, authors, and abstract text from Google Scholar search results.

Installing Required Libraries

To execute this C# web scraping code on your own machine:

  1. Install the .NET 6 SDK
  2. Install the HtmlAgilityPack NuGet package

And that should provide the necessary dependencies to run the program.
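
Assuming the .NET 6 SDK is already installed, a typical command-line setup looks like this (the project name ScholarScraper is just an example):

```shell
# Create a new console project (name is arbitrary)
dotnet new console -n ScholarScraper
cd ScholarScraper

# Add the HtmlAgilityPack HTML parser from NuGet
dotnet add package HtmlAgilityPack

# Paste the scraper code into Program.cs, then build and run
dotnet run
```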

Key Takeaways

In this lengthy tutorial, we stepped through a real-world example of web scraping search results from Google Scholar using C#:

  • We used HttpClient and HtmlAgilityPack to retrieve and parse page HTML
  • We selected page elements using XPath queries to extract specific fields
  • We handled cases of missing data and request failures
  • We printed out the scraped results for each paper

Hopefully this detailed overview helped explain how web scrapers are built using C#. There are many intricacies involved, but by seeing each part individually you can start building effective scrapers of your own!

    This is great as a learning exercise, but it is easy to see that a scraper like this is prone to getting blocked, since it uses a single IP. In a scenario where you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • Millions of high speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent-String rotation (simulating requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes, and you can just get the data and parse it in any language like Node or PHP, or using any framework like Scrapy or Nutch. In all these cases, you can just call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
    
    

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.

