Web Scraping Wikipedia with C#

Dec 6, 2023 · 8 min read

Introduction

In this article, we will take you through the process of scraping data from Wikipedia using C# and the HtmlAgilityPack library. Web scraping is a powerful technique for extracting information from websites, and it can be a valuable skill for data collection, analysis, and automation.

Scenario: Imagine you're working on a research project or just curious about the list of U.S. Presidents, and you want to collect data from Wikipedia to create your own dataset. We'll guide you through each step, providing explanations, tips, and tricks along the way.

This is the table we'll be targeting: the main sortable table of U.S. Presidents on the Wikipedia page.

Plan of Action

Step 1: Define the Goal

  • Reasoning: Before diving into code, it's essential to know what you want to achieve. In our case, we want to scrape the list of U.S. Presidents from the Wikipedia page and display their data.

Step 2: Set Up Your Environment

  • Reasoning: We need to ensure you have the necessary tools installed and ready to use.
  • Installation: You'll need Visual Studio (or any C# development environment) and the HtmlAgilityPack library. You can install HtmlAgilityPack via NuGet Package Manager.

Step 3: Choose the Website to Scrape

  • Reasoning: We're interested in scraping the Wikipedia page that lists U.S. Presidents. In our code, the URL is https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States.

Step 4: Simulate a Browser Request

  • Reasoning: Many websites block or limit access to automated scrapers, so we'll simulate a browser request by setting a user-agent header.
  • Code: We'll define a user-agent header in our C# code to make our request look like it's coming from a web browser.

Step 5: Load the HTML Content

  • Reasoning: To scrape data, we first need to retrieve the HTML content of the webpage.
  • Code: We'll use HtmlAgilityPack to load the HTML content from the URL.

Step 6: Find the Data

  • Reasoning: The data we want is in an HTML table on the page. We need to locate this table.
  • Code: We'll use XPath to find the table with a specific class name (wikitable sortable).

Step 7: Extract and Store the Data

  • Reasoning: We'll extract the data from the table and store it in a structured format for later use.
  • Code: We'll iterate through the table rows and columns, extracting the data and organizing it into lists.

Step 8: Display the Scraped Data

  • Reasoning: The final step is to display the scraped data to the user.
  • Code: We'll print the data to the console.

Detailed Instructions

    Now, let's dive into the details of each step.

    Step 2: Set Up Your Environment

    Before we start coding, make sure you have the following in place:

  • Visual Studio (or another C# development environment)
  • HtmlAgilityPack library (install it via NuGet Package Manager)
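    For example, you can add the package from the command line with the .NET CLI, or from the Package Manager Console inside Visual Studio:

    # .NET CLI
    dotnet add package HtmlAgilityPack

    # Or, in the Visual Studio Package Manager Console
    Install-Package HtmlAgilityPack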
Step 4: Simulate a Browser Request

    string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
    

    In our code, we define a user-agent header. This header tells the web server that our request is coming from a web browser, making it less likely to be blocked.

    Step 5: Load the HTML Content

    var web = new HtmlWeb
    {
        UserAgent = userAgent
    };
    var doc = web.Load(url);
    

    We use HtmlAgilityPack's HtmlWeb to load the HTML content from the specified URL.
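    Note that HtmlWeb.Load will typically throw an exception on network or HTTP failures rather than returning null, so in practice you may want to wrap the call in a try/catch. A minimal sketch:

    HtmlDocument doc = null;
    try
    {
        doc = web.Load(url);
    }
    catch (Exception ex)
    {
        // Network failures or HTTP errors surface as exceptions here
        Console.WriteLine("Failed to load the page: " + ex.Message);
    }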

    Step 6: Find the Data

    Inspecting the page

    When we inspect the page with the browser's developer tools, we can see that the table has the classes wikitable and sortable.

    var table = doc.DocumentNode.SelectSingleNode("//table[@class='wikitable sortable']");
    

    Here, we use XPath to locate the table with the class name 'wikitable sortable' on the page.
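    One caveat: this XPath only matches if the class attribute is exactly 'wikitable sortable'. Wikipedia sometimes adds extra classes to its tables, so a contains() match can be more resilient. A hedged alternative:

    // More tolerant: matches any table whose class list includes "wikitable"
    var table = doc.DocumentNode.SelectSingleNode("//table[contains(@class, 'wikitable')]");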

    Step 7: Extract and Store the Data

    var data = new List<List<string>>();
    
    foreach (var row in table.SelectNodes(".//tr").Skip(1)) // ".//tr" also matches rows nested inside <tbody>; Skip(1) drops the header row
    {
        var columns = row.SelectNodes("td | th");
    
        if (columns != null)
        {
            var row_data = new List<string>();
            foreach (var col in columns)
            {
                row_data.Add(col.InnerText.Trim());
            }
            data.Add(row_data);
        }
    }
    

    We iterate through the table rows and columns, extracting the data and storing it in a structured format (a list of lists).
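    If you'd rather work with strongly typed rows than a List<List<string>>, you could map each row into a small class. The President class below and its column indices are illustrative (our own naming, tied to the table layout at the time of writing), not part of the original code:

    // Illustrative strongly typed row
    class President
    {
        public string Number { get; set; }
        public string Name { get; set; }
        public string Term { get; set; }
        public string Party { get; set; }
    }

    var presidents = new List<President>();
    foreach (var row_data in data)
    {
        if (row_data.Count > 5) // guard against short rows (e.g. rowspan artifacts)
        {
            presidents.Add(new President
            {
                Number = row_data[0],
                Name = row_data[2],
                Term = row_data[3],
                Party = row_data[5]
            });
        }
    }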

    Step 8: Display the Scraped Data

    foreach (var president_data in data)
    {
        Console.WriteLine("President Data:");
        Console.WriteLine("Number: " + president_data[0]);
        Console.WriteLine("Name: " + president_data[2]);
        Console.WriteLine("Term: " + president_data[3]);
        Console.WriteLine("Party: " + president_data[5]);
        Console.WriteLine("Election: " + president_data[6]);
        Console.WriteLine("Vice President: " + president_data[7]);
        Console.WriteLine();
    }
    

    We print the scraped data for all presidents to the console.
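    Note that the column indices (0, 2, 3, 5, 6, 7) assume the table layout at the time of writing; rows affected by rowspans can come out shorter. A defensive variant skips rows that don't have all the expected columns:

    foreach (var president_data in data)
    {
        if (president_data.Count < 8)
        {
            continue; // not enough columns; likely a rowspan artifact
        }
        // ... print the fields as above ...
    }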

    Practical Considerations and Challenges

  • Website Structure Changes: Websites frequently update their structure, which may break your scraping code. Regular maintenance is needed.
  • Legal and Ethical Considerations: Always respect a website's terms of service and robots.txt file. Scraping can be legally and ethically sensitive.
  • Data Cleaning: Scraped data may require cleaning and formatting, depending on your use case (see the sketch below).
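    For example, Wikipedia cells often carry footnote markers such as [a] or [12]. One illustrative cleanup using a regular expression (requires System.Text.RegularExpressions; the helper name is our own):

    using System.Text.RegularExpressions;

    // Remove bracketed footnote markers like [a] or [12] from a scraped cell
    static string CleanCell(string text)
    {
        return Regex.Replace(text, @"\[[^\]]*\]", "").Trim();
    }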
Next Steps

  • Extend this code to scrape data from other Wikipedia pages or different websites.
  • Learn more about XPath to fine-tune your element selection.
  • Explore data storage options (e.g., databases) for saving scraped data.
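    As a starting point for storage, here is a minimal CSV export of the scraped rows. It is naive (no quoting of cells containing commas), and presidents.csv is an arbitrary file name:

    using System.IO;
    using System.Linq;

    // Write each row as a comma-separated line
    File.WriteAllLines("presidents.csv", data.Select(row => string.Join(",", row)));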
Conclusion

    Web scraping with C# and HtmlAgilityPack is a valuable skill for data enthusiasts and researchers. By following this step-by-step guide, you've learned how to extract data from Wikipedia. Remember to always scrape responsibly and ethically, respecting website policies and legal regulations.

    Full Code: Here's the complete code for your reference:

    using System;
    using System.Collections.Generic;
    using System.Linq; // needed for Skip()
    using HtmlAgilityPack;
    
    namespace WikipediaScraper
    {
        class Program
        {
            static void Main(string[] args)
            {
                // Define the URL of the Wikipedia page
                string url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";
    
                // Define a user-agent header to simulate a browser request
                string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
    
                // Create an HtmlWeb instance with the user-agent header
                var web = new HtmlWeb
                {
                    UserAgent = userAgent
                };
    
                // Load the HTML content of the page
                var doc = web.Load(url);
    
                // Check if the request was successful
                if (doc != null)
                {
                    // Find the table with the specified class name
                    var table = doc.DocumentNode.SelectSingleNode("//table[@class='wikitable sortable']");
    
                    if (table != null)
                    {
                        // Initialize empty lists to store the table data
                        var data = new List<List<string>>();
    
                        // Iterate through the rows of the table
                        foreach (var row in table.SelectNodes(".//tr").Skip(1)) // ".//tr" also matches rows inside <tbody>; Skip(1) drops the header row
                        {
                            var columns = row.SelectNodes("td | th");
    
                            if (columns != null)
                            {
                                // Extract data from each column and append it to the data list
                                var row_data = new List<string>();
                                foreach (var col in columns)
                                {
                                    row_data.Add(col.InnerText.Trim());
                                }
                                data.Add(row_data);
                            }
                        }
    
                        // Print the scraped data for all presidents
                        foreach (var president_data in data)
                        {
                            Console.WriteLine("President Data:");
                            Console.WriteLine("Number: " + president_data[0]);
                            Console.WriteLine("Name: " + president_data[2]);
                            Console.WriteLine("Term: " + president_data[3]);
                            Console.WriteLine("Party: " + president_data[5]);
                            Console.WriteLine("Election: " + president_data[6]);
                            Console.WriteLine("Vice President: " + president_data[7]);
                            Console.WriteLine();
                        }
                    }
                    else
                    {
                        Console.WriteLine("Failed to find the table with the specified class name.");
                    }
                }
                else
                {
                    Console.WriteLine("Failed to retrieve the web page.");
                }
            }
        }
    }
    

    In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same browser!
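    A minimal sketch of user-agent rotation: keep a small pool of strings (the pool below is illustrative) and pick one at random before each request:

    var userAgents = new[]
    {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
    };
    var random = new Random();

    // Use a different user-agent for each request
    web.UserAgent = userAgents[random.Next(userAgents.Length)];
    var doc = web.Load(url);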

    If we get a little more advanced, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it's where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
