Scraping Hacker News in C#

Jan 21, 2024 · 9 min read

Web scraping is the process of extracting data from websites automatically. This is useful when the site you want data from doesn't have an official API or database export option. For learning purposes, websites like Hacker News make great scraping targets since they have well-structured data.

In this beginner tutorial, we'll walk through a C# program that scrapes news articles from the Hacker News homepage. We'll cover:

  • Importing required namespaces
  • Defining URLs and initializing HttpClient
  • Sending requests and handling responses
  • Parsing HTML using HtmlAgilityPack
  • Selecting page elements using XPath
  • Extracting and printing article data

By the end, you'll have a solid grasp of how web scrapers work!

    Introduction to Hacker News

    Hacker News is a popular social news site focused on computer science and entrepreneurship. Users can submit links to articles, upvote submissions they like, and comment.

    Our program will scrape the front page, extracting details like article titles, scores, and authors. The goal is to demonstrate web scraping concepts, so we'll simply print the data rather than store or analyze it.

    With some small tweaks, you could adapt this scraper to any site that uses tables/list layouts like Hacker News.

    Let's get started!

    Namespaces for HTTP Requests and HTML Parsing

    A using directive lets our code refer to the types in a namespace without fully qualifying them. This scraper relies on two namespaces beyond System:

    using System.Net.Http;
    using HtmlAgilityPack;
    

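    System.Net.Http ships with .NET and contains HttpClient for making web requests.

    HtmlAgilityPack is a third-party library, installed as a NuGet package, that parses HTML and lets us query it with methods like LoadHtml() and XPath selectors.

    With these directives in place we can write HttpClient and HtmlDocument directly. The snippets in this tutorial also use Console from System and Task from System.Threading.Tasks.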

    The Program Class

    Our code goes inside a class named Program:

    class Program
    {
      // Code here
    }
    

    And specifically in the Main() method, the entry point that runs when the program starts:

    static async Task Main(string[] args)
    {
      // Code here
    }
    
  • static means Main() can be called without creating a Program instance
  • async means it can use await for asynchronous calls
  • Task is the return type

    This is a common pattern for C# command-line programs. Code goes inside Main().

    Defining the URL

    First we set the target URL to scrape:

    string url = "https://news.ycombinator.com/";
    

    This is the homepage URL for Hacker News. We could scrape any webpage by changing this URL.

    Initializing the HttpClient

    HttpClient is used to make HTTP requests in C#:

    using (HttpClient client = new HttpClient())
    {
      // Send request
    }
    

    This creates an HttpClient instance with the constructor call new HttpClient().

    Wrapping it in a using block ensures the client's network resources are disposed of when the block ends.
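
As an optional extension, a client can be configured before any request is sent. The sketch below is not part of the original scraper: CreateClient is a hypothetical helper, and the 10-second timeout and the "HnScraperDemo/1.0" User-Agent token are made-up values chosen for illustration.

```csharp
// Top-level statements (C# 9+). No request is sent; this only configures a client.
using System;
using System.Net.Http;

// Hypothetical helper: builds an HttpClient with a timeout and a User-Agent header.
HttpClient CreateClient()
{
    var configured = new HttpClient();

    // Give up on slow responses instead of waiting for the 100-second default.
    configured.Timeout = TimeSpan.FromSeconds(10);

    // Identify the scraper; the product token here is an invented example.
    configured.DefaultRequestHeaders.UserAgent.ParseAdd("HnScraperDemo/1.0");

    return configured;
}

using var client = CreateClient();
Console.WriteLine("Timeout (s): " + client.Timeout.TotalSeconds);
Console.WriteLine("User-Agent entries: " + client.DefaultRequestHeaders.UserAgent.Count);
```

A short timeout keeps a stalled connection from hanging the program, and a descriptive User-Agent makes the scraper easier for site operators to identify.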

    Sending the GET Request

    To fetch the web page HTML, we use HttpClient.GetAsync():

    HttpResponseMessage response = await client.GetAsync(url);
    

    We await the async call and store the response in an HttpResponseMessage object.

    Checking for Success

    It's good practice to verify requests succeeded before processing further:

    if (response.IsSuccessStatusCode)
    {
      // Process response
    }
    else
    {
      Console.WriteLine("Request failed!");
    }
    

    IsSuccessStatusCode is true for any status code in the 2xx range (200-299); a normal page load returns 200 OK.

    This avoids errors from trying to parse failed responses.
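
To see how this check behaves without touching the network, here is a small sketch that builds HttpResponseMessage objects by hand. The Describe helper is a hypothetical name introduced for illustration; the point is that any 2xx status counts as success, not only 200.

```csharp
// Top-level statements (C# 9+). Responses are built by hand, so nothing is sent.
using System;
using System.Net;
using System.Net.Http;

// Hypothetical helper: summarizes a response the way the scraper's if/else would.
string Describe(HttpResponseMessage response) =>
    response.IsSuccessStatusCode
        ? "OK (" + (int)response.StatusCode + ")"
        : "Failed (" + (int)response.StatusCode + ")";

using var ok = new HttpResponseMessage(HttpStatusCode.OK);
using var noContent = new HttpResponseMessage(HttpStatusCode.NoContent);
using var notFound = new HttpResponseMessage(HttpStatusCode.NotFound);

Console.WriteLine(Describe(ok));        // OK (200)
Console.WriteLine(Describe(noContent)); // OK (204)
Console.WriteLine(Describe(notFound));  // Failed (404)
```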

    Reading the Response HTML

    To access the HTML body, we extract the content as a string:

    string html = await response.Content.ReadAsStringAsync();
    

    The ReadAsStringAsync() method returns the entire response body.
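
The same call can be tried offline. In this sketch the response is constructed by hand with a StringContent body (an assumption made purely for the demo; in the scraper the response comes from GetAsync()):

```csharp
// Top-level statements (C# 9+). The response is hand-built; no request is sent.
using System;
using System.Net;
using System.Net.Http;

// A stand-in for what GetAsync() would return.
using var response = new HttpResponseMessage(HttpStatusCode.OK)
{
    Content = new StringContent("<html><body><p>Hello</p></body></html>")
};

// ReadAsStringAsync() reads the whole body and decodes it to a string.
string html = await response.Content.ReadAsStringAsync();

Console.WriteLine(html);
```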

    Loading the HTML into HtmlDocument

    Before we can query elements, the HTML needs parsing into a DOM document object:

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    

    LoadHtml() parses string HTML into an HtmlDocument representing the DOM tree.
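
A minimal sketch of the parsing step on a tiny hand-written page (the HTML string is invented for the demo). Once the document is loaded, queries start from DocumentNode, the root of the tree:

```csharp
// Top-level statements (C# 9+); requires the HtmlAgilityPack NuGet package.
using System;
using HtmlAgilityPack;

// A tiny hand-written page standing in for a real response body.
string html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// DocumentNode is the root of the parsed tree; XPath queries start from it.
HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
HtmlNode paragraph = doc.DocumentNode.SelectSingleNode("//p");

Console.WriteLine(title.InnerText);     // Demo
Console.WriteLine(paragraph.InnerText); // Hello
```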

    Finding Row Nodes

    Inspecting the page

    You can see that each item is housed inside a <tr> tag with the class athing.

    With the DOM loaded, we can use XPath queries to select elements:

    var rows = doc.DocumentNode
                  .SelectNodes("//tr");
    

    This selects every <tr> element anywhere in the document. These rows contain the main content. Note that SelectNodes() returns null, not an empty collection, when nothing matches.
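
One detail worth guarding against: when an XPath query matches nothing, SelectNodes() returns null instead of an empty collection, so a foreach over the result would throw. A small sketch on an invented two-row table:

```csharp
// Top-level statements (C# 9+); requires the HtmlAgilityPack NuGet package.
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<table><tr><td>one</td></tr><tr><td>two</td></tr></table>");

// Two <tr> elements exist, so we get a collection with two nodes.
var rows = doc.DocumentNode.SelectNodes("//tr");
int rowCount = rows?.Count ?? 0;

// No <li> elements exist, so the result is null rather than an empty collection.
var items = doc.DocumentNode.SelectNodes("//li");
int itemCount = items?.Count ?? 0;

Console.WriteLine(rowCount);      // 2
Console.WriteLine(itemCount);     // 0
Console.WriteLine(items == null); // True
```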

    Tracking State as We Iterate Rows

    Because upcoming logic depends on which row is current, we define flags to track state:

    HtmlNode currentArticle = null;
    string currentRowType = null;
    
  • currentArticle: the latest article node
  • currentRowType: whether current row is an "article" or "details" row

    These will be updated on each loop iteration.

    Looping Through Rows

    We can now iterate through the rows, identifying what each one contains:

    foreach (var row in rows)
    {
      // Check row type
    
      // If article row:
        // Extract article data from row & details row
    
      // Update current states
    }
    

    The approach:

    1. Classify each row
    2. If article row, extract data
    3. Update state variables for next iteration

    This takes advantage of the well-structured table layout.

    Identifying Article Rows

    We first check if a row contains an article using a class attribute:

    if (row.GetAttributeValue("class", "") == "athing")
    {
      // This is an article row
    
      currentArticle = row;
      currentRowType = "article";
    }
    

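    Article rows have the CSS class athing. On a match, we update the currentArticle and currentRowType state. Note that this is an exact string comparison, so a row whose class attribute carries additional class names would not match.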

    Identifying Details Rows

    The details about an article are in the next row:

    else if (currentRowType == "article")
    {
      // This is a details row
    
      // Extract details
    
      currentArticle = null;
      currentRowType = null;
    }
    

    If the state shows that the previous row was an article, the current row holds its additional details.

    We reset the state variables to null once the row is handled, ready for the next article.
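
The row-pairing logic can be exercised on its own. The sketch below runs it against a simplified, hand-written table (the real page has more cells and attributes). PairRows is a hypothetical helper, and it tracks state with currentArticle alone, since a non-null article already implies the previous row was an article row:

```csharp
// Top-level statements (C# 9+); requires the HtmlAgilityPack NuGet package.
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Simplified stand-in for the Hacker News table (assumption, not the real markup).
string sampleHtml = @"
<table>
  <tr class='athing'><td>First story</td></tr>
  <tr><td>10 points</td></tr>
  <tr class='spacer' style='height:5px'></tr>
  <tr class='athing'><td>Second story</td></tr>
  <tr><td>20 points</td></tr>
</table>";

// Hypothetical helper: pairs each article row with the row that follows it.
List<(string Article, string Details)> PairRows(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    var pairs = new List<(string Article, string Details)>();
    var rows = doc.DocumentNode.SelectNodes("//tr");
    if (rows == null) return pairs; // no rows at all

    HtmlNode currentArticle = null;
    foreach (var row in rows)
    {
        if (row.GetAttributeValue("class", "") == "athing")
        {
            currentArticle = row; // remember the article row
        }
        else if (currentArticle != null)
        {
            // The row right after an article row is its details row.
            pairs.Add((currentArticle.InnerText.Trim(), row.InnerText.Trim()));
            currentArticle = null; // reset for the next article
        }
        // Any other row (such as a spacer) is ignored.
    }
    return pairs;
}

foreach (var (article, details) in PairRows(sampleHtml))
    Console.WriteLine(article + " -> " + details);
```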

    Extracting Article Data

    Once we reach the details row, currentArticle still points at the article row, so we can pluck data from both using more specific XPath queries.

    Title

    var titleElem = currentArticle.SelectSingleNode(".//span[@class='title']");
    
    if (titleElem != null)
    {
      string title = titleElem.Element("a").InnerText;
    }
    

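    Here we:

    1. Select the <span class="title"> element under currentArticle
    2. Get the inner anchor (<a>) tag
    3. Retrieve its InnerText property to extract the text itself

    This gives us the article title. If titleElem comes back null on the live page, inspect the row in your browser's developer tools: Hacker News has adjusted its markup over time, so the span's class may differ from the one shown here.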

    URL

    Similarly for URL:
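
Mirroring the full listing at the end of this tutorial, the lookup can be sketched as follows. The one-row table is a simplified stand-in written for this demo, and the ?. guard is an addition that covers a title span with no anchor inside it:

```csharp
// Top-level statements (C# 9+); requires the HtmlAgilityPack NuGet package.
using System;
using HtmlAgilityPack;

// Simplified stand-in for one article row (assumption: real markup differs in detail).
var doc = new HtmlDocument();
doc.LoadHtml("<table><tr class='athing'><td><span class='title'>" +
             "<a href='https://example.com/story'>Example story</a>" +
             "</span></td></tr></table>");

HtmlNode currentArticle = doc.DocumentNode.SelectSingleNode("//tr[@class='athing']");
HtmlNode titleElem = currentArticle.SelectSingleNode(".//span[@class='title']");

string articleUrl = "";
if (titleElem != null)
{
    // Same anchor as the title, but we read the href attribute instead of InnerText.
    // Element("a") returns null when there is no <a> child, hence the ?. guard.
    articleUrl = titleElem.Element("a")?.GetAttributeValue("href", "") ?? "";
}

Console.WriteLine(articleUrl); // https://example.com/story
```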

    Instead of text, we get the anchor's href attribute value containing the link.

    Points, Author, Other Data

    The general pattern is:

    1. Use SelectSingleNode() to pinpoint an element with a unique selector
    2. Extract inner text or attributes from it

    For example:
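
A minimal sketch of that pattern, run against a hand-written details row (simplified; the real row has more links). The selectors match the full listing at the end of this tutorial, while TextOrDefault is a hypothetical helper added here so that a missing element yields a fallback instead of a NullReferenceException:

```csharp
// Top-level statements (C# 9+); requires the HtmlAgilityPack NuGet package.
using System;
using HtmlAgilityPack;

// Simplified stand-in for one details row (assumption, not the exact live markup).
var doc = new HtmlDocument();
doc.LoadHtml(@"<table><tr><td class='subtext'>
  <span class='score'>123 points</span> by <a class='hnuser'>alice</a>
  <span class='age' title='2024-01-21T10:00:00'>2 hours ago</span>
  | <a href='item?id=1'>45 comments</a>
</td></tr></table>");

// Hypothetical helper: inner text of the first match, or a fallback when nothing matches.
string TextOrDefault(HtmlNode parent, string xpath, string fallback)
{
    HtmlNode node = parent.SelectSingleNode(xpath);
    return node != null ? node.InnerText.Trim() : fallback;
}

HtmlNode subtext = doc.DocumentNode.SelectSingleNode("//td[@class='subtext']");

string points = TextOrDefault(subtext, ".//span[@class='score']", "0 points");
string author = TextOrDefault(subtext, ".//a[@class='hnuser']", "(unknown)");
string comments = TextOrDefault(subtext, ".//a[contains(text(),'comments')]", "0");

// The timestamp lives in an attribute, so we read it with GetAttributeValue().
string timestamp = subtext.SelectSingleNode(".//span[@class='age']")
                          ?.GetAttributeValue("title", "") ?? "";

Console.WriteLine(points);    // 123 points
Console.WriteLine(author);    // alice
Console.WriteLine(timestamp); // 2024-01-21T10:00:00
Console.WriteLine(comments);  // 45 comments
```

The fallbacks matter in practice because not every front-page row carries every field.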

    The key is uniquely identifying elements using classes, attribute filters, nested selection etc.

    Printing Extracted Data

    With data parsed from each article row and details row, we can print the results:
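
The output format below follows the Console.WriteLine calls in the full listing at the end of this tutorial. FormatArticle is a hypothetical helper that gathers them into one string, and the sample values are invented:

```csharp
// Top-level statements (C# 9+). Sample values stand in for scraped data.
using System;

// Hypothetical helper: formats one article the way the full listing prints it.
string FormatArticle(string title, string url, string points,
                     string author, string timestamp, string comments)
{
    return string.Join(Environment.NewLine,
        "Title: " + title,
        "URL: " + url,
        "Points: " + points,
        "Author: " + author,
        "Timestamp: " + timestamp,
        "Comments: " + comments,
        new string('-', 50)); // separator between articles
}

string output = FormatArticle("Example story", "https://example.com/story",
                              "123 points", "alice", "2024-01-21T10:00:00", "45 comments");
Console.WriteLine(output);
```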

    This outputs the scraped data to the console, and the loop moves on to the next article.

    Next Steps

    The full program is below for reference.

    With this foundation, you could:

  • Save data to a database, file or API
  • Add caching for faster performance
  • Expand to scrape additional fields
  • Generalize scraper to handle other sites

    The concepts learned here apply to almost any web scraping project!

    Full Hacker News Scraper Code

    Here is the complete code for reference:

    using System;
    using System.Linq;
    using System.Net.Http;
    using HtmlAgilityPack;
    
    class Program
    {
        static async System.Threading.Tasks.Task Main(string[] args)
        {
            // Define the URL of the Hacker News homepage
            string url = "https://news.ycombinator.com/";
    
            // Initialize HttpClient
            using (HttpClient client = new HttpClient())
            {
                // Send a GET request to the URL
                HttpResponseMessage response = await client.GetAsync(url);
    
                // Check if the request was successful (any 2xx status code)
                if (response.IsSuccessStatusCode)
                {
                    // Read the HTML content of the page
                    string htmlContent = await response.Content.ReadAsStringAsync();
    
                    // Load the HTML content into an HtmlDocument
                    HtmlDocument doc = new HtmlDocument();
                    doc.LoadHtml(htmlContent);
    
                    // Find all rows in the table
                    var rows = doc.DocumentNode.SelectNodes("//tr");
    
                    // Initialize variables to keep track of the current article and row type
                    HtmlNode currentArticle = null;
                    string currentRowType = null;
    
                    // Iterate through the rows to scrape articles
                    foreach (var row in rows)
                    {
                        if (row.GetAttributeValue("class", "") == "athing")
                        {
                            // This is an article row
                            currentArticle = row;
                            currentRowType = "article";
                        }
                        else if (currentRowType == "article")
                        {
                            // This is the details row
                            if (currentArticle != null)
                            {
                                // Extract information from the current article and details row
                                var titleElem = currentArticle.SelectSingleNode(".//span[@class='title']");
                                if (titleElem != null)
                                {
                                    string articleTitle = titleElem.Element("a").InnerText;
                                    string articleUrl = titleElem.Element("a").GetAttributeValue("href", "");
    
                                    var subtext = row.SelectSingleNode(".//td[@class='subtext']");
                                    // Some rows lack a score or author, so guard each lookup against null
                                    string points = subtext?.SelectSingleNode(".//span[@class='score']")?.InnerText ?? "N/A";
                                    string author = subtext?.SelectSingleNode(".//a[@class='hnuser']")?.InnerText ?? "N/A";
                                    string timestamp = subtext?.SelectSingleNode(".//span[@class='age']")?.GetAttributeValue("title", "") ?? "";
                                    var commentsElem = subtext?.SelectSingleNode(".//a[contains(text(),'comments')]");
                                    string comments = commentsElem != null ? commentsElem.InnerText : "0";
    
                                    // Print the extracted information
                                    Console.WriteLine("Title: " + articleTitle);
                                    Console.WriteLine("URL: " + articleUrl);
                                    Console.WriteLine("Points: " + points);
                                    Console.WriteLine("Author: " + author);
                                    Console.WriteLine("Timestamp: " + timestamp);
                                    Console.WriteLine("Comments: " + comments);
                                    Console.WriteLine(new string('-', 50)); // Separating articles
                                }
                            }
    
                            // Reset the current article and row type
                            currentArticle = null;
                            currentRowType = null;
                        }
                        else if (row.GetAttributeValue("style", "") == "height:5px")
                        {
                            // This is the spacer row, skip it
                            continue;
                        }
                    }
                }
                else
                {
                    Console.WriteLine("Failed to retrieve the page. Status code: " + (int)response.StatusCode);
                }
            }
        }
    }
    

    This is great as a learning exercise, but it is easy to see that the scraper itself is prone to getting blocked, since it uses a single IP address. If you need to handle thousands of fetches every day, a professional rotating proxy service that rotates IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server, Proxies API, provides a simple API designed to solve IP blocking problems. It offers:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA solving

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed through a simple API from any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just fetch the data and parse it in any language, such as Node.js or PHP, or with any framework, such as Scrapy or Nutch. In all these cases you can just call the URL with render support.

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
