Scraping Real Estate Listings from Realtor with CSharp

Jan 9, 2024 · 5 min read

Web scraping refers to programmatically extracting data from websites. Here we'll use C# and the handy HtmlAgilityPack library to scrape real estate listing data from Realtor.com.

We'll walk through a full working script, explain how each part functions, and pay special attention to the key data extraction logic. We assume some programming experience but aim to provide ample detail for beginners.

This is the listings page we are talking about…

Getting Set Up

Since we are using C#, you will need:

  • The .NET SDK installed
  • A C# IDE like Visual Studio
  • The HtmlAgilityPack NuGet package installed
  • To install this last dependency, search "HtmlAgilityPack" in the NuGet package manager interface in your IDE of choice. Once added to your project, you can access all functionality via the HtmlAgilityPack namespace.

    Now let's dive into the code!

    Walkthrough

    First we import namespaces:

    using System;
    using System.Net.Http;
    using HtmlAgilityPack;
    

    The key one here is HtmlAgilityPack which we just installed.

    Next we define our target URL and user agent string:

    string url = "<https://www.realtor.com/realestateandhomes-search/San-Francisco_CA>";
    
    string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
    

    I opted to scrape listings in San Francisco, but this could be any location search on Realtor.com.

    The user agent mimics a Chrome browser. Sites like Realtor will block suspicious requests lacking user agent info.

    We next construct an HttpClient and pass our user agent string:

    HttpClient httpClient = new HttpClient();
    
    httpClient.DefaultRequestHeaders.Add("User-Agent", userAgent);
    

    Now we can make the GET request to fetch the page HTML:

    HttpResponseMessage response = await httpClient.GetAsync(url);
    

    It's smart to check that the request succeeded before trying to parse:

    if (response.IsSuccessStatusCode)
    {
      // scraping logic here
    }
    else  {
       Console.WriteLine("Failed to retrieve the page. Status code: " + (int)response.StatusCode);
    }
    

    Okay, we have fetched the page HTML. Time to extract data!

    We load up an HtmlDocument using the HtmlAgilityPack library:

    string htmlContent = await response.Content.ReadAsStringAsync();
    
    HtmlDocument htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(htmlContent);
    

    Now the htmlDocument contains a parsed representation of the page DOM tree.

    Inspecting the element

    When we inspect element in Chrome we can see that each of the listing blocks is wrapped in a div with a class value as shown below…

    We can use XPath queries to pinpoint specific elements and extract inner text values. This is usually the trickiest part for beginners.

    Let's break it down step-by-step:

    var listingBlocks = htmlDocument.DocumentNode.SelectNodes("//div[contains(@class, 'BasePropertyCard_propertyCardWrap__J0xUj')]");
    

    This grabs all the high level div blocks for each listing. We identity them by a distinct class name.

    We then iterate through each listing:

    foreach (var listingBlock in listingBlocks)
    {
      // Extract data for this listing
    }
    

    Inside here we can apply more XPath queries against listingBlock to extract specifics.

    For example the broker name and company:

    var brokerInfo = listingBlock.SelectSingleNode(".//div[contains(@class, 'BrokerTitle_brokerTitle__ZkbBW')]");
    
    string brokerName = brokerInfo.SelectSingleNode(".//span[contains(@class, 'BrokerTitle_titleText__20u1P')]").InnerText.Trim();
    

    We grab the high level broker div, then drill into a span tag to get the name text.

    Some other examples:

    Price:

    string price = listingBlock.SelectSingleNode(".//div[contains(@class, 'card-price')]").InnerText.Trim();
    

    Address:

    string address = listingBlock.SelectSingleNode(".//div[contains(@class, 'card-address')]").InnerText.Trim();
    

    Beds and baths:

    string beds = listingBlock.SelectSingleNode(".//li[@data-testid='property-meta-beds']").InnerText.Trim();
    
    string baths = listingBlock.SelectSingleNode(".//li[@data-testid='property-meta-baths']").InnerText.Trim();
    

    Here we leveraged a data-testid attribute for uniqueness.

    At the end, we print the extracted data:

    Console.WriteLine("Broker: " + brokerName);
    // etc
    

    And that's the core of it! The full code:

    using System;
    using System.Net.Http;
    using HtmlAgilityPack;
    
    class Program
    {
        static async System.Threading.Tasks.Task Main(string[] args)
        {
            // Define the URL of the Realtor.com search page
            string url = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA";
    
            // Define a User-Agent header
            string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
    
            // Create an HttpClient with the User-Agent header
            HttpClient httpClient = new HttpClient();
            httpClient.DefaultRequestHeaders.Add("User-Agent", userAgent);
    
            // Send a GET request to the URL
            HttpResponseMessage response = await httpClient.GetAsync(url);
    
            // Check if the request was successful (status code 200)
            if (response.IsSuccessStatusCode)
            {
                // Parse the HTML content of the page using HtmlAgilityPack
                string htmlContent = await response.Content.ReadAsStringAsync();
                HtmlDocument htmlDocument = new HtmlDocument();
                htmlDocument.LoadHtml(htmlContent);
    
                // Find all the listing blocks using the provided class name
                var listingBlocks = htmlDocument.DocumentNode.SelectNodes("//div[contains(@class, 'BasePropertyCard_propertyCardWrap__J0xUj')]");
    
                // Loop through each listing block and extract information
                foreach (var listingBlock in listingBlocks)
                {
                    // Extract the broker information
                    var brokerInfo = listingBlock.SelectSingleNode(".//div[contains(@class, 'BrokerTitle_brokerTitle__ZkbBW')]");
                    string brokerName = brokerInfo.SelectSingleNode(".//span[contains(@class, 'BrokerTitle_titleText__20u1P')]").InnerText.Trim();
    
                    // Extract the status (e.g., For Sale)
                    string status = listingBlock.SelectSingleNode(".//div[contains(@class, 'message')]").InnerText.Trim();
    
                    // Extract the price
                    string price = listingBlock.SelectSingleNode(".//div[contains(@class, 'card-price')]").InnerText.Trim();
    
                    // Extract other details like beds, baths, sqft, and lot size
                    string beds = listingBlock.SelectSingleNode(".//li[@data-testid='property-meta-beds']").InnerText.Trim();
                    string baths = listingBlock.SelectSingleNode(".//li[@data-testid='property-meta-baths']").InnerText.Trim();
                    string sqft = listingBlock.SelectSingleNode(".//li[@data-testid='property-meta-sqft']").InnerText.Trim();
                    string lotSize = listingBlock.SelectSingleNode(".//li[@data-testid='property-meta-lot-size']").InnerText.Trim();
    
                    // Extract the address
                    string address = listingBlock.SelectSingleNode(".//div[contains(@class, 'card-address')]").InnerText.Trim();
    
                    // Print the extracted information
                    Console.WriteLine("Broker: " + brokerName);
                    Console.WriteLine("Status: " + status);
                    Console.WriteLine("Price: " + price);
                    Console.WriteLine("Beds: " + beds);
                    Console.WriteLine("Baths: " + baths);
                    Console.WriteLine("Sqft: " + sqft);
                    Console.WriteLine("Lot Size: " + lotSize);
                    Console.WriteLine("Address: " + address);
                    Console.WriteLine(new string('-', 50));  // Separating listings
                }
            }
            else
            {
                Console.WriteLine("Failed to retrieve the page. Status code: " + (int)response.StatusCode);
            }
        }
    }
    

    There are many optimizations we could make, but hopefully this gives a good overview of how to harness the power of HtmlAgilityPack and XPath to scrape webpages in C#.

    Browse by tags:

    Browse by language:

    Tired of getting blocked while scraping the web?

    ProxiesAPI handles headless browsers and rotates proxies for you.
    Get access to 1,000 free API credits, no credit card required!