Scraping Yelp Business Listings using CSharp

Dec 6, 2023 · 7 min read

Yelp is one of the largest crowdsourced review sites, with over 200 million reviews of local businesses around the world. The depth of data on Yelp (ratings, price levels, photos etc.) makes it an attractive target for scraping. You may want to gather and analyze Yelp data for market research, lead generation, competitor analysis, or a custom business directory.


However, consumer sites like Yelp actively block scraping bots to prevent data theft. That's where premium proxies come in handy...

Using Premium Proxies to Bypass Yelp Blocks

Like most large sites, Yelp utilizes advanced anti-scraping mechanisms to detect bots and block IP addresses making too many requests. Trying to scrape Yelp straight from your own IP would fail pretty quickly.

ProxiesAPI offers constantly rotating premium residential IPs from around the world. By routing our requests through ProxiesAPI instead of our own IP, we can imitate organic human traffic patterns and bypass blocks. It's an essential technique for successfully scraping guarded sites like Yelp at scale.

Okay, with that primer out of the way, let's dive into the code...

Importing Required Packages

We'll utilize HtmlAgilityPack to parse HTML and pull data from page elements:

using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

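HttpClient handles our web requests. We create a static instance to reuse across requests. Because we later send an Accept-Encoding header, the handler is set to decompress responses automatically; otherwise a compressed body would come back as unreadable bytes (DecompressionMethods.All assumes .NET Core 3.0 or later):

static readonly HttpClient client = new HttpClient(new HttpClientHandler
{
  AutomaticDecompression = System.Net.DecompressionMethods.All
});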

Crafting Our Yelp Search URL

We want to scrape Chinese restaurants in San Francisco. Yelp makes this easy with search filters:

string url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco,+CA";

To leverage ProxiesAPI, we pass our Yelp URL into the API and get back a proxied URL:

string api_url = $"http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url={Uri.EscapeDataString(url)}";

Make sure to use your own key. We URI-encode the Yelp URL to handle special characters properly.
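To make the encoding step concrete, here is a minimal sketch that wraps the URL construction in a helper. The names ProxyUrlDemo and BuildProxiedUrl are illustrative and not part of ProxiesAPI; the endpoint and query parameters are the ones shown above.

```csharp
using System;

static class ProxyUrlDemo
{
    // Hypothetical helper (not part of any library): builds the proxied URL
    // from the endpoint and parameters used in this article.
    public static string BuildProxiedUrl(string authKey, string targetUrl)
    {
        // EscapeDataString percent-encodes reserved characters such as
        // ':', '/', '?', '&', '=', '+' and ',', so the target URL's own
        // query string cannot be mistaken for parameters of the API call.
        return "http://api.proxiesapi.com/?premium=true"
             + "&auth_key=" + Uri.EscapeDataString(authKey)
             + "&url=" + Uri.EscapeDataString(targetUrl);
    }

    static void Main()
    {
        string url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco,+CA";

        // YOUR_AUTH_KEY is a placeholder; substitute your own key.
        Console.WriteLine(BuildProxiedUrl("YOUR_AUTH_KEY", url));
    }
}
```

Without the encoding, the &find_loc=... portion would be read as a separate parameter of the API request rather than as part of the Yelp URL.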

Configuring Request Headers

Yelp will spot a basic bot, so we spoof a Chrome browser visit by setting valid user-agent, language and other headers:

client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0...");

client.DefaultRequestHeaders.Add("Accept-Language", "en-US,en;q=0.5");
client.DefaultRequestHeaders.Add("Accept-Encoding", "gzip, deflate, br");
client.DefaultRequestHeaders.Add("Referer", "https://www.google.com/");

This tricks Yelp into serving us actual site content instead of error pages.
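One thing to keep in mind is that DefaultRequestHeaders.Add appends to whatever is already set on the shared client, so running the setup code more than once adds to the existing values instead of replacing them. A possible alternative, sketched below, is to attach the headers to each request. The names RequestHeaderDemo and BuildRequest are hypothetical, and the URL and user-agent string are placeholders.

```csharp
using System;
using System.Net.Http;

static class RequestHeaderDemo
{
    // Hypothetical helper: builds a GET request that carries browser-like headers.
    public static HttpRequestMessage BuildRequest(string requestUrl, string userAgent)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, requestUrl);

        // TryAddWithoutValidation skips strict header parsing, which some
        // real-world User-Agent strings can otherwise fail.
        request.Headers.TryAddWithoutValidation("User-Agent", userAgent);
        request.Headers.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.5");
        request.Headers.Referrer = new Uri("https://www.google.com/");
        return request;
    }

    static void Main()
    {
        // Placeholder values for illustration only; nothing is sent over the network.
        using (var request = BuildRequest("http://example.com/", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
        {
            foreach (var header in request.Headers)
            {
                Console.WriteLine($"{header.Key}: {string.Join(", ", header.Value)}");
            }
        }
    }
}
```

A request built this way would be sent with client.SendAsync(request) instead of client.GetAsync(api_url).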

Making the Initial Request

With our proxied API URL and mimicked browser headers, we can send the GET request:

HttpResponseMessage response = await client.GetAsync(api_url);

response.EnsureSuccessStatusCode();

string responseBody = await response.Content.ReadAsStringAsync();

We confirm we get a 2XX status code before reading the full HTML response body.

Parsing Listings with XPath

Now the fun part - extracting data! HtmlAgilityPack allows us to query elements using XPath syntax.

First we load the HTML:

var doc = new HtmlDocument();

doc.LoadHtml(responseBody);

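Inspecting the page

When we inspect the page in the browser's developer tools, we can see that each listing sits in a div with the classes arrange-unit__09f24__rqHTg arrange-unit-fill__09f24__CUubG css-1qn0b6x. These names look auto-generated, so they may change over time and should be re-checked before running the scraper.

We then grab all listing nodes using those class names: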

var listings = doc.DocumentNode.SelectNodes("//div[contains(@class, 'arrange-unit__09f24__rqHTg arrange-unit-fill__09f24__CUubG css-1qn0b6x')]");

Console.WriteLine(listings?.Count ?? 0);

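The leading // tells XPath to search the entire document, so //div finds matching div nodes at any depth. Note that contains() does a plain substring match on the class attribute, so the classes must appear in exactly this order. SelectNodes returns null when nothing matches, and the ?. operator keeps the Count call from throwing a null reference error.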

We print the number of listings found to confirm it works.
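Because contains() on a multi-class string depends on the exact order of the classes, a more tolerant variant is to match a single class as a whole token. The sketch below shows that idiom on made-up markup (not Yelp's real HTML); ClassMatchDemo and CountDivsWithClass are hypothetical names, and the code assumes the HtmlAgilityPack NuGet package is installed.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class ClassMatchDemo
{
    // Hypothetical helper: counts <div> elements that carry the given class token,
    // regardless of which other classes are present or in what order.
    public static int CountDivsWithClass(string html, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pad the class attribute with spaces so only whole tokens match.
        string xpath = $"//div[contains(concat(' ', normalize-space(@class), ' '), ' {className} ')]";

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes?.Count ?? 0;
    }

    static void Main()
    {
        // Made-up markup for illustration only.
        string html = "<div class='card listing'>A</div>"
                    + "<div class='listing card wide'>B</div>"
                    + "<div class='listing-footer'>C</div>";

        Console.WriteLine(CountDivsWithClass(html, "listing")); // prints 2
        Console.WriteLine(CountDivsWithClass(html, "missing")); // prints 0
    }
}
```

The third div is not counted because listing-footer only contains the word as a substring, which is the kind of false match a plain contains(@class, ...) would allow.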

Extracting Listing Data

With a list of DOM nodes in hand, we iterate through to extract info:

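// SelectNodes returns null when nothing matched, so stop before looping
if (listings == null)
{
  Console.WriteLine("No listings found.");
  return;
}

foreach (var listing in listings)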
{
  // Get business name
  var businessNameNode = listing.SelectSingleNode(".//a[contains(@class, 'css-19v1rkv')]");

  string businessName = businessNameNode != null ? businessNameNode.InnerText.Trim() : "N/A";

  // Get rating
  var ratingNode = listing.SelectSingleNode(".//span[contains(@class, 'css-gutk1c')]");

  string rating = ratingNode != null ? ratingNode.InnerText.Trim() : "N/A";

  // Get price range
  var priceRangeNode = listing.SelectSingleNode(".//span[contains(@class, 'priceRange__09f24__mmOuH')]");

  string priceRange = priceRangeNode != null ? priceRangeNode.InnerText.Trim() : "N/A";

  // Get review count and location
  string numReviews = "N/A";
  string location = "N/A";

  var spanElements = listing.SelectNodes(".//span[contains(@class, 'css-chan6m')]");

  if (spanElements != null)
  {
     // Logic to handle multiple layouts
     ...
  }

  // Print extracted data
  Console.WriteLine($"Name: {businessName}");
  Console.WriteLine($"Rating: {rating}");
  ...
}

The key is crafting XPath queries specific to each data field, handling when nodes don't exist, dealing with multiple potential markup layouts, and printing the scraped attributes.
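The repeated "select a node, fall back to N/A" pattern can be folded into one small helper. This is only a sketch under assumptions: FieldExtractionDemo and TextOrDefault are hypothetical names, the markup and business name are made up, and the HtmlAgilityPack NuGet package is assumed to be installed. It also decodes HTML entities, since InnerText leaves sequences such as &amp;amp; as they appear in the source.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class FieldExtractionDemo
{
    // Hypothetical helper: returns the trimmed text of the first node matching
    // the relative XPath, or the fallback when no node matches.
    public static string TextOrDefault(HtmlNode parent, string xpath, string fallback = "N/A")
    {
        var node = parent.SelectSingleNode(xpath);
        if (node == null)
        {
            return fallback;
        }

        // DeEntitize turns HTML entities such as &amp; back into plain characters.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    static void Main()
    {
        // Made-up markup and business name for illustration only.
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='l'><a class='name'> Golden Wok &amp; Grill </a></div>");
        var listing = doc.DocumentNode.SelectSingleNode("//div[@id='l']");

        Console.WriteLine(TextOrDefault(listing, ".//a[contains(@class, 'name')]"));      // prints Golden Wok & Grill
        Console.WriteLine(TextOrDefault(listing, ".//span[contains(@class, 'rating')]")); // prints N/A
    }
}
```

With a helper like this, each field in the loop becomes a single call with its own XPath.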

I've omitted some code for brevity; see the full sample below.

Dealing with Errors

We wrap our request in a try/catch block to handle issues:

try {
  // Request code
}
catch (HttpRequestException e)
{
  Console.WriteLine($"Error: {e.Message}");
}

This lets us print the error message instead of crashing the program. EnsureSuccessStatusCode is what raises HttpRequestException when the response is not a 2XX.
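HttpRequestException does not cover every failure: when a request exceeds the client's timeout, HttpClient throws TaskCanceledException instead. The sketch below handles both cases in one helper and exercises it with a stub handler so that it runs without network access. StubHandler, SafeFetchDemo, and FetchOrNullAsync are hypothetical names introduced for illustration.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Test double: returns a fixed status code and body without touching the network.
sealed class StubHandler : HttpMessageHandler
{
    private readonly HttpStatusCode _status;
    private readonly string _body;

    public StubHandler(HttpStatusCode status, string body)
    {
        _status = status;
        _body = body;
    }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = new HttpResponseMessage(_status) { Content = new StringContent(_body) };
        return Task.FromResult(response);
    }
}

static class SafeFetchDemo
{
    // Hypothetical helper: returns the response body, or null if the request failed.
    public static async Task<string> FetchOrNullAsync(HttpClient http, string requestUrl)
    {
        try
        {
            using (HttpResponseMessage response = await http.GetAsync(requestUrl))
            {
                response.EnsureSuccessStatusCode(); // throws HttpRequestException on non-2XX
                return await response.Content.ReadAsStringAsync();
            }
        }
        catch (HttpRequestException e)
        {
            Console.WriteLine($"Request failed: {e.Message}");
            return null;
        }
        catch (TaskCanceledException)
        {
            // HttpClient signals a timeout with TaskCanceledException.
            Console.WriteLine("Request timed out.");
            return null;
        }
    }

    static async Task Main()
    {
        var ok = new HttpClient(new StubHandler(HttpStatusCode.OK, "<html></html>"));
        var blocked = new HttpClient(new StubHandler(HttpStatusCode.Forbidden, ""));

        Console.WriteLine(await FetchOrNullAsync(ok, "http://example.com/") ?? "(no body)");
        Console.WriteLine(await FetchOrNullAsync(blocked, "http://example.com/") ?? "(no body)");
    }
}
```

Returning null keeps the caller simple: it can skip parsing whenever no body came back.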

Recommended Next Steps

And that's the basics of scraping Yelp listings! Here are some ideas for leveling up:

  • Scrapy framework for large scale web crawling
  • Analyzing data in Pandas
  • Building a custom API around scraped data
  • Expanding to other review sites like Google Maps

Whether you want to collect data for research or business purposes, I hope this gives you a blueprint to start scraping powerful sites like Yelp. Never hesitate to reach out with questions!

Full Code Sample

Here is the complete code sample from the article for your reference:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;
    using HtmlAgilityPack;
    
    class Program
    {
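        // Decompress gzip/deflate/br responses automatically, since we advertise them in Accept-Encoding
        // (DecompressionMethods.All assumes .NET Core 3.0 or later)
        static readonly HttpClient client = new HttpClient(new HttpClientHandler
        {
            AutomaticDecompression = System.Net.DecompressionMethods.All
        });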
    
        static async Task Main()
        {
            // URL of the Yelp search page
            string url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA";
            
            // API URL with the encoded Yelp URL
            string api_url = $"http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url={Uri.EscapeDataString(url)}";
    
            try
            {
                // Set the necessary headers
                client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36");
                client.DefaultRequestHeaders.Add("Accept-Language", "en-US,en;q=0.5");
                client.DefaultRequestHeaders.Add("Accept-Encoding", "gzip, deflate, br");
                client.DefaultRequestHeaders.Add("Referer", "https://www.google.com/");
    
                // Send an HTTP GET request to the URL
                HttpResponseMessage response = await client.GetAsync(api_url);
                response.EnsureSuccessStatusCode();
                string responseBody = await response.Content.ReadAsStringAsync();
    
                // Parse the HTML content of the page
                var doc = new HtmlDocument();
                doc.LoadHtml(responseBody);
    
                // Find all the listings
                var listings = doc.DocumentNode.SelectNodes("//div[contains(@class, 'arrange-unit__09f24__rqHTg arrange-unit-fill__09f24__CUubG css-1qn0b6x')]");
                
                Console.WriteLine(listings?.Count ?? 0);
    
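                // SelectNodes returns null when nothing matched, so stop before looping
                if (listings == null)
                {
                    Console.WriteLine("No listings found.");
                    return;
                }

                // Loop through each listing and extract information
                foreach (var listing in listings)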
                {
                    // Extract business name
                    var businessNameNode = listing.SelectSingleNode(".//a[contains(@class, 'css-19v1rkv')]");
                    string businessName = businessNameNode != null ? businessNameNode.InnerText.Trim() : "N/A";
    
                    if (businessName != "N/A")
                    {
                        // Extract rating
                        var ratingNode = listing.SelectSingleNode(".//span[contains(@class, 'css-gutk1c')]");
                        string rating = ratingNode != null ? ratingNode.InnerText.Trim() : "N/A";
    
                        // Extract price range
                        var priceRangeNode = listing.SelectSingleNode(".//span[contains(@class, 'priceRange__09f24__mmOuH')]");
                        string priceRange = priceRangeNode != null ? priceRangeNode.InnerText.Trim() : "N/A";
    
                        // Initialize num_reviews and location
                        string numReviews = "N/A";
                        string location = "N/A";
    
                        var spanElements = listing.SelectNodes(".//span[contains(@class, 'css-chan6m')]");
                        if (spanElements != null)
                        {
                            if (spanElements.Count >= 2)
                            {
                                numReviews = spanElements[0].InnerText.Trim();
                                location = spanElements[1].InnerText.Trim();
                            }
                            else if (spanElements.Count == 1)
                            {
                                var text = spanElements[0].InnerText.Trim();
                                if (int.TryParse(text, out _))
                                {
                                    numReviews = text;
                                }
                                else
                                {
                                    location = text;
                                }
                            }
                        }
    
                        // Print the extracted information
                        Console.WriteLine($"Business Name: {businessName}");
                        Console.WriteLine($"Rating: {rating}");
                        Console.WriteLine($"Number of Reviews: {numReviews}");
                        Console.WriteLine($"Price Range: {priceRange}");
                        Console.WriteLine($"Location: {location}");
                        Console.WriteLine(new string('=', 30));
                    }
                }
            }
            catch (HttpRequestException e)
            {
                Console.WriteLine($"Error: {e.Message}");
            }
        }
    }
