Scraping Yelp Business Listings with C++

Dec 6, 2023 · 10 min read

Web scraping is the technique of automatically extracting data from websites through code. In this article, we'll walk through a full code example of scraping business listing data from Yelp using C++ and two helpful libraries - libcurl and Gumbo.

The goal is to extract key details like the business name, rating, number of reviews, price range, and location for each search result. This data could then be used for analysis, visualizations, lead generation, and more.

Why Scrape Yelp?

Yelp has rich data on local businesses across many categories like restaurants, shops, services, etc. Manually extracting this data would be tedious. Scraping allows us to extract thousands of listings programmatically. The structured data can then be used for all sorts of purposes:

  • Analyze trends across categories, locations, ratings
  • Create maps/graphs visualizing businesses
  • Generate sales leads by scraping contact info
  • Compare with your own business data
  • Aggregate data into a searchable local business database

Of course, check Yelp's terms and conditions before scraping. Now let's dive into the code!

    Imports and Initialization

    We start by including the header files we need for I/O streams, strings, vectors, libcurl, and Gumbo:

    #include <iostream>
    #include <fstream>
    #include <string>
    #include <vector>
    #include <algorithm>
    #include <curl/curl.h>
    #include <gumbo.h>
    

    Then we initialize libcurl, which handles making web requests:

    CURL* curl = curl_easy_init();
    if (!curl) {
        std::cerr << "Failed to initialize libcurl.\n";
        return 1;
    }
    

    And that's it for setup! We'll explain Gumbo when we get to parsing.

    Constructing the URLs

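    We need two URLs - one for the Yelp search page, and one for the ProxiesAPI endpoint that fetches that page on our behalf:

    // Yelp search page
    std::string url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA";
    
    // Encode URL for API; curl_easy_escape returns a C string that must be released with curl_free
    char* escaped = curl_easy_escape(curl, url.c_str(), static_cast<int>(url.length()));
    std::string encoded_url = escaped ? escaped : "";
    curl_free(escaped);
    
    // Proxy API URL (replace YOUR_AUTH_KEY with your own key)
    std::string api_url = "http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=" + encoded_url;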
    

    The proxy API routes each request through a proxy, which makes it less likely that Yelp flags the traffic as coming from a scraper.

    We URL-encode the Yelp URL before inserting it into the proxy API URL. Otherwise characters such as ? and & in the Yelp URL would be read as part of the proxy API's own query string.
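
    To make the encoding step concrete, here is a minimal hand-rolled sketch of the same idea. The percent_encode helper is hypothetical and exists only for illustration; it assumes the usual rule of leaving the RFC 3986 unreserved characters alone and escaping every other byte. Real code should keep calling curl_easy_escape.

```cpp
#include <string>

// Hypothetical helper for illustration only; not part of libcurl.
// RFC 3986 "unreserved" characters are copied through unchanged.
static bool is_unreserved(unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '-' || c == '.' || c == '_' || c == '~';
}

// Rewrites every other byte as %XX using uppercase hex digits.
std::string percent_encode(const std::string& input) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(input.size() * 3);
    for (unsigned char c : input) {
        if (is_unreserved(c)) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back('%');
            out.push_back(hex[c >> 4]);
            out.push_back(hex[c & 0x0F]);
        }
    }
    return out;
}
```

    Note that a % already present in the input is escaped again (as %25), so the proxy receives the Yelp URL exactly as it was written.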

    Setting CURL Options

    To mimic a real browser, we construct header strings and pass them to CURL:

    struct curl_slist* headers = NULL;
    headers = curl_slist_append(headers, "User-Agent: Mozilla/5.0 (...) Chrome/58.0");
    headers = curl_slist_append(headers, "Accept-Language: en-US,en;q=0.5");
    
    curl_easy_setopt(curl, CURLOPT_URL, api_url.c_str());
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    

    The key things set here:

  • CURLOPT_URL: the URL to request, here api_url
  • CURLOPT_HTTPHEADER: the list of headers sent with the request
  • User-Agent: identifies the request as coming from Chrome

    Without these headers, Yelp may detect us as a bot!

    Making the Request

    With CURL configured, we can make the GET request and store the response:

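    std::string response_data;
    
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
    // Tell libcurl which buffer to hand to WriteCallback
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response_data);
    
    CURLcode res = curl_easy_perform(curl);
    
    if (res != CURLE_OK) {
        std::cerr << "Request failed, error: " << curl_easy_strerror(res) << "\n";
        return 1;
    }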
    

    WriteCallback handles saving the response data into our string buffer. We check res to see if the request succeeded before moving on.
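
    As a rough sketch of what such a callback looks like, the version below follows the parameter types libcurl documents for write callbacks (char*, size_t, size_t, void*). That differs slightly from the std::string* parameter used in the full listing, and the cast inside is an assumption that the user pointer really is a std::string.

```cpp
#include <cstddef>
#include <string>

// libcurl may invoke the write callback several times per response, each time
// with size * nmemb bytes that are not null-terminated.
size_t WriteCallback(char* contents, size_t size, size_t nmemb, void* userdata) {
    const size_t total_size = size * nmemb;
    // Assumption: userdata is the std::string* registered via CURLOPT_WRITEDATA.
    std::string* buffer = static_cast<std::string*>(userdata);
    buffer->append(contents, total_size);
    // Returning a different byte count tells libcurl to abort the transfer.
    return total_size;
}
```

    The buffer only reaches the callback if it is registered with curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response_data) before the request is performed.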

    At this point response_data contains the full HTML of the Yelp search results page for San Francisco Chinese restaurants.

    Parsing the HTML

    To extract structured data, we need to parse the HTML content and identify the key data elements. This is where the Gumbo parser comes in handy.

    First we parse and get the root node:

    GumboOutput* output = gumbo_parse(response_data.c_str());
    
    if (!output) {
        std::cerr << "Failed to parse response.";
        return 1;
    }
    
    GumboNode* root = output->root;
    

    root contains the whole HTML document as a node tree we can now traverse.

    Extracting Listing Data

    The key part is identifying patterns in the HTML related to each business listing. We can then extract data for all listings:

    Inspecting the page

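    When we inspect the page, we can see that the div wrapping each listing has the classes arrange-unit__09f24__rqHTg arrange-unit-fill__09f24__CUubG css-1qn0b6x. Generated class names like these may change when Yelp updates its front end, so confirm them in your browser's developer tools before running the scraper.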

    GumboAttribute* class_attr = gumbo_get_attribute(&current->v.element.attributes, "class");
    if (class_attr && std::string(class_attr->value) == "arrange-unit__09f24__rqHTg arrange-unit-fill__09f24__CUubG css-1qn0b6x") {
        listings.push_back(current);
    }
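
    The comparison above only matches when the whole class attribute is identical, including the order of the class names. A more forgiving option is to test for a single class token. The has_class helper below is a hypothetical sketch of that idea, not something Gumbo provides.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: true when `token` is one of the whitespace-separated
// class names in `class_attr`, regardless of order or extra classes.
bool has_class(const std::string& class_attr, const std::string& token) {
    std::istringstream stream(class_attr);
    std::string name;
    while (stream >> name) {
        if (name == token) return true;
    }
    return false;
}
```

    With this, a check such as has_class(class_attr->value, "css-1qn0b6x") keeps working if the other class names on the div change.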

    The key things to understand:

  • Find all the listings by a common CSS class
  • Locate each data field by matching the class attribute of elements inside the listing
  • Text lives in text nodes (v.text.text), which are children of the element that wraps them

    With these patterns, we can extract any data we want from structured HTML.

    The full code contains additional examples for other data points like number of reviews and price range.

    Next Steps

    The final code puts all the pieces together to scrape Yelp results and extract business listings.

    Some ideas for building on this:

  • Iterate through multiple pages of results
  • Save data to CSV/JSON/database
  • Expand fields extracted
  • Compare vs. your own business data
  • Visualize trends on maps and graphs

    Web scraping opens up many possibilities for harnessing structured public data! Check the laws in your area and respect sites' terms before scraping.

    Here is the full code for reference:

    #include <iostream>
    #include <fstream>
    #include <string>
    #include <vector>
    #include <algorithm>
    #include <curl/curl.h>
    #include <gumbo.h>
    
    // Function to write HTTP response data to a string
    size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
        size_t total_size = size * nmemb;
        output->append(static_cast<char*>(contents), total_size);
        return total_size;
    }
    
    int main() {
        // URL of the Yelp search page
        std::string url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA";
    
        // Initialize libcurl
        CURL* curl = curl_easy_init();
        if (!curl) {
            std::cerr << "Failed to initialize libcurl." << std::endl;
            return 1;
        }
    
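        // URL-encode the URL; curl_easy_escape returns a C string that must be released with curl_free
        char* escaped = curl_easy_escape(curl, url.c_str(), static_cast<int>(url.length()));
        std::string encoded_url = escaped ? escaped : "";
        curl_free(escaped);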
    
        // API URL with the encoded Yelp URL
        std::string api_url = "http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=" + encoded_url;
    
        // Define a user-agent header to simulate a browser request
        struct curl_slist* headers = NULL;
        headers = curl_slist_append(headers, "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36");
        headers = curl_slist_append(headers, "Accept-Language: en-US,en;q=0.5");
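        headers = curl_slist_append(headers, "Referer: https://www.google.com/");
    
        // Configure libcurl
        curl_easy_setopt(curl, CURLOPT_URL, api_url.c_str());
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
        // An empty string requests every encoding this libcurl build supports
        // and lets libcurl decompress the response before it reaches the callback
        curl_easy_setopt(curl, CURLOPT_ACCEPT_ENCODING, "");
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
        // Tell libcurl which buffer to hand to WriteCallback
        std::string response_data;
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response_data);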
    
        // Send an HTTP GET request to the URL with the headers
        CURLcode res = curl_easy_perform(curl);
    
        // Check that the transfer completed (CURLE_OK reports transport success, not the HTTP status code)
        if (res == CURLE_OK) {
            // Write the response data to a file
            std::ofstream file("yelp_html.html", std::ios::out | std::ios::binary);
            file.write(response_data.c_str(), response_data.length());
            file.close();
    
            // Use Gumbo to parse the HTML content
            GumboOutput* output = gumbo_parse(response_data.c_str());
            if (output) {
                GumboNode* root = output->root;
    
                // Depth-first search helper: collects every element under `start` whose
                // class attribute equals `cls`. Gumbo nodes have no sibling pointer, so
                // children are reached through v.element.children.
                auto find_by_class = [](GumboNode* start, const std::string& cls) -> std::vector<GumboNode*> {
                    std::vector<GumboNode*> matches;
                    std::vector<GumboNode*> stack{start};
                    while (!stack.empty()) {
                        GumboNode* node = stack.back();
                        stack.pop_back();
                        if (node->type != GUMBO_NODE_ELEMENT) continue;
                        GumboAttribute* class_attr = gumbo_get_attribute(&node->v.element.attributes, "class");
                        if (class_attr && cls == class_attr->value) matches.push_back(node);
                        // Push children in reverse so they are visited in document order
                        const GumboVector& children = node->v.element.children;
                        for (unsigned int i = children.length; i > 0; --i) {
                            stack.push_back(static_cast<GumboNode*>(children.data[i - 1]));
                        }
                    }
                    return matches;
                };
    
                // Returns the concatenated text of all text nodes beneath `start`
                auto inner_text = [](GumboNode* start) -> std::string {
                    std::string text;
                    std::vector<GumboNode*> stack{start};
                    while (!stack.empty()) {
                        GumboNode* node = stack.back();
                        stack.pop_back();
                        if (node->type == GUMBO_NODE_TEXT) {
                            text += node->v.text.text;
                        } else if (node->type == GUMBO_NODE_ELEMENT) {
                            const GumboVector& children = node->v.element.children;
                            for (unsigned int i = children.length; i > 0; --i) {
                                stack.push_back(static_cast<GumboNode*>(children.data[i - 1]));
                            }
                        }
                    }
                    return text;
                };
    
                // Find all the listings (div elements with the listing classes)
                std::vector<GumboNode*> listings;
                for (GumboNode* node : find_by_class(root, "arrange-unit__09f24__rqHTg arrange-unit-fill__09f24__CUubG css-1qn0b6x")) {
                    if (node->v.element.tag == GUMBO_TAG_DIV) {
                        listings.push_back(node);
                    }
                }
    
                std::cout << "Number of Listings: " << listings.size() << std::endl;
    
                // Loop through each listing and extract information
                for (GumboNode* listing : listings) {
                    // Initialize variables to store extracted information
                    std::string business_name = "N/A";
                    std::string rating = "N/A";
                    std::string num_reviews = "N/A";
                    std::string price_range = "N/A";
                    std::string location = "N/A";
    
                    // Extract business name if found
                    std::vector<GumboNode*> name_elems = find_by_class(listing, "css-19v1rkv");
                    if (!name_elems.empty()) {
                        business_name = inner_text(name_elems.front());
                    }
    
                    // Extract rating if found
                    std::vector<GumboNode*> rating_elems = find_by_class(listing, "css-gutk1c");
                    if (!rating_elems.empty()) {
                        rating = inner_text(rating_elems.front());
                    }
    
                    // Spans for number of reviews and location share one class:
                    // the first one starting with a digit is taken as the review count
                    for (GumboNode* span : find_by_class(listing, "css-chan6m")) {
                        std::string text = inner_text(span);
                        if (text.empty()) continue;
                        if (num_reviews == "N/A" && text[0] >= '0' && text[0] <= '9') {
                            num_reviews = text;
                        } else {
                            location = text;
                        }
                    }
    
                    // Extract price range if found
                    std::vector<GumboNode*> price_elems = find_by_class(listing, "priceRange__09f24__mmOuH");
                    if (!price_elems.empty()) {
                        price_range = inner_text(price_elems.front());
                    }
    
                    // Print the extracted information
                    std::cout << "Business Name: " << business_name << std::endl;
                    std::cout << "Rating: " << rating << std::endl;
                    std::cout << "Number of Reviews: " << num_reviews << std::endl;
                    std::cout << "Price Range: " << price_range << std::endl;
                    std::cout << "Location: " << location << std::endl;
                    std::cout << "=============================" << std::endl;
                }
    
                // Clean up Gumbo
                gumbo_destroy_output(&kGumboDefaultOptions, output);
            }
        } else {
            std::cerr << "Failed to retrieve data: " << curl_easy_strerror(res) << std::endl;
        }
    
        // Clean up libcurl
        curl_easy_cleanup(curl);
        curl_slist_free_all(headers);
    
        return 0;
    }
