Scraping Business Listings from Yelp with Objective-C

Dec 6, 2023 · 8 min read

Introduction

Scraping business listings from Yelp can provide useful data about local businesses, their reviews, price ranges, locations, and more. This information can power business intelligence tools, market analysis, lead generation, and other applications.

In this comprehensive guide, we'll walk through a full Objective-C scraper to extract key details on Chinese restaurant listings in San Francisco from the Yelp website.

This is the page we'll be scraping: Yelp's search results for Chinese restaurants in San Francisco.

Here's the exact data we'll pull from each listing:

  • Business Name
  • Rating
  • Number of Reviews
  • Price Range
  • Location

We'll use the proxy API from ProxiesAPI to get past Yelp's anti-scraper protections. As we'll see, premium proxies that rotate IP addresses are essential for scraping sites like Yelp without quickly getting blocked.

    Install Dependencies

    Let's quickly cover installing the dependencies we'll need:

    TFHpple

    This Objective-C library parses HTML/XML documents and allows XPath queries to extract data. It is distributed through CocoaPods under the name hpple (TFHpple is the class it provides):

    pod 'hpple'
    

    The scraper also relies on Foundation and other standard Objective-C libraries.

    With the imports and dependencies handled, let's get to the data extraction!

    Encode the Target URL

    We first construct the target URL pointing to Yelp listings in San Francisco:

    NSString *urlString = @"https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA";
    

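    Next we percent-encode this string so it can travel as a single query parameter value. URLQueryAllowedCharacterSet is too permissive for that: it leaves & and = unescaped, so the proxy request would treat find_loc as one of its own parameters. We therefore allow only unreserved characters:

    NSCharacterSet *allowed = [NSCharacterSet characterSetWithCharactersInString:@"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"];
    NSString *encodedURLString = [urlString stringByAddingPercentEncodingWithAllowedCharacters:allowed];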
    

    This encoded URL will be embedded in the request to ProxiesAPI.
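
One detail worth sketching here: a URL that travels inside another URL's query string needs stricter escaping than a standalone URL, because characters such as & and = would otherwise be read as part of the outer query. The helper below is a minimal sketch under that assumption. The name EncodeQueryValue is hypothetical, and the allowed set is limited to the RFC 3986 unreserved characters.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of the original scraper: percent-encodes a
// string so it can be embedded as ONE query parameter value in another URL.
static NSString *EncodeQueryValue(NSString *value) {
    // RFC 3986 "unreserved" characters are the only ones left untouched.
    // Everything else, including & = ? / : + and %, is percent-encoded.
    NSCharacterSet *unreserved = [NSCharacterSet characterSetWithCharactersInString:
        @"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"];
    return [value stringByAddingPercentEncodingWithAllowedCharacters:unreserved];
}
```

Because % itself is escaped, an existing %2C in the target URL becomes %252C, which decodes back to %2C on the receiving side.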

    Use Premium Proxies

    To avoid immediately getting blocked by Yelp's bot detection, we'll use the premium proxy API from ProxiesAPI:

    NSString *apiURLString = [NSString stringWithFormat:@"http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=%@", encodedURLString];
    

    Key things to note:

  • Authenticate with your own auth_key
  • The premium=true parameter gives us access to IP-rotating residential proxies that mimic real users
  • Our target Yelp URL is appended to the end

    So each request will go through a different proxy IP, fooling Yelp into thinking it's organic user traffic. Sneaky! 😉
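
Rather than pasting the auth key into the source, one option is to read it from the environment. The sketch below assumes an environment variable named PROXIESAPI_AUTH_KEY, a name chosen for this example only. Both helper functions are hypothetical, and the URL layout is the one shown above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: reads the auth key from the PROXIESAPI_AUTH_KEY
// environment variable (an assumed name), or returns nil if it is unset/empty.
static NSString *AuthKeyFromEnvironment(void) {
    NSString *key = [NSProcessInfo processInfo].environment[@"PROXIESAPI_AUTH_KEY"];
    return key.length > 0 ? key : nil;
}

// Hypothetical helper: builds the ProxiesAPI request URL string from an auth
// key and a target URL that has ALREADY been percent-encoded.
static NSString *ProxyAPIURLString(NSString *authKey, NSString *encodedTarget) {
    return [NSString stringWithFormat:
        @"http://api.proxiesapi.com/?premium=true&auth_key=%@&url=%@",
        authKey, encodedTarget];
}
```

A caller would check AuthKeyFromEnvironment() for nil and stop with an error message before building the request.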

    Set HTTP Headers

    We next construct a dictionary of request headers that mimic a real Chrome browser:

    NSDictionary *headers = @{
      @"User-Agent": @"Mozilla/5.0...",
      @"Accept-Language": @"en-US,en;q=0.5",
      @"Accept-Encoding": @"gzip, deflate, br",
      @"Referer": @"https://www.google.com/"
    };
    

    NSMutableURLRequest accepts this dictionary as-is through its allHTTPHeaderFields property, so no conversion step is needed.

    Mimicking a real browser via headers decreases the chances of getting flagged as a bot.

    Construct NSURLRequest

    We assemble all the pieces into an NSMutableURLRequest object:

    NSURLComponents *components = [NSURLComponents componentsWithString:apiURLString];
    
    NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:components.URL];
    request.allHTTPHeaderFields = headers;
    request.HTTPMethod = @"GET";
    

    This request points to the ProxiesAPI URL, includes our mimic-browser headers, and performs a GET.
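
The request setup can be wrapped in a small function so it is easy to reuse and check. This is a sketch rather than the article's exact code: MakeScrapeRequest is a hypothetical name, and the 30-second timeout is an assumption added for illustration. It relies on the fact that NSMutableURLRequest takes headers as a plain dictionary of strings.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds a GET request for the given URL string with the
// supplied header dictionary. Returns nil if the string is not a valid URL.
static NSMutableURLRequest *MakeScrapeRequest(NSString *urlString,
                                              NSDictionary<NSString *, NSString *> *headers) {
    NSURL *url = [NSURL URLWithString:urlString];
    if (url == nil) {
        return nil;
    }
    NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
    request.HTTPMethod = @"GET";
    // allHTTPHeaderFields expects an NSDictionary of header name -> value
    request.allHTTPHeaderFields = headers;
    // Assumed value: give slow proxy hops up to 30 seconds
    request.timeoutInterval = 30.0;
    return request;
}
```

Individual headers can be read back with valueForHTTPHeaderField:, which is a quick way to confirm the dictionary was applied.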

    Make the HTTP Request

    With our request prepped, we kick it off:

    NSURLSession *session = [NSURLSession sharedSession];
    NSURLSessionDataTask *task = [session dataTaskWithRequest:request
                                              completionHandler:...];
    
    [task resume];
    

    The code handles the async response in the completion block:

  • Parsing response data
  • Checking status code
  • Extracting HTML
  • Passing HTML to TFHpple parser

    Now the fun begins: using XPath to extract fields!
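
Before moving on, the completion-block checks listed above can be sketched as one function that either returns the HTML or nil. HTMLFromResponse is a hypothetical helper name. It treats anything other than HTTP 200 as a failure, which matches the status check in the full listing at the end of this article.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: validates the pieces handed to an NSURLSession
// completion handler and returns the body as a UTF-8 string, or nil.
static NSString *HTMLFromResponse(NSData *data, NSURLResponse *response, NSError *error) {
    // Transport-level failure (DNS, timeout, connection reset, ...)
    if (error != nil || data == nil) {
        return nil;
    }
    // Only HTTP responses carry a status code
    if (![response isKindOfClass:[NSHTTPURLResponse class]]) {
        return nil;
    }
    NSInteger status = ((NSHTTPURLResponse *)response).statusCode;
    if (status != 200) {
        return nil;
    }
    // Returns nil if the bytes are not valid UTF-8
    return [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
}
```

Inside the completion block, a nil result means "log and stop"; a non-nil result is ready to be handed to the parser.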

    Extract Business Listings

    With the HTML loaded into a TFHpple parser object, we can query elements using XPath syntax.

    Inspecting the page

    When we inspect the page, we can see that each listing's div carries the classes arrange-unit__09f24__rqHTg, arrange-unit-fill__09f24__CUubG, and css-1qn0b6x.

    First we grab all the listings containers:

    NSArray *listings = [parser searchWithXPathQuery:@"//div[contains(@class,'arrange-unit__09f24__rqHTg')]"];
    

    Key things to note:

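  • The double slash // means "find this element anywhere in the document"
  • contains(@class,'arrange-unit__09f24__rqHTg') is a substring match on the class attribute, so it still matches when the div carries additional classes
  • searchWithXPathQuery: returns all matching elements in an NSArray

    Then we loop through each listing: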

    for (TFHppleElement *listing in listings) {
    
      // Extract data for this listing
    
    }
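
Since contains() is a plain substring test, a short fragment such as 'arrange-unit' would also match a longer class like 'arrange-unit-fill__09f24__CUubG'. When that is a concern, a common XPath 1.0 idiom pads the class attribute with spaces and matches a whole token. The helper below is a sketch of that idiom. XPathForClassToken is a hypothetical name and is not used in the article's scraper.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds an XPath query that matches elements whose
// class attribute contains `className` as a complete, space-separated token.
static NSString *XPathForClassToken(NSString *tag, NSString *className) {
    // concat(' ', normalize-space(@class), ' ') turns "a  b" into " a b ",
    // so searching for " b " can only match a whole class name.
    return [NSString stringWithFormat:
        @"//%@[contains(concat(' ', normalize-space(@class), ' '), ' %@ ')]",
        tag, className];
}
```

The resulting string can be passed to searchWithXPathQuery: in place of the simpler contains(@class, ...) form.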
    

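    Inside the loop, we use class-name lookups and XPath queries to extract each data field. Keep in mind that TFHpple's firstChildWithClassName: inspects only an element's direct children, so a deeply nested element may need searchWithXPathQuery: instead.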

    Extract Business Name

    For business name, we grab the h4 tag inside class css-19v1rkv:

    TFHppleElement *businessNameElement = [listing firstChildWithClassName:@"css-19v1rkv"];
    NSString *businessName = [businessNameElement text];
    

    This neatly returns just the business name string!
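
If the name element turns out to sit several levels below the listing container, a descendant search is a possible fallback. The sketch below assumes, based on my reading of TFHpple, that firstChildWithClassName: looks only at direct children. TextOfFirstMatch is a hypothetical helper built on the same searchWithXPathQuery: and text calls used elsewhere in this article.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper: returns the text of the first descendant of `listing`
// whose class attribute contains `className`, or nil if nothing matches.
static NSString *TextOfFirstMatch(TFHppleElement *listing, NSString *className) {
    // //* searches every descendant element, not just direct children
    NSString *query = [NSString stringWithFormat:@"//*[contains(@class,'%@')]", className];
    NSArray *matches = [listing searchWithXPathQuery:query];
    TFHppleElement *first = [matches firstObject];
    // Messaging nil returns nil, so an empty result yields nil here
    return [first text];
}
```

Usage would look like TextOfFirstMatch(listing, @"css-19v1rkv"), with the caller substituting a default such as "N/A" when the result is nil.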

    Extract Rating, Reviews, Price, Location

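    The other fields follow the same pattern. The review count and location share a span class, so we tell them apart by position:

    // Rating
    TFHppleElement *ratingElement = [listing firstChildWithClassName:@"css-gutk1c"];
    NSString *rating = [ratingElement text];
    
    // Price Range
    TFHppleElement *priceRangeElement = [listing firstChildWithClassName:@"priceRange__09f24__mmOuH"];
    NSString *priceRange = [priceRangeElement text];
    
    // Number of Reviews and Location (both use the css-chan6m span class)
    NSArray *spanElements = [listing searchWithXPathQuery:@"//span[contains(@class,'css-chan6m')]"];
    NSString *numReviews = @"N/A";
    NSString *location = @"N/A";
    
    if ([spanElements count] >= 2) {
      numReviews = [[spanElements[0] text] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
      location = [[spanElements[1] text] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    }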
    

    We have to handle cases where fields are missing or contain unpredictable whitespace in the HTML.
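
To make the missing-field handling concrete, here is a sketch that works on the span texts alone, so it can be exercised without a parser. It follows the same rule as the full listing at the end of this article: two spans are read as review count then location, and a lone span counts as a review count only if it starts with a positive number. The function names and dictionary keys are hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: trims whitespace and falls back to "N/A" when empty.
static NSString *TrimmedOrNA(NSString *text) {
    NSString *trimmed = [text stringByTrimmingCharactersInSet:
        [NSCharacterSet whitespaceAndNewlineCharacterSet]];
    return trimmed.length > 0 ? trimmed : @"N/A";
}

// Hypothetical helper: maps the texts of the css-chan6m spans to a dictionary
// with the keys "reviews" and "location".
static NSDictionary<NSString *, NSString *> *ReviewsAndLocation(NSArray<NSString *> *spanTexts) {
    NSString *reviews = @"N/A";
    NSString *location = @"N/A";
    if (spanTexts.count >= 2) {
        reviews = TrimmedOrNA(spanTexts[0]);
        location = TrimmedOrNA(spanTexts[1]);
    } else if (spanTexts.count == 1) {
        NSString *text = TrimmedOrNA(spanTexts[0]);
        // integerValue is 0 for text that does not start with a number
        if ([text integerValue] > 0) {
            reviews = text;
        } else {
            location = text;
        }
    }
    return @{@"reviews": reviews, @"location": location};
}
```

Keeping this logic separate from the parsing calls makes it easy to adjust when Yelp's markup shifts the order of the spans.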

    But ultimately we extract and print all the pieces we need:

    NSLog(@"Business Name: %@", businessName);
    NSLog(@"Rating: %@", rating);
    // etc...
    

    The full code also covers the case where only one of the two spans is present, and logs every field for each listing.
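
If the goal is to feed the results into another tool rather than read them in the console, one option is to collect each listing into a dictionary and serialize the whole array as JSON. This is a sketch of that idea and not part of the original scraper. ListingRecord, JSONStringFromRecords, and the key names are all hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: packs the extracted fields into a dictionary.
// Dictionary literals cannot hold nil, so missing fields become "N/A".
static NSDictionary<NSString *, NSString *> *ListingRecord(NSString *name,
                                                           NSString *rating,
                                                           NSString *numReviews,
                                                           NSString *priceRange,
                                                           NSString *location) {
    return @{
        @"name": name ?: @"N/A",
        @"rating": rating ?: @"N/A",
        @"reviews": numReviews ?: @"N/A",
        @"price_range": priceRange ?: @"N/A",
        @"location": location ?: @"N/A"
    };
}

// Hypothetical helper: serializes an array of records to a JSON string,
// or returns nil if serialization fails.
static NSString *JSONStringFromRecords(NSArray<NSDictionary *> *records) {
    NSError *error = nil;
    NSData *json = [NSJSONSerialization dataWithJSONObject:records
                                                   options:NSJSONWritingPrettyPrinted
                                                     error:&error];
    if (json == nil) {
        return nil;
    }
    return [[NSString alloc] initWithData:json encoding:NSUTF8StringEncoding];
}
```

In the scraper loop, each iteration would add one ListingRecord to an NSMutableArray, and the JSON string would be written out once after the loop.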

    Key Takeaways

    Scraping Yelp listings relies heavily on:

  • Rotating Proxies - Avoid bot blocking by mimicking organic traffic
  • Custom Headers - Masquerade requests as a real browser
  • XPath Selectors - Carefully target DOM elements to extract fields

    With these key ingredients, you can build robust Yelp scrapers in Objective-C and other languages.

    Next Steps

    To expand on this project:

  • Build a pipeline to store data in databases
  • Expand to scrape other business info from Yelp
  • Containerize the scraper for server deployment

    Hopefully this gives you a firm handle on tackling third-party sites like Yelp. Happy scraping!

    Full Objective-C Code

    Here again is the full scraper code:

    #import <Foundation/Foundation.h>
    #import "TFHpple.h"
    
    int main(int argc, const char * argv[]) {
        @autoreleasepool {
            NSString *urlString = @"https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA";
            
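            // Percent-encode the URL so it can be passed as a single query parameter value.
            // Only unreserved characters stay as-is; & and = must be escaped here.
            NSCharacterSet *allowed = [NSCharacterSet characterSetWithCharactersInString:@"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"];
            NSString *encodedURLString = [urlString stringByAddingPercentEncodingWithAllowedCharacters:allowed];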
            
            // API URL with the encoded URL
            NSString *apiURLString = [NSString stringWithFormat:@"http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=%@", encodedURLString];
            
            // Define user-agent header and other headers
            NSDictionary *headers = @{
                @"User-Agent": @"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
                @"Accept-Language": @"en-US,en;q=0.5",
                @"Accept-Encoding": @"gzip, deflate, br",
                @"Referer": @"https://www.google.com/"
            };
            
            // Create an NSURLComponents object to build the URL
            NSURLComponents *components = [NSURLComponents componentsWithString:apiURLString];
            
            // Create the request; allHTTPHeaderFields takes the headers dictionary directly
            NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:components.URL];
            request.allHTTPHeaderFields = headers;
            request.HTTPMethod = @"GET";
            
            // Send an HTTP GET request
            NSURLSession *session = [NSURLSession sharedSession];
            NSURLSessionDataTask *task = [session dataTaskWithRequest:request completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
                if (error) {
                    NSLog(@"Failed to retrieve data. Error: %@", error.localizedDescription);
                } else {
                    NSHTTPURLResponse *httpResponse = (NSHTTPURLResponse *)response;
                    if (httpResponse.statusCode == 200) {
                        NSString *htmlString = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
                        
                        // Save the HTML to a file (optional)
                        [htmlString writeToFile:@"yelp_html.html" atomically:YES encoding:NSUTF8StringEncoding error:nil];
                        
                        // Parse the HTML content using TFHpple
                        TFHpple *parser = [TFHpple hppleWithHTMLData:data];
                        
                        // Find all the listings
                        NSArray *listings = [parser searchWithXPathQuery:@"//div[contains(@class,'arrange-unit__09f24__rqHTg') and contains(@class,'arrange-unit-fill__09f24__CUubG') and contains(@class,'css-1qn0b6x')]"];
                        
                        NSLog(@"Number of Listings: %ld", (long)[listings count]);
                        
                        // Loop through each listing and extract information
                        for (TFHppleElement *listing in listings) {
                            // Extract information here
                            
                            // Extract business name
                            TFHppleElement *businessNameElement = [listing firstChildWithClassName:@"css-19v1rkv"];
                            NSString *businessName = [businessNameElement text];
                            
                            // Extract rating
                            TFHppleElement *ratingElement = [listing firstChildWithClassName:@"css-gutk1c"];
                            NSString *rating = [ratingElement text];
                            
                            // Extract price range
                            TFHppleElement *priceRangeElement = [listing firstChildWithClassName:@"priceRange__09f24__mmOuH"];
                            NSString *priceRange = [priceRangeElement text];
                            
                            // Extract number of reviews and location
                            NSArray *spanElements = [listing searchWithXPathQuery:@"//span[contains(@class,'css-chan6m')]"];
                            NSString *numReviews = @"N/A";
                            NSString *location = @"N/A";
                            
                            if ([spanElements count] >= 2) {
                                numReviews = [[spanElements[0] text] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
                                location = [[spanElements[1] text] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
                            } else if ([spanElements count] == 1) {
                                NSString *text = [[spanElements[0] text] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
                                if ([text integerValue] > 0) {
                                    numReviews = text;
                                } else {
                                    location = text;
                                }
                            }
                            
                            // Print the extracted information
                            NSLog(@"Business Name: %@", businessName);
                            NSLog(@"Rating: %@", rating);
                            NSLog(@"Number of Reviews: %@", numReviews);
                            NSLog(@"Price Range: %@", priceRange);
                            NSLog(@"Location: %@", location);
                            NSLog(@"===========================");
                        }
                    } else {
                        NSLog(@"Failed to retrieve data. Status Code: %ld", (long)httpResponse.statusCode);
                    }
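                }
                // The work is done once the response has been handled, so end the process
                exit(0);
            }];
            
            [task resume];
            
            // Keep the main thread alive until the completion handler calls exit()
            [[NSRunLoop currentRunLoop] run];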
        }
        return 0;
    }

    The code runs as-is once TFHpple is linked into your project: just insert your own ProxiesAPI auth key and try it out!
