Scraping Yelp Business Listings using Ruby - A step by step guide

Dec 6, 2023 · 10 min read

Step 1: Introduction

Imagine you're researching the best Chinese restaurants in San Francisco, and you want to gather data from Yelp to make an informed decision. Web scraping can be your secret weapon in this quest. In this article, we'll walk through the process of scraping Yelp business listings step by step.

This is the page we are talking about

We'll be using Ruby and Nokogiri, powerful tools for web scraping. So, if you're a beginner in web scraping, don't worry; we've got you covered.

Step 2: Set Up the Environment

Before we dive into the code, you'll need to make sure you have Ruby installed on your system. If you haven't already, head over to the official Ruby website (https://www.ruby-lang.org/en/documentation/installation/) to download and install Ruby.

We'll also use two libraries: net/http, which ships with Ruby (so the first command below is optional and only installs the latest gem version), and nokogiri. Open your terminal and run the following commands:

gem install net-http
gem install nokogiri

Now, create a new Ruby script file in your preferred code editor and name it something like scrape_yelp.rb. We'll use this file to organize our code.

Step 3: Import Necessary Libraries

In our Ruby script, we'll start by importing the necessary libraries. Here's what each library does:

  • net/http: This library allows us to make HTTP requests to websites.
  • nokogiri: Nokogiri is a powerful HTML parsing library that helps us extract data from web pages.

Now, let's move on to the next step.
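
Before moving on, here is a minimal sketch of what those imports might look like at the top of scrape_yelp.rb. The uri library is included as well, since the later steps call URI methods. The begin/rescue guard around nokogiri is an optional addition for illustration; it only prints a hint if the gem is missing.

```ruby
# Standard-library pieces: HTTP client and URL helpers.
require 'net/http'
require 'uri'

# Nokogiri is a third-party gem, so report a readable hint if it is absent.
begin
  require 'nokogiri'
  nokogiri_loaded = true
rescue LoadError
  nokogiri_loaded = false
  warn "Nokogiri is not installed. Run: gem install nokogiri"
end

puts "net/http loaded: #{defined?(Net::HTTP) ? 'yes' : 'no'}"
puts "nokogiri loaded: #{nokogiri_loaded ? 'yes' : 'no'}"
```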

    Step 4: Define the Yelp Search URL

    Our first task is to define the URL of the Yelp search page we want to scrape. In our case, we're searching for Chinese restaurants in San Francisco, so our URL looks like this:

    url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"
    

    But before we proceed, we need to URL-encode this URL so it can be passed safely as a query parameter to the proxy API in the next step. URI.escape was removed in Ruby 3.0, so we use URI.encode_www_form_component from the uri library instead:

    encoded_url = URI.encode_www_form_component(url)
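
To see what the encoding step does, the following sketch encodes the search URL and decodes it again. It uses URI.encode_www_form_component and URI.decode_www_form_component from Ruby's standard uri library; the helper name encode_target is made up for this illustration.

```ruby
require 'uri'

# Hypothetical helper: percent-encode a full URL so it can travel
# inside another URL's query string.
def encode_target(url)
  URI.encode_www_form_component(url)
end

url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"
encoded = encode_target(url)

puts encoded
# Reserved characters such as ":", "/", "?", "&" and "=" are now percent-encoded,
# so the proxy API will not mistake Yelp's parameters for its own.
puts URI.decode_www_form_component(encoded) == url
```

Decoding the result gives back the original URL, which is what the receiving API does before fetching the page.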
    

    Now, let's move on to the next step where we'll handle premium proxies to bypass Yelp's anti-bot mechanisms.

    Step 5: Generate the API URL with Premium Proxies

    Here's where things get interesting. Yelp, like many websites, has defenses against web scraping. To circumvent these measures and ensure uninterrupted scraping, we'll use premium proxies.

    In our code, we construct an api_url variable that combines your premium proxy key from ProxiesAPI with the encoded Yelp URL. This URL will act as an intermediary between your script and Yelp, helping you avoid detection.

    api_url = "http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=#{encoded_url}"
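
As an alternative to string interpolation, the API URL can be assembled with URI.encode_www_form, which encodes each parameter value for you. This is a sketch, not ProxiesAPI's documented client: the parameter names are taken from the URL above, while the build_api_url helper and the PROXIESAPI_AUTH_KEY environment variable are assumptions made for this example.

```ruby
require 'uri'

# Hypothetical helper that builds the proxy API URL from its parts.
def build_api_url(target_url, auth_key)
  query = URI.encode_www_form(
    "premium"  => "true",
    "auth_key" => auth_key,
    "url"      => target_url   # encoded automatically, no manual escaping needed
  )
  URI::HTTP.build(host: "api.proxiesapi.com", path: "/", query: query)
end

# Assumed convention: keep the key in an environment variable, not in the script.
auth_key = ENV.fetch("PROXIESAPI_AUTH_KEY", "YOUR_AUTH_KEY")
yelp_url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"

api_uri = build_api_url(yelp_url, auth_key)
puts api_uri
```

With this approach you pass the raw Yelp URL to the helper. Encoding it beforehand would encode it twice.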
    

    Make sure to replace YOUR_AUTH_KEY with your actual ProxiesAPI authentication key. If you don't have one, you can sign up for an account on our website (https://proxiesapi.com/).

    Step 6: Set Up Request Headers

    Before we make the request, we set headers that make it resemble a browser request, which reduces the chance of being flagged as a bot. We define a headers hash containing the user agent, accepted language, and a referrer. We deliberately leave out Accept-Encoding: Net::HTTP requests gzip and decompresses the response automatically only when that header is not set by hand, and it cannot decode Brotli (br).

    headers = {
      "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
      "Accept-Language" => "en-US,en;q=0.5",
      "Referer" => "https://www.google.com/",  # Simulate a referrer
    }
    

    These headers make your requests look more like they're coming from a real web browser.
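
To check which headers will actually be sent, you can build the request object and inspect it without contacting any server. The sketch below uses a placeholder URL (example.com) and a shortened user-agent string, both assumptions for illustration.

```ruby
require 'net/http'
require 'uri'

# Shortened, illustrative header set (not the full browser string).
headers = {
  "User-Agent"      => "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "Accept-Language" => "en-US,en;q=0.5",
  "Referer"         => "https://www.google.com/"
}

# Hypothetical helper: build a GET request without sending it.
def build_request(url, headers)
  uri = URI.parse(url)
  Net::HTTP::Get.new(uri.request_uri, headers)
end

request = build_request("http://example.com/search?q=test", headers)

# each_header yields every header currently set on the request,
# including defaults that Net::HTTP adds on its own.
request.each_header { |name, value| puts "#{name}: #{value}" }
```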

    Step 7: Send an HTTP GET Request

    With our URL, premium proxies, and headers in place, it's time to send an HTTP GET request to the Yelp search page. We use the net/http library to accomplish this.

    uri = URI.parse(api_url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true if uri.scheme == 'https'
    
    request = Net::HTTP::Get.new(uri.request_uri, headers)
    
    response = http.request(request)
    

    We'll also check if the request was successful by examining the response status code.
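
Net::HTTP waits up to 60 seconds by default for a connection and for each read, which can leave a scraper hanging. The sketch below configures shorter timeouts before any request is made. The helper name build_client and the timeout values are assumptions chosen for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: create a Net::HTTP client with explicit timeouts.
# Nothing is sent until you call client.request(...).
def build_client(uri, open_timeout: 10, read_timeout: 30)
  client = Net::HTTP.new(uri.host, uri.port)
  client.use_ssl = (uri.scheme == "https")
  client.open_timeout = open_timeout   # seconds to wait for the connection
  client.read_timeout = read_timeout   # seconds to wait for each read
  client
end

uri = URI.parse("http://api.proxiesapi.com/?premium=true")
client = build_client(uri)
puts "#{client.address}:#{client.port} ssl=#{client.use_ssl?}"
```

Timeouts surface as Net::OpenTimeout or Net::ReadTimeout, which you can rescue around the call to client.request.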

    Step 8: Save the HTML Response

    Now that we've made a successful request, we need to save the HTML response to a file for further analysis. In our code, we create a file named yelp_html.html and write the response body to it.

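    # Write in binary mode: response.body is raw bytes, and a text-mode
    # write with encoding: "utf-8" can raise an encoding conversion error.
    File.open("yelp_html.html", "wb") do |file|
      file.write(response.body)
    end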
    

    Keeping an unmodified copy of the page makes it easier to debug selectors later and to check exactly what Yelp returned.
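
A saved copy also lets you re-run the parsing steps without spending another API request. Here is a minimal sketch of that round trip, using a made-up HTML string in place of response.body and a temporary directory so nothing is left behind; the helper names are hypothetical.

```ruby
require 'tmpdir'

# Hypothetical helpers for caching a page on disk.
def save_html(path, body)
  File.binwrite(path, body)   # binary mode: bytes are stored unchanged
end

def load_html(path)
  File.binread(path)
end

sample_body = "<html><body><h1>Caf\u00e9 listings</h1></body></html>"

Dir.mktmpdir do |dir|
  path = File.join(dir, "yelp_html.html")
  save_html(path, sample_body)
  reloaded = load_html(path)
  puts "Saved #{File.size(path)} bytes"
  puts "Same bytes: #{reloaded == sample_body.b}"
end
```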

    Step 9: Parsing HTML with Nokogiri

    With the Yelp HTML data saved, we can move on to parsing it with Nokogiri. Nokogiri will help us navigate the HTML structure and extract the information we need.

    doc = Nokogiri::HTML(response.body)
    

    Now, let's tackle the next step: extracting business information.

    Step 10: Extract Business Information

    Our goal is to extract details like the business name, rating, number of reviews, price range, and location for each listing on the Yelp page. Let's walk through this step by step.

    Inspecting the page

    When we inspect the page, we can see that each listing's div has the classes arrange-unit__09f24__rqHTg, arrange-unit-fill__09f24__CUubG, and css-1qn0b6x.

    Inside our code, we find all the listings using CSS selectors:

    listings = doc.css('div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x')
    

    These CSS classes are specific to Yelp's HTML structure and may change over time, so make sure they match the current structure.

    Now, we loop through each listing and extract its details. Each snippet below runs inside the loop, where listing is the current element. Here's a detailed breakdown of each extraction:

  • Business Name: We search for an anchor tag with a specific class and extract the text. If not found, we default to "N/A."

    business_name_elem = listing.at_css('a.css-19v1rkv')
    business_name = business_name_elem ? business_name_elem.text : "N/A"

  • Rating: We search for a specific span element and extract the text. If not found, we default to "N/A."

    rating_elem = listing.at_css('span.css-gutk1c')
    rating = rating_elem ? rating_elem.text : "N/A"

  • Price Range: We search for a specific span element and extract the text. If not found, we default to "N/A."

    price_range_elem = listing.at_css('span.priceRange__09f24__mmOuH')
    price_range = price_range_elem ? price_range_elem.text : "N/A"

  • Number of Reviews and Location: We search for all span elements with a specific class. If there are at least two, we assume the first is the number of reviews and the second is the location. We trim the text to remove extra whitespace. If there's only one span, we check whether it is a number (reviews) or not (location) and assign it accordingly.

    span_elements = listing.css('span.css-chan6m')
    num_reviews = "N/A"
    location = "N/A"

    if span_elements.length >= 2
      num_reviews = span_elements[0].text.strip
      location = span_elements[1].text.strip
    elsif span_elements.length == 1
      text = span_elements[0].text.strip
      if text.match?(/^\d+$/)
        num_reviews = text
      else
        location = text
      end
    end
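
The fallback logic for review count and location can be pulled into a small function that works on plain strings, which makes it easy to try without any HTML. This is a sketch: classify_spans is a hypothetical helper, and the sample values are invented.

```ruby
# Hypothetical helper: given the stripped text of the matching spans,
# return [num_reviews, location] using the same rules as the loop above.
def classify_spans(texts)
  num_reviews = "N/A"
  location = "N/A"

  if texts.length >= 2
    num_reviews, location = texts[0], texts[1]
  elsif texts.length == 1
    # A string made only of digits is treated as a review count.
    if texts[0].match?(/\A\d+\z/)
      num_reviews = texts[0]
    else
      location = texts[0]
    end
  end

  [num_reviews, location]
end

p classify_spans(["128", "Chinatown"])
p classify_spans(["128"])
p classify_spans(["Chinatown"])
p classify_spans([])
```

In the scraper, it could be called as classify_spans(span_elements.map { |s| s.text.strip }).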
    

    Now that we've successfully extracted the business information, it's time to move to the next step: printing this information.

    Step 11: Printing Extracted Information

    For each business listing, we print the extracted information using the puts method. Here's how we format and display the data:

    puts "Business Name: #{business_name}"
    puts "Rating: #{rating}"
    puts "Number of Reviews: #{num_reviews}"
    puts "Price Range: #{price_range}"
    puts "Location: #{location}"
    puts "=" * 30
    

    This code ensures that the extracted data is presented clearly.
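
If you want to keep the results instead of only printing them, one option is to collect each listing into a hash and serialize the list as JSON. The sketch below uses Ruby's bundled json library with invented sample records; the field names and the listings_to_json helper are assumptions for illustration.

```ruby
require 'json'

# Hypothetical helper: turn an array of listing hashes into pretty JSON.
def listings_to_json(listings)
  JSON.pretty_generate(listings)
end

# Invented sample data standing in for values extracted in the loop.
results = [
  { "name" => "Example Dumpling House", "rating" => "4.5",
    "num_reviews" => "128", "price_range" => "$$", "location" => "Chinatown" },
  { "name" => "Example Noodle Bar", "rating" => "N/A",
    "num_reviews" => "N/A", "price_range" => "N/A", "location" => "Mission" }
]

json = listings_to_json(results)
puts json
# File.write("yelp_listings.json", json)  # uncomment to save to disk
```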

    Step 12: Error Handling

    While we've covered a lot of ground, it's crucial to handle potential errors gracefully. In case the request to Yelp fails, we check the response status code and print an error message.

    if response.code.to_i == 200
      # Continue with data extraction and printing
    else
      puts "Failed to retrieve data. Status Code: #{response.code}"
    end
    

    This ensures that if something goes wrong, you'll know about it.
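
Beyond a single 200 check, it can help to distinguish a few failure cases, since a block or rate limit usually calls for a different reaction than a server error. The sketch below relies on Net::HTTP's response class hierarchy; the describe_response helper and its messages are hypothetical, and the responses are constructed by hand so the example runs offline.

```ruby
require 'net/http'

# Hypothetical helper: map a Net::HTTP response to a short description.
def describe_response(response)
  case response
  when Net::HTTPSuccess         then "ok"
  when Net::HTTPRedirection     then "redirected to #{response['location']}"
  when Net::HTTPForbidden       then "blocked (403)"
  when Net::HTTPTooManyRequests then "rate limited (429)"
  when Net::HTTPClientError     then "client error (#{response.code})"
  when Net::HTTPServerError     then "server error (#{response.code})"
  else "unexpected status (#{response.code})"
  end
end

# Build responses by hand instead of making network calls.
ok        = Net::HTTPOK.new("1.1", "200", "OK")
forbidden = Net::HTTPForbidden.new("1.1", "403", "Forbidden")
throttled = Net::HTTPTooManyRequests.new("1.1", "429", "Too Many Requests")

[ok, forbidden, throttled].each { |r| puts describe_response(r) }
```

The specific classes (403, 429) are listed before Net::HTTPClientError because they are subclasses of it, and case picks the first branch that matches.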

    Step 13: Conclusion and Next Steps

    Congratulations! You've successfully scraped Yelp business listings using Ruby and Nokogiri while bypassing Yelp's anti-bot mechanisms with premium proxies. Here are some key takeaways from our journey:

  • Web scraping is a powerful technique for gathering data from websites.
  • Premium proxies are essential to avoid detection and blocking by websites like Yelp.
  • Simulating browser requests with headers helps you scrape data without being flagged as a bot.
  • Nokogiri is a handy library for parsing HTML and extracting information.

    Now that you have the data, you can use it for various purposes, such as data analysis, visualization, or simply making an informed decision about where to enjoy some delicious Chinese cuisine in San Francisco.

    Full Code:

    require 'net/http'
    require 'uri'
    require 'nokogiri'
    
    # URL of the Yelp search page
    url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"
    
    # URL-encode the URL so it can be passed as a query parameter
    # (URI.escape was removed in Ruby 3.0)
    encoded_url = URI.encode_www_form_component(url)
    
    # API URL with the encoded Yelp URL
    api_url = "http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=#{encoded_url}"
    
    # Define headers to simulate a browser request.
    # Accept-Encoding is left unset so Net::HTTP can negotiate gzip and
    # decompress the response body automatically.
    headers = {
      "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
      "Accept-Language" => "en-US,en;q=0.5",
      "Referer" => "https://www.google.com/",  # Simulate a referrer
    }
    
    # Send an HTTP GET request to the URL with the headers
    uri = URI.parse(api_url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true if uri.scheme == 'https'
    
    request = Net::HTTP::Get.new(uri.request_uri, headers)
    
    response = http.request(request)
    
    # Save the raw response in binary mode so the bytes are written unchanged
    File.open("yelp_html.html", "wb") do |file|
      file.write(response.body)
    end
    
    # Check if the request was successful (status code 200)
    if response.code.to_i == 200
      # Parse the HTML content of the page using Nokogiri
      doc = Nokogiri::HTML(response.body)
    
      # Find all the listings
      listings = doc.css('div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x')
      puts listings.length
    
      # Loop through each listing and extract information
      listings.each do |listing|
        # Check if business name exists
        business_name_elem = listing.at_css('a.css-19v1rkv')
        business_name = business_name_elem ? business_name_elem.text : "N/A"
    
        # If business name is not "N/A," then print the information
        if business_name != "N/A"
          # Check if rating exists
          rating_elem = listing.at_css('span.css-gutk1c')
          rating = rating_elem ? rating_elem.text : "N/A"
    
          # Check if price range exists
          price_range_elem = listing.at_css('span.priceRange__09f24__mmOuH')
          price_range = price_range_elem ? price_range_elem.text : "N/A"
    
          # Find all <span> elements inside the listing
          span_elements = listing.css('span.css-chan6m')
    
          # Initialize num_reviews and location as "N/A"
          num_reviews = "N/A"
          location = "N/A"
    
          # Check if there are at least two <span> elements
          if span_elements.length >= 2
            # The first <span> element is for Number of Reviews
            num_reviews = span_elements[0].text.strip
    
            # The second <span> element is for Location
            location = span_elements[1].text.strip
          elsif span_elements.length == 1
            # If there's only one <span> element, check if it's for Number of Reviews or Location
            text = span_elements[0].text.strip
            if text.match?(/^\d+$/)
              num_reviews = text
            else
              location = text
            end
          end
    
          # Print the extracted information
          puts "Business Name: #{business_name}"
          puts "Rating: #{rating}"
          puts "Number of Reviews: #{num_reviews}"
          puts "Price Range: #{price_range}"
          puts "Location: #{location}"
          puts "=" * 30
        end
      end
    else
      puts "Failed to retrieve data. Status Code: #{response.code}"
    end
    
