Scraping Yelp Business Listings with Scala

Dec 6, 2023 · 8 min read

Introduction

Yelp business listings contain structured details such as names, ratings, review counts, price ranges, and locations. This step-by-step guide for beginners shows how to extract that data with Scala, and the same techniques apply to many other websites.

Prerequisites

Before we embark on our data extraction journey, ensure you have the following prerequisites in place:

  1. Scala: Make sure you have Scala installed on your system. If not, you can download it here.
  2. ProxiesAPI: This code employs ProxiesAPI to bypass Yelp's anti-bot measures. You'll need a premium ProxiesAPI account. Sign up here and obtain your API key.

Plan of Action

Here's our comprehensive plan to extract data from Yelp business listings:

  1. URL Encoding: We'll start by encoding the Yelp search URL to handle special characters properly.
  2. Simulate Browser Request: To avoid detection as a bot by Yelp, we'll send an HTTP GET request with simulated browser headers.
  3. Save HTML Content: We'll save the HTML content to a file for further analysis.
  4. HTML Parsing: We'll use Jsoup, a powerful HTML parsing library, to parse the HTML content.
  5. Data Extraction: The heart of the process – we'll extract relevant information from the HTML.

Now, let's delve into each step with detailed explanations.

Step 1: URL Encoding

The first step is to percent-encode the Yelp search URL. The whole URL is passed to ProxiesAPI as the value of a query parameter, so characters such as ?, & and = must be escaped; otherwise they would be read as part of the API request itself.

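val url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"
val encodedUrl = java.net.URLEncoder.encode(url, "UTF-8")

// The encoded Yelp URL becomes the value of the API's url parameter
val apiUrl = s"http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=$encodedUrl"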
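To make the encoding step concrete, here is a minimal sketch that builds the search URL from raw search terms and then encodes the whole URL for use as a query parameter value. The names UrlEncodingSketch, enc, searchUrl, and proxiedUrl are illustrative and not part of any library; the endpoint format and the YOUR_AUTH_KEY placeholder are taken from the complete code at the end of this article.

```scala
import java.net.URLEncoder

object UrlEncodingSketch {
  // Encode one value for use inside a query string.
  // URLEncoder turns spaces into '+' and ',' into %2C.
  def enc(value: String): String = URLEncoder.encode(value, "UTF-8")

  // Build the Yelp search URL from raw, unencoded search terms.
  def searchUrl(description: String, location: String): String =
    s"https://www.yelp.com/search?find_desc=${enc(description)}&find_loc=${enc(location)}"

  // Wrap the target URL for the proxy. The whole target URL is encoded again,
  // so its own '?', '&', '=' and '%' cannot be confused with the API's parameters.
  def proxiedUrl(targetUrl: String, authKey: String): String =
    s"http://api.proxiesapi.com/?premium=true&auth_key=${enc(authKey)}&url=${enc(targetUrl)}"

  def main(args: Array[String]): Unit = {
    val target = searchUrl("chinese", "San Francisco, CA")
    println(target)
    println(proxiedUrl(target, "YOUR_AUTH_KEY"))
  }
}
```

Note that the location is encoded twice on purpose: once as a Yelp query value (San+Francisco%2C+CA) and once more when the full Yelp URL becomes a query value of the API URL (San%2BFrancisco%252C%2BCA).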

Step 2: Simulating a Browser Request

Yelp employs anti-bot mechanisms, so we send the HTTP GET request with browser-like headers (User-Agent, Accept-Language, Accept-Encoding, and Referer). The request goes to the ProxiesAPI endpoint, which fetches the Yelp page on our behalf.

val headers = Map(
  "User-Agent" -> "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Accept-Language" -> "en-US,en;q=0.5",
  // Omit "br" unless your HTTP client can decode Brotli responses
  "Accept-Encoding" -> "gzip, deflate",
  "Referer" -> "https://www.google.com/"
)

val response = Http(apiUrl).headers(headers).option(HttpOptions.allowUnsafeSSL).asString

Step 3: Saving HTML Content

Before we proceed, we save the HTML content to a file. This step ensures we have a local copy for analysis and future reference.

val htmlContent = response.body
val outputFile = new File("yelp_html.html")
val writer = new PrintWriter(outputFile)
writer.write(htmlContent)
writer.close()
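One caveat with the snippet above: PrintWriter writes in the platform's default character encoding and does not throw on I/O errors. As an alternative, here is a minimal sketch that writes the page with an explicit UTF-8 encoding using java.nio.file.Files. The names SaveHtmlSketch, saveHtml, and loadHtml are illustrative, and the sample HTML string is made up.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

object SaveHtmlSketch {
  // Write the HTML as UTF-8 bytes. Files.write creates the file or truncates
  // an existing one, closes it when done, and throws IOException on failure.
  def saveHtml(html: String, path: Path): Path =
    Files.write(path, html.getBytes(StandardCharsets.UTF_8))

  // Read the saved copy back, for example to re-run parsing without a new request.
  def loadHtml(path: Path): String =
    new String(Files.readAllBytes(path), StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    val saved = saveHtml("<html><body>caf\u00e9</body></html>", Paths.get("yelp_html.html"))
    println(s"Wrote ${Files.size(saved)} bytes to $saved")
  }
}
```

Loading the saved file with loadHtml lets you experiment with selectors offline instead of spending API credits on every run.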

Step 4: HTML Parsing

Now, it's time to bring in Jsoup, a popular HTML parsing library. We'll parse the HTML content to make it accessible and manipulable.

val document: Document = Jsoup.parse(htmlContent)

Step 5: Data Extraction

The highlight of our journey – data extraction. This is where we unearth valuable information from the HTML content. Let's break down the process:

Inspecting the page

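When we inspect the page, we can see that each listing sits inside a div with the classes arrange-unit__09f24__rqHTg, arrange-unit-fill__09f24__CUubG, and css-1qn0b6x. Chaining the class names with dots in the selector matches only divs that carry all three: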

val listings: Elements = document.select("div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")

The listings variable now holds a collection of HTML elements representing individual business listings on Yelp's page.
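To see how such a selector behaves without hitting Yelp, here is a minimal sketch that runs Jsoup against a small inline HTML snippet. The markup and its class names (listing, card, name, rating) are made up for illustration and are not Yelp's real ones; the object and method names are also hypothetical. It assumes Jsoup is on the classpath and Scala 2.13 or later for scala.jdk.CollectionConverters.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Element
import scala.jdk.CollectionConverters._

object SelectorSketch {
  // Made-up markup: two listings (one without a rating) and one unrelated card.
  val sampleHtml: String =
    """<div class="listing card"><a class="name">Golden Wok</a><span class="rating">4.5</span></div>
      |<div class="listing card"><a class="name">Jade Garden</a></div>
      |<div class="ad card"><a class="name">Sponsored</a></div>""".stripMargin

  // selectFirst returns null when nothing matches; Option(...) turns that into None.
  def textOf(parent: Element, css: String): Option[String] =
    Option(parent.selectFirst(css)).map(_.text().trim)

  // "div.listing.card" matches only divs that have BOTH classes.
  def namesAndRatings(html: String): List[(String, String)] =
    Jsoup.parse(html)
      .select("div.listing.card")
      .asScala
      .toList
      .flatMap { listing =>
        // Skip elements without a name; default a missing rating to "N/A".
        textOf(listing, "a.name").map { name =>
          (name, textOf(listing, "span.rating").getOrElse("N/A"))
        }
      }

  def main(args: Array[String]): Unit =
    namesAndRatings(sampleHtml).foreach { case (name, rating) =>
      println(s"$name: $rating")
    }
}
```

Wrapping selectFirst in Option is an alternative to the explicit null checks used in this article; both approaches produce the same "N/A" fallback.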

For each listing, we extract details like business name, rating, number of reviews, price range, and location. Here's how:

  • Business Name: We find the business name element and extract its text.

    val businessNameElem: Element = listing.selectFirst("a.css-19v1rkv")
    val businessName: String = if (businessNameElem != null) businessNameElem.text() else "N/A"

  • Rating: Similarly, we locate and extract the rating.

    val ratingElem: Element = listing.selectFirst("span.css-gutk1c")
    val rating: String = if (ratingElem != null) ratingElem.text() else "N/A"

  • Price Range: Extracting the price range, if available.

    val priceRangeElem: Element = listing.selectFirst("span.priceRange__09f24__mmOuH")
    val priceRange: String = if (priceRangeElem != null) priceRangeElem.text() else "N/A"

  • Number of Reviews and Location: Both values use the same span class. When two or more spans are present, the first is the number of reviews and the second is the location. When only one is present, we treat it as the number of reviews if it is all digits, and as the location otherwise.

    val spanElements: Elements = listing.select("span.css-chan6m")

    var numReviews: String = "N/A"
    var location: String = "N/A"

    // Check for at least two <span> elements
    if (spanElements.size() >= 2) {
        numReviews = spanElements.get(0).text().trim
        location = spanElements.get(1).text().trim
    } else if (spanElements.size() == 1) {
        // Handle cases with only one <span> element
        val text: String = spanElements.get(0).text().trim
        // nonEmpty guard: forall is true for an empty string
        if (text.nonEmpty && text.forall(_.isDigit)) {
            numReviews = text
        } else {
            location = text
        }
    }
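The one-or-two-span logic above can also be written as a pure function over the span texts, which makes it easy to check against sample inputs without any HTML. This is a minimal sketch; the names SpanFieldsSketch, ReviewsAndLocation, and classify are illustrative, and the sample values are made up. Unlike the snippet above, it also treats a single empty span as missing.

```scala
object SpanFieldsSketch {
  final case class ReviewsAndLocation(numReviews: String, location: String)

  private val Missing = "N/A"

  // Decide which span text is the review count and which is the location.
  def classify(spanTexts: Seq[String]): ReviewsAndLocation =
    spanTexts.map(_.trim) match {
      // Two or more spans: first is the review count, second is the location.
      case Seq(first, second, _*) => ReviewsAndLocation(first, second)
      // A single empty span carries no information.
      case Seq(only) if only.isEmpty => ReviewsAndLocation(Missing, Missing)
      // A single all-digit span is taken to be the review count.
      case Seq(only) if only.forall(_.isDigit) => ReviewsAndLocation(only, Missing)
      // Any other single span is taken to be the location.
      case Seq(only) => ReviewsAndLocation(Missing, only)
      // No spans at all.
      case _ => ReviewsAndLocation(Missing, Missing)
    }

  def main(args: Array[String]): Unit = {
    println(classify(Seq("128", "Chinatown")))
    println(classify(Seq("Mission")))
  }
}
```

With Jsoup, the input could be produced from the spans selected above by collecting each element's text() into a Seq[String].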
    

Practical Considerations and Challenges

  1. Premium Proxies: A premium ProxiesAPI account is important for getting past Yelp's anti-bot mechanisms. Free proxies might not work.
  2. Selector Clarification: Selectors can be puzzling for newcomers. They are patterns that locate specific HTML elements. The selectors in this code are copied exactly from the page as inspected. Class names such as css-1qn0b6x look auto-generated, so they may change when Yelp updates its site; re-inspect the page if the scraper finds no listings.

Main Takeaways

  • Encoding URLs and simulating browser requests are fundamental for web scraping.
  • Jsoup simplifies HTML parsing in Scala, making data extraction a breeze.
  • Selectors are like search patterns that help locate and extract data from HTML elements.
  • Premium proxies are your best bet when scraping Yelp or similar websites.

Next Steps

Having successfully extracted data from Yelp listings, you can now broaden your web scraping knowledge. Explore advanced techniques, tackle pagination, or automate data storage for more complex projects.
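As one possible starting point for data storage, here is a minimal sketch that turns extracted listings into CSV text using only the standard library. The Listing case class, the column names, and the sample row are assumptions made for illustration; they simply mirror the five fields printed by the scraper.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object CsvSketch {
  final case class Listing(
    name: String,
    rating: String,
    numReviews: String,
    priceRange: String,
    location: String
  )

  // Quote a field when it contains a comma, a quote, or a line break,
  // and double any embedded quotes.
  def escape(field: String): String =
    if (field.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
      "\"" + field.replace("\"", "\"\"") + "\""
    else field

  // One header line followed by one line per listing.
  def toCsv(listings: Seq[Listing]): String = {
    val header = "name,rating,num_reviews,price_range,location"
    val rows = listings.map { l =>
      Seq(l.name, l.rating, l.numReviews, l.priceRange, l.location).map(escape).mkString(",")
    }
    (header +: rows).mkString("\n") + "\n"
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample row; in the scraper this would come from the extraction loop.
    val csv = toCsv(Seq(Listing("Wok, Inc.", "4.5", "128", "$$", "Chinatown")))
    Files.write(Paths.get("yelp_listings.csv"), csv.getBytes(StandardCharsets.UTF_8))
    print(csv)
  }
}
```

For anything beyond simple exports, a dedicated CSV library is likely a better choice than hand-rolled escaping.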

Always respect websites' terms of service and adhere to ethical and legal guidelines when scraping.

Here's the complete code for your reference:

    import java.io.{File, PrintWriter}
    import org.jsoup.Jsoup
    import org.jsoup.nodes.{Document, Element}
    import org.jsoup.select.Elements
    import scalaj.http.{Http, HttpOptions}
    
    object YelpScraper {
      def main(args: Array[String]): Unit = {
        // URL of the Yelp search page
        val url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"
    
        // URL-encode the URL
        val encodedUrl = java.net.URLEncoder.encode(url, "UTF-8")
    
        // API URL with the encoded Yelp URL
        val apiUrl = s"http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=$encodedUrl"
    
        // Define headers to simulate a browser request
        val headers = Map(
          "User-Agent" -> "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
          "Accept-Language" -> "en-US,en;q=0.5",
          "Accept-Encoding" -> "gzip, deflate", // omit "br" unless your HTTP client can decode Brotli
          "Referer" -> "https://www.google.com/"
        )
    
        // Send an HTTP GET request to the URL with the headers
        val response = Http(apiUrl).headers(headers).option(HttpOptions.allowUnsafeSSL).asString
    
        // Check if the request was successful (2xx status code)
        if (response.isSuccess) {
          // Save the HTML content to a file
          val htmlContent = response.body
          val outputFile = new File("yelp_html.html")
          val writer = new PrintWriter(outputFile)
          writer.write(htmlContent)
          writer.close()
    
          // Parse the HTML content using Jsoup
          val document: Document = Jsoup.parse(htmlContent)
    
          // Find all the listings
          val listings: Elements = document.select("div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")
    
          println(s"Number of Listings: ${listings.size()}")
    
          // Loop through each listing and extract information
          listings.forEach { listing: Element =>
            // Extract each field, falling back to "N/A" when an element is missing
    
            // Check if business name exists
            val businessNameElem: Element = listing.selectFirst("a.css-19v1rkv")
            val businessName: String = if (businessNameElem != null) businessNameElem.text() else "N/A"
    
            // If business name is not "N/A," then print the information
            if (businessName != "N/A") {
              // Check if rating exists
              val ratingElem: Element = listing.selectFirst("span.css-gutk1c")
              val rating: String = if (ratingElem != null) ratingElem.text() else "N/A"
    
              // Check if price range exists
              val priceRangeElem: Element = listing.selectFirst("span.priceRange__09f24__mmOuH")
              val priceRange: String = if (priceRangeElem != null) priceRangeElem.text() else "N/A"
    
              // Find all <span> elements inside the listing
              val spanElements: Elements = listing.select("span.css-chan6m")
    
              // Initialize numReviews and location as "N/A"
              var numReviews: String = "N/A"
              var location: String = "N/A"
    
              // Check if there are at least two <span> elements
              if (spanElements.size() >= 2) {
                // The first <span> element is for Number of Reviews
                numReviews = spanElements.get(0).text().trim
    
                // The second <span> element is for Location
                location = spanElements.get(1).text().trim
              } else if (spanElements.size() == 1) {
                // If there's only one <span> element, check if it's for Number of Reviews or Location
                val text: String = spanElements.get(0).text().trim
                if (text.nonEmpty && text.forall(_.isDigit)) {
                  numReviews = text
                } else {
                  location = text
                }
              }
    
              // Print the extracted information
              println(s"Business Name: $businessName")
              println(s"Rating: $rating")
              println(s"Number of Reviews: $numReviews")
              println(s"Price Range: $priceRange")
              println(s"Location: $location")
              println("=" * 30)
            }
          }
        } else {
          println(s"Failed to retrieve data. Status Code: ${response.code}")
        }
      }
    }
