Scraping Yelp Business Listings in Kotlin

Dec 6, 2023 ยท 6 min read

Yelp is a popular platform for discovering and reviewing local businesses. Often people want to extract data from Yelp listings for further analysis - number of reviews, ratings, price ranges etc.

This guide will walk you through a full code sample for scraping key data points from Yelp listings using Kotlin. We'll specifically focus on restaurants in San Francisco.

This is the page we are talking about

Importing Required Libraries

We need a few libraries to send HTTP requests and parse the responses:

import okhttp3.OkHttpClient
import okhttp3.Request
import org.jsoup.Jsoup

OkHttpClient sends requests and gets responses. Request constructs the API calls. Jsoup parses HTML from the responses.

Constructing the Yelp URL

Let's search for Chinese restaurants in San Francisco:

val url = "<https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA>"

This URL will display our listings.

Encoding the URL

We need to encode this URL before sending to the API:

val encodedUrl = java.net.URLEncoder.encode(url, "UTF-8")

Encoding converts characters like spaces into %20. This format is required for sending data over HTTP.

Setting Request Headers

We need to mimic a browser's headers to bypass bot detection:

val headers = mapOf(
    "User-Agent" to "Mozilla/5.0...",
    "Accept-Language" to "en-US,en;q=0.5",
    "Accept-Encoding" to "gzip, deflate, br",
    "Referer" to "<https://www.google.com/>"
)

The headers make it seem like a real browser is sending the request.

Creating the HTTP Client

Let's create the OkHttpClient:

val client = OkHttpClient()

This will handle sending requests and receiving responses for us.

Building the API Request

We'll use ProxiesAPI to proxy our requests:

val apiKey = "YOUR_AUTH_KEY"

val apiEndpoint = "<http://api.proxiesapi.com/?premium=true&auth_key=$apiKey&url=$encodedUrl>"

val request = Request.Builder()
    .url(apiEndpoint)
    .headers(headers)
    .build()

ProxiesAPI allows bypassing Yelp's bot protection with residential proxies. We pass auth_key, encoded URL and headers to build the request.

Sending the Request

Let's send an HTTP GET request:

val response = client.newCall(request).execute()

This will send the request and store the API response.

Parsing the Response with Jsoup

We can extract data by parsing the HTML:

val html = response.body?.string()
val document = Jsoup.parse(html)

Jsoup takes the HTML content and creates a document model for traversal.

Extracting Listing Data

Inspecting the page

When we inspect the page we can see that the div has classes called arrange-unit__09f24__rqHTg arrange-unit-fill__09f24__CUubG css-1qn0b6x

Let's find all listings:

val listings = document.select("div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")

This selector specifically targets the containers of individual listings on the page.

We loop through each listing container:

for (listing in listings) {

    // Extract data from each listing

}

Inside, we can use additional selectors to extract information:

Business Name

val businessNameElem = listing.selectFirst("a.css-19v1rkv")
val businessName = businessNameElem?.text() ?: "N/A"

This selector finds the business name anchor tag. We fallback to "N/A" if not found.

Rating

val ratingElem = listing.selectFirst("span.css-gutk1c")
val rating = ratingElem?.text() ?: "N/A"

The rating is inside a specific span. Again fallback to "N/A".

And similarly for price range, number of reviews, location etc. Each data is wrapped in specific tags we target with selectors.

Finally, we print out the extracted information:

println("Business Name: $businessName")
println("Rating: $rating")
// etc

Full code:

import okhttp3.OkHttpClient
import okhttp3.Request
import org.jsoup.Jsoup

fun main() {
    // URL of the Yelp search page
    val url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"

    // Encode the URL
    val encodedUrl = java.net.URLEncoder.encode(url, "UTF-8")

    // API URL with the encoded Walmart URL
    val apiKey = "YOUR_AUTH_KEY"
    val apiEndpoint = "http://api.proxiesapi.com/?premium=true&auth_key=$apiKey&url=$encodedUrl"

    // Define user-agent header to simulate a browser request
    val headers = mapOf(
        "User-Agent" to "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Accept-Language" to "en-US,en;q=0.5",
        "Accept-Encoding" to "gzip, deflate, br",
        "Referer" to "https://www.google.com/"
    )

    // Create an OkHttpClient instance
    val client = OkHttpClient()

    // Build the request with headers
    val request = Request.Builder()
        .url(apiEndpoint)
        .headers(headers.map { (key, value) -> key to value })
        .build()

    // Send an HTTP GET request
    val response = client.newCall(request).execute()

    // Check if the request was successful (status code 200)
    if (response.isSuccessful) {
        // Parse the HTML content of the page using Jsoup
        val html = response.body?.string()
        val document = Jsoup.parse(html)

        // Find all the listings
        val listings = document.select("div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")
        println(listings.size)

        // Loop through each listing and extract information
        for (listing in listings) {
            // Assuming you've already extracted the information as shown in your Python code

            // Check if business name exists
            val businessNameElem = listing.selectFirst("a.css-19v1rkv")
            val businessName = businessNameElem?.text() ?: "N/A"

            // If business name is not "N/A," then print the information
            if (businessName != "N/A") {
                // Check if rating exists
                val ratingElem = listing.selectFirst("span.css-gutk1c")
                val rating = ratingElem?.text() ?: "N/A"

                // Check if price range exists
                val priceRangeElem = listing.selectFirst("span.priceRange__09f24__mmOuH")
                val priceRange = priceRangeElem?.text() ?: "N/A"

                // Find all <span> elements inside the listing
                val spanElements = listing.select("span.css-chan6m")

                // Initialize num_reviews and location as "N/A"
                var numReviews = "N/A"
                var location = "N/A"

                // Check if there are at least two <span> elements
                if (spanElements.size >= 2) {
                    // The first <span> element is for Number of Reviews
                    numReviews = spanElements[0].text().trim()

                    // The second <span> element is for Location
                    location = spanElements[1].text().trim()
                } else if (spanElements.size == 1) {
                    // If there's only one <span> element, check if it's for Number of Reviews or Location
                    val text = spanElements[0].text().trim()
                    if (text.matches("\\d+".toRegex())) {
                        numReviews = text
                    } else {
                        location = text
                    }
                }

                // Print the extracted information
                println("Business Name: $businessName")
                println("Rating: $rating")
                println("Number of Reviews: $numReviews")
                println("Price Range: $priceRange")
                println("Location: $location")
                println("=" * 30)
            }
        }
    } else {
        println("Failed to retrieve data. Status Code: ${response.code}")
    }
}

Key Takeaways

  • Used specific selectors to target data elements
  • Handled missing data with fallback values
  • Needed headers and proxy API to bypass bot protection
  • Jsoup parsed HTML responses into document model
  • Printed extracted business information
  • Browse by tags:

    Browse by language:

    The easiest way to do Web Scraping

    Get HTML from any page with a simple API call. We handle proxy rotation, browser identities, automatic retries, CAPTCHAs, JavaScript rendering, etc automatically for you


    Try ProxiesAPI for free

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    <!doctype html>
    <html>
    <head>
        <title>Example Domain</title>
        <meta charset="utf-8" />
        <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1" />
    ...

    X

    Don't leave just yet!

    Enter your email below to claim your free API key: