Scraping Yelp Business Listings in NodeJS

Dec 6, 2023 · 9 min read

Scenario

Imagine you're a business analyst who wants to collect data on Chinese restaurants in San Francisco from Yelp. You want to extract information such as the business name, rating, number of reviews, price range, and location. This data can be used for market research, competitor analysis, or other business insights.

This is the Yelp search results page we'll be working with:

Step-by-Step Guide

Step 1: Introduction to Web Scraping and Yelp

Web scraping is the process of extracting data from websites. It's valuable for various purposes, including research, analysis, and data collection. Yelp is a popular platform for finding information about businesses, making it a prime candidate for scraping data like business listings.

To scrape Yelp, we'll use Node.js, Axios for making HTTP requests, and Cheerio for parsing HTML. However, scraping Yelp directly can be challenging due to anti-bot mechanisms in place. This is where premium proxies come in.

Why Premium Proxies?

Yelp employs anti-bot measures to prevent automated scraping. These measures can include IP blocking, CAPTCHAs, and rate limiting. Premium proxies provide a way to bypass these restrictions by routing your requests through different IP addresses, making it harder for Yelp to detect and block your scraping activities.

In the provided code, we'll utilize premium proxies from ProxiesAPI to access Yelp data without interruptions.

Step 2: Setting Up Your Environment

Before diving into the code, you need to set up your environment. Here are the steps:

Installation:

  1. Node.js: If you haven't already, install Node.js from the official website (https://nodejs.org/). This will allow you to run JavaScript code on your machine.
  2. Axios: Install Axios, a library for making HTTP requests, with npm (the exact commands follow this list).
  3. Cheerio: Install Cheerio, a library for parsing HTML, in the same way.
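
From your project directory, install both packages with npm:

npm install axios
npm install cheerio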

Obtaining a ProxiesAPI Key:

  1. ProxiesAPI Key: Go to the ProxiesAPI website (https://proxiesapi.com/) and sign up for an account. Once registered, you'll receive an API key, which is essential for using premium proxies.

With your environment set up and the API key in hand, you're ready to proceed with the code.

Step 3: Understanding the Code

Now, let's break down the code step by step, explaining its purpose and functionality. (The full listing appears at the end of this article.)

Importing Libraries:

The code starts by importing the necessary libraries: Axios for making HTTP requests and Cheerio for parsing HTML.

const axios = require('axios');
const cheerio = require('cheerio');

Defining the Yelp URL:

Next, we define the URL of the Yelp search page for Chinese restaurants in San Francisco:

const url = "<https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA>";

Encoding the URL:

To ensure the URL is properly formatted for use in an HTTP request, we encode it:

const encodedUrl = encodeURIComponent(url);
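
Encoding matters because the Yelp URL contains its own query string; without it, find_desc and find_loc would be treated as parameters of the proxy API call rather than of the Yelp search. For reference, the encoded value looks like this:

// encodedUrl:
// https%3A%2F%2Fwww.yelp.com%2Fsearch%3Ffind_desc%3Dchinese%26find_loc%3DSan%2BFrancisco%252C%2BCA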

Constructing the API URL:

We construct the API URL for ProxiesAPI by combining the encoded Yelp URL and your API key:

const apiKey = "YOUR_API_KEY";
const apiURL = `http://api.proxiesapi.com/?premium=true&auth_key=${apiKey}&url=${encodedUrl}`;

Simulating a Browser Request:

To mimic a browser request, we define user-agent headers that include details like the browser's user agent, accepted languages, and more:

const headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "<https://www.google.com/>",  // Simulate a referrer
};

Sending an HTTP GET Request:

The core of the code is an HTTP GET request to the ProxiesAPI URL with the headers we defined:

axios.get(apiURL, { headers })
    .then(response => {
        // Check if the request was successful (status code 200)
        if (response.status === 200) {
            // ... (Code for parsing and extracting data)
        } else {
            console.log(`Failed to retrieve data. Status Code: ${response.status}`);
        }
    })
    .catch(error => {
        console.error(error);
    });
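
If you prefer async/await over a promise chain, the same request can be written as the sketch below; it is equivalent to the call above, and the fetchYelpPage name is just illustrative:

// Equivalent request using async/await
async function fetchYelpPage() {
    try {
        const response = await axios.get(apiURL, { headers });
        if (response.status === 200) {
            // ... (Code for parsing and extracting data)
        } else {
            console.log(`Failed to retrieve data. Status Code: ${response.status}`);
        }
    } catch (error) {
        console.error(error);
    }
}

fetchYelpPage();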

At this point, we have successfully set up our environment and prepared the code to make requests to Yelp through premium proxies. In the next steps, we'll dive into how the code extracts business listings from Yelp.

Step 4: Extracting Business Listings

The code uses Cheerio to parse the HTML content of the Yelp page and extract information from business listings. Let's explore how this works.

Loading HTML with Cheerio:

Inside the then block of the Axios request, we load the HTML content into Cheerio:

const $ = cheerio.load(response.data);

Finding Listings:

Yelp's listings are contained within specific HTML elements. We use Cheerio to find all the listings and store them in the listings variable:

Inspecting the page

When we inspect the page, we can see that each listing is wrapped in a div with the classes arrange-unit__09f24__rqHTg, arrange-unit-fill__09f24__CUubG, and css-1qn0b6x:

const listings = $('div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x');
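
These class names are auto-generated and change whenever Yelp updates its frontend, so it's worth printing how many elements the selector matched before going further (the full code at the end does the same):

console.log(listings.length);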

Looping Through Listings:

We then loop through each listing using the each function:

listings.each((index, element) => {
    // ... (Code for extracting and printing information)
});

Extracting Information:

Inside the loop, we extract information such as the business name, rating, price range, number of reviews, and location from each listing:

const businessNameElem = $(element).find('a.css-19v1rkv');
const businessName = businessNameElem.text().trim() || "N/A";

// Similar code for extracting rating, price range, number of reviews, and location
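
The remaining fields follow the same pattern, each with its own class-based selector; these lines are taken from the full code at the end of the article:

const ratingElem = $(element).find('span.css-gutk1c');
const rating = ratingElem.text().trim() || "N/A";

const priceRangeElem = $(element).find('span.priceRange__09f24__mmOuH');
const priceRange = priceRangeElem.text().trim() || "N/A";

// Number of reviews and location both live in span.css-chan6m elements;
// the full code shows how the two are told apart.
const spanElements = $(element).find('span.css-chan6m');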

Printing Information:

Finally, we print the extracted information for each business listing:

console.log("Business Name:", businessName);
console.log("Rating:", rating);
console.log("Number of Reviews:", numReviews);
console.log("Price Range:", priceRange);
console.log("Location:", location);
console.log("===============================");

The code continues to loop through all the listings, extracting and printing information until it processes all the listings on the Yelp page.

Step 5: Handling Data

At this point, you have successfully scraped business information from Yelp. Depending on your needs, you can handle the data in various ways:

  • Save to a File: You can modify the code to save the extracted data to a file, such as a CSV or JSON file, for further analysis (see the sketch after this list).
  • Data Analysis: You can perform data analysis on the extracted information, such as calculating average ratings, identifying trends, or comparing businesses.
  • Database Storage: If you have a database, you can store the scraped data in it for easy retrieval and analysis.
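
As a minimal sketch of the first option, assuming the $ and listings objects from Step 4 are in scope, you could collect each listing into an array and write it out with Node's built-in fs module (the yelp_listings.json filename is just an example):

const fs = require('fs');

// Collect listings into an array instead of only printing them
const results = [];

listings.each((index, element) => {
    const businessName = $(element).find('a.css-19v1rkv').text().trim() || "N/A";
    const rating = $(element).find('span.css-gutk1c').text().trim() || "N/A";
    const priceRange = $(element).find('span.priceRange__09f24__mmOuH').text().trim() || "N/A";
    results.push({ businessName, rating, priceRange });
});

// Write the collected data to a JSON file for later analysis
fs.writeFileSync('yelp_listings.json', JSON.stringify(results, null, 2));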

Step 6: Challenges and Considerations

Web scraping can present challenges, and it's important to be aware of them:

  • Website Structure Changes: Websites like Yelp frequently update their structure, which can break your scraping code. You'll need to monitor your code for potential errors due to changes and update it accordingly.
  • CAPTCHAs: Some websites may present CAPTCHAs to detect and block automated scraping. In such cases, consider using CAPTCHA-solving services or human interaction to overcome this challenge.
  • Ethical Considerations: Always respect a website's terms of service and robots.txt file. Avoid overloading the website with requests and ensure your scraping activities are ethical.

Step 7: Conclusion

In this article, we've walked you through the process of scraping Yelp business listings, from setting up your environment and understanding the code to extracting and handling data. Web scraping can be a powerful tool for gathering information, but it's important to use it responsibly and ethically.

Remember that websites may have terms of service that prohibit or restrict scraping, so always check and adhere to their policies. Additionally, be prepared to adapt your code as websites evolve.

Happy scraping, and may your data analysis endeavors be fruitful!

Full Code

    const axios = require('axios');
    const cheerio = require('cheerio');
    
    // URL of the Yelp search page
    const url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA";
    
    // Encode the URL
    const encodedUrl = encodeURIComponent(url);
    
    // Define your API key
    const apiKey = "YOUR_API_KEY";
    
    // Construct the API URL with the encoded Yelp URL
    const apiURL = `http://api.proxiesapi.com/?premium=true&auth_key=${apiKey}&url=${encodedUrl}`;
    
    // Define a user-agent header to simulate a browser request
    const headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.google.com/",  // Simulate a referrer
    };
    
    // Send an HTTP GET request to the API URL with the headers
    axios.get(apiURL, { headers })
        .then(response => {
            if (response.status === 200) {
                // Load the HTML content into cheerio
                const $ = cheerio.load(response.data);
    
                // Find all the listings
                const listings = $('div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x');
                console.log(listings.length);
                
                // Loop through each listing and extract information
                listings.each((index, element) => {
                    // Extract business name
                    const businessNameElem = $(element).find('a.css-19v1rkv');
                    const businessName = businessNameElem.text().trim() || "N/A";
    
                    // If business name is not "N/A," then print the information
                    if (businessName !== "N/A") {
                        // Extract rating
                        const ratingElem = $(element).find('span.css-gutk1c');
                        const rating = ratingElem.text().trim() || "N/A";
    
                        // Extract price range
                        const priceRangeElem = $(element).find('span.priceRange__09f24__mmOuH');
                        const priceRange = priceRangeElem.text().trim() || "N/A";
    
                        // Extract number of reviews and location
                        const spanElements = $(element).find('span.css-chan6m');
                        let numReviews = "N/A";
                        let location = "N/A";
    
                        // Check if there are at least two span elements
                        if (spanElements.length >= 2) {
                            numReviews = $(spanElements[0]).text().trim();
                            location = $(spanElements[1]).text().trim();
                        } else if (spanElements.length === 1) {
                            // If there's only one span element, check if it's for Number of Reviews or Location
                            const text = $(spanElements[0]).text().trim();
                            if (!isNaN(text)) {
                                numReviews = text;
                            } else {
                                location = text;
                            }
                        }
    
                        // Print the extracted information
                        console.log(`Business Name: ${businessName}`);
                        console.log(`Rating: ${rating}`);
                        console.log(`Number of Reviews: ${numReviews}`);
                        console.log(`Price Range: ${priceRange}`);
                        console.log(`Location: ${location}`);
                        console.log("===============================");
                    }
                });
            } else {
                console.log(`Failed to retrieve data. Status Code: ${response.status}`);
            }
        })
        .catch(error => {
            console.error(error);
        });
