Web Scraping Yelp Business Listings with Rust

Dec 6, 2023 · 7 min read

Introduction

Extracting information from websites can feed market research, recommendation systems, and similar projects. In this beginner-friendly guide, we'll walk through scraping Yelp business listings with the Rust programming language, from setting up your development environment to extracting and displaying business details.

The page we are scraping is Yelp's search results for Chinese restaurants in San Francisco.

Prerequisites

Before we dive into the code, let's ensure you have everything you need:

  1. Rust Installed: Make sure you have Rust installed on your system. If not, you can get it from the official Rust website.
  2. Dependencies: Add the required crates to your Cargo.toml file so that Cargo, Rust's package manager, can fetch them. You'll need reqwest, scraper, and urlencoding, plus tokio, because the code runs inside an async main function marked with #[tokio::main].
  3. ProxiesAPI Premium Account: To bypass Yelp's anti-bot mechanisms, you'll need a valid ProxiesAPI premium account and API key. Premium proxies are essential for this task, as they provide a level of anonymity and reliability that free proxies cannot guarantee.
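
A dependency section for this guide might look like the sketch below. The version numbers and the tokio feature list are assumptions chosen for illustration, not taken from the article; check crates.io for current releases before copying them.

```toml
[dependencies]
reqwest = "0.11"
scraper = "0.18"
urlencoding = "2"
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
```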

Now, let's move on to the code explanation.

Code Explanation

Before we jump into the code, let's briefly explain what each library does and understand the structure of Yelp's business listings page.

  • reqwest: This library allows us to make HTTP requests to websites. We will use it to fetch the Yelp page.
  • scraper: Scraper is a Rust crate for parsing and querying HTML documents. It helps us extract data from the HTML content.
  • urlencoding: This library is used to properly encode the URL.

Now, let's explore Yelp's business listings structure. Yelp's business listings are typically structured with various elements for each business, including the business name, rating, price range, and more. We'll be targeting these elements in our code.

    <div class="arrange-unit__09f24__rqHTg arrange-unit-fill__09f24__CUubG css-1qn0b6x">
        <a class="css-19v1rkv">Business Name</a>
        <span class="css-gutk1c">Rating</span>
        <span class="priceRange__09f24__mmOuH">Price Range</span>
        <span class="css-chan6m">Number of Reviews</span>
        <span class="css-chan6m">Location</span>
    </div>
    

Now, let's break down the code into step-by-step instructions.

Step-by-Step Guide

Step 1: Import Dependencies

    use reqwest;
    use scraper::{Html, Selector};
    use urlencoding::encode;
    
    #[tokio::main]
    async fn main() {
        // Rest of the code goes here
    }
    

Step 2: Set the Yelp URL

    let url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA";
    

In this step, we set the target URL to search for Chinese restaurants in San Francisco.

Step 3: Encode the URL

    let encoded_url = encode(url);
    

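URL encoding matters here because the Yelp URL is passed to ProxiesAPI as the value of a query parameter. Without percent-encoding, characters such as ?, &, and = in the Yelp URL would be read as part of the outer API URL.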
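
To make the effect of encoding concrete, here is a minimal std-only sketch. The percent_encode function is a hand-rolled illustration, not the urlencoding crate itself; it assumes the usual rule of leaving only letters, digits, and the characters - _ . ~ untouched, which to our understanding matches what urlencoding::encode does.

```rust
/// Percent-encodes every byte except ASCII letters, digits, and `-`, `_`, `.`, `~`.
/// Hypothetical helper for illustration; the guide's code calls `urlencoding::encode`.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(byte as char)
            }
            // Everything else becomes %XX with uppercase hex digits.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

fn main() {
    let target = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA";
    let encoded = percent_encode(target);
    // The `?`, `&`, and `=` of the target are escaped, so they can no longer
    // be mistaken for separators of the outer proxy URL.
    println!("{}", encoded);
}
```

After encoding, the whole Yelp URL travels as one opaque value in the url parameter.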

Step 4: Prepare the API URL

    let api_url = format!("http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url={}", encoded_url);
    

We create the API URL by combining the ProxiesAPI endpoint, your premium authentication key (replace YOUR_AUTH_KEY with your own), and the encoded Yelp URL. This URL will be used to route our request through premium proxies.
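
One way to avoid hardcoding the key is to read it from the environment. The sketch below assumes a hypothetical environment variable named PROXIESAPI_AUTH_KEY and a hypothetical helper build_api_url; neither comes from ProxiesAPI's documentation. The endpoint and query parameters are the ones used in this guide.

```rust
use std::env;

/// Builds the proxy request URL from an auth key and an already-encoded target URL.
/// Hypothetical helper; the endpoint and parameters mirror the guide's format string.
fn build_api_url(auth_key: &str, encoded_target: &str) -> String {
    format!(
        "http://api.proxiesapi.com/?premium=true&auth_key={}&url={}",
        auth_key, encoded_target
    )
}

fn main() {
    // PROXIESAPI_AUTH_KEY is an assumed variable name chosen for this sketch.
    // Fall back to the placeholder so the example still runs without it.
    let auth_key =
        env::var("PROXIESAPI_AUTH_KEY").unwrap_or_else(|_| "YOUR_AUTH_KEY".to_string());
    let api_url = build_api_url(&auth_key, "https%3A%2F%2Fwww.yelp.com%2Fsearch");
    println!("{}", api_url);
}
```

Keeping the key out of the source file also makes it less likely to end up in version control.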

Step 5: Create a Reqwest Client

    let client = reqwest::Client::new();


We create a Reqwest client, which will be responsible for making HTTP requests. The browser-like headers are attached to the request itself in the next step.

Step 6: Send the Request

    let res = client.get(&api_url)
        .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
        .header("Accept-Language", "en-US,en;q=0.5")
        .header("Accept-Encoding", "gzip, deflate, br")
        .header("Referer", "https://www.google.com/")
        .send()
        .await;
    

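In this step, we send an HTTP GET request using the client we created earlier, attaching headers that make the request look like it comes from a regular browser. The call returns a Result, stored in res, which we handle in the next step. One caveat: because the Accept-Encoding header advertises gzip, deflate, and br, the response may arrive compressed, and as far as we know reqwest only decompresses it automatically when built with the matching Cargo features (gzip, deflate, brotli).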

Step 7: Parse HTML

    match res {
        Ok(response) => {
            if response.status().is_success() {
                let body = response.text().await.unwrap();
                parse_html(&body);
            } else {
                println!("Failed to retrieve data. Status Code: {:?}", response.status());
            }
        }
        Err(e) => println!("Request failed: {:?}", e),
    }
    

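Because res is a Result, we first match on it to separate a failed request from a received response. If the response status is successful, we read the body as text and pass it to parse_html, a helper defined in the full listing at the end of this guide, which parses the HTML with the scraper library.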

Step 8: Iterate through Listings

Inspecting the page

When we inspect the page, we can see that the listing div has the classes arrange-unit__09f24__rqHTg, arrange-unit-fill__09f24__CUubG, and css-1qn0b6x. In a CSS selector, these classes are joined with dots:

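    // `html` is the response body passed in to parse_html
    let document = Html::parse_document(html);
    let listings_selector = Selector::parse("div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x").unwrap();
    let listings = document.select(&listings_selector);

    for listing in listings {
        // Extract business details
        // Print the details to the console
    }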
    

We iterate through the scraped listings, targeting specific elements with selectors, and extract business details such as the name, rating, price range, number of reviews, and location. The generated class names are tied to Yelp's markup at the time of writing and may change.

Step 9: Display Results

    println!("Business Name: {}", business_name);
    println!("Rating: {}", rating);
    println!("Number of Reviews: {}", num_reviews);
    println!("Price Range: {}", price_range);
    println!("Location: {}", location);
    println!("==============================");
    

Finally, we print the extracted information to the console.

Next Steps and Further Learning

Web scraping opens up a world of possibilities for data gathering and analysis. Here are some next steps and resources for further learning:

  • Data Storage: Learn how to save scraped data to a database or a file for future analysis.
  • Building Web Scraping Projects: Create more complex web scraping projects to enhance your skills.
  • Ethical Considerations: Always scrape responsibly and respect website terms of service and robots.txt files.
  • Rust and Web Scraping: Dive deeper into Rust and explore more advanced web scraping techniques.
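
As a starting point for the data storage step, here is a minimal std-only sketch that writes rows as CSV. The column names, the csv_field and write_csv helpers, and the sample row are all assumptions made for illustration; a dedicated CSV crate would be a reasonable alternative for anything larger.

```rust
use std::io::{self, Write};

/// Quotes a CSV field when it contains a comma, quote, or line break,
/// doubling any embedded quotes.
fn csv_field(value: &str) -> String {
    if value.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", value.replace('"', "\"\""))
    } else {
        value.to_string()
    }
}

/// Writes a header line plus one line per row to any `Write` target.
fn write_csv<W: Write>(mut out: W, rows: &[[&str; 5]]) -> io::Result<()> {
    writeln!(out, "name,rating,reviews,price_range,location")?;
    for row in rows {
        let fields: Vec<String> = row.iter().map(|f| csv_field(f)).collect();
        writeln!(out, "{}", fields.join(","))?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Made-up sample row, not real Yelp data.
    let rows = [["Example Kitchen", "4.5", "120", "$$", "Chinatown, SF"]];
    write_csv(io::stdout().lock(), &rows)
}
```

Because write_csv accepts any Write target, passing the result of std::fs::File::create instead of stdout saves the same output to a file.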

Here is the full code:

    use reqwest;
    use scraper::{Html, Selector};
    use urlencoding::encode;
    
    #[tokio::main]
    async fn main() {
        let url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA";
        let encoded_url = encode(url);
        let api_url = format!("http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url={}", encoded_url);
    
        let client = reqwest::Client::new();
        let res = client.get(&api_url)
            .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
            .header("Accept-Language", "en-US,en;q=0.5")
            .header("Accept-Encoding", "gzip, deflate, br")
            .header("Referer", "https://www.google.com/")
            .send()
            .await;
    
        match res {
            Ok(response) => {
                if response.status().is_success() {
                    let body = response.text().await.unwrap();
                    parse_html(&body);
                } else {
                    println!("Failed to retrieve data. Status Code: {:?}", response.status());
                }
            },
            Err(e) => println!("Request failed: {:?}", e),
        }
    }
    
    fn parse_html(html: &str) {
        let document = Html::parse_document(html);
        let listings_selector = Selector::parse("div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x").unwrap();
        let business_name_selector = Selector::parse("a.css-19v1rkv").unwrap();
        let rating_selector = Selector::parse("span.css-gutk1c").unwrap();
        let price_range_selector = Selector::parse("span.priceRange__09f24__mmOuH").unwrap();
        let span_selector = Selector::parse("span.css-chan6m").unwrap();
    
        let listings = document.select(&listings_selector);
    
        for listing in listings {
            let business_name = listing.select(&business_name_selector).next().map(|e| e.inner_html()).unwrap_or("N/A".to_string());
            let rating = listing.select(&rating_selector).next().map(|e| e.inner_html()).unwrap_or("N/A".to_string());
            let price_range = listing.select(&price_range_selector).next().map(|e| e.inner_html()).unwrap_or("N/A".to_string());
            
            let span_elements: Vec<_> = listing.select(&span_selector).collect();
            let num_reviews = match span_elements.get(0) {
                Some(element) => element.inner_html().trim().to_string(),
                None => "N/A".to_string(),
            };
            let location = match span_elements.get(1) {
                Some(element) => element.inner_html().trim().to_string(),
                None => "N/A".to_string(),
            };
    
            println!("Business Name: {}", business_name);
            println!("Rating: {}", rating);
            println!("Number of Reviews: {}", num_reviews);
            println!("Price Range: {}", price_range);
            println!("Location: {}", location);
            println!("==============================");
        }
    }
