Scraping Real Estate Listings From Realtor in R

Jan 9, 2024 · 6 min read

Getting Started

To follow along with this code, you'll need to have R installed along with the rvest and stringr packages. Here is the install code:

install.packages("rvest")
install.packages("stringr")

Now let's jump right into the code...

Overview

This script does the following high-level steps:

  1. Defines the URL to scrape (Realtor.com for San Francisco)
  2. Sends a request to that URL
  3. Checks if the request succeeded
  4. Finds all real estate listing blocks on the page
  5. Loops through each listing block to extract key details

Next we'll break down that last step of extracting data from each listing...

This is the listings page we are talking about…

Extracting Listing Data

The most complex part of this web scraping script is finding and extracting the specific data points about each real estate listing from the HTML.

Inspecting the element

When we inspect element in Chrome we can see that each of the listing blocks is wrapped in a div with a class value as shown below…

Within the loop through each listing block, different HTML selectors are used to locate specific elements and extract their text. Let's break this down selector-by-selector:

Broker Name

The broker name is extracted with this selector:

div.BrokerTitle_brokerTitle___ZkbBW

This finds the outer

tag with a class name that includes BrokerTitle_brokerTitle. Within that is a tag containing the actual text of the broker's name.

The R code uses this selector and extracts the name text like:

broker_info <- html_node(listing_block, "div.BrokerTitle_brokerTitle___ZkbBW")

broker_name <- str_trim(html_text(html_node(broker_info, "span.BrokerTitle_titleText___20u1P")))

It first finds that broker

, then within that finds the name , and extracts the text after trimming whitespace.

Status

The status (e.g. "For Sale") is extracted using:

div.message

This selector finds the

with class message that contains the status text.

The R code extracts it simply:

status <- str_trim(html_text(html_node(listing_block, "div.message")))

Finding that status

and getting its text content.

Price

The price selector is:

div.card-price

Which locates the

with class card-price containing the price text.

Extracted via:

price <- str_trim(html_text(html_node(listing_block, "div.card-price")))

Beds, Baths etc.

The key details like beds and baths use slightly more advanced selectors like:

li[data-testid='property-meta-beds']

This finds

  • tags with a data-testid attribute matching property-meta-beds. This attribute happens to contain the relevant metadata.

    The code checks if these elements exist before extracting their text:

    beds_element <- html_node(listing_block, "li[data-testid='property-meta-beds']")
    
    beds <- ifelse(!is.null(beds_element), str_trim(html_text(beds_element)), "N/A")
    

    First finding that beds

  • , then checking if it exists, and if so extracting the text, otherwise setting beds to "N/A".

    This pattern is followed for baths, square footage, lot size, and any other metadata elements.

    Address

    Finally, the address selector is simple:

    div.card-address
    

    Finding the address

    to extract the text from:

    address <- str_trim(html_text(html_node(listing_block, "div.card-address")))
    

    And that covers all the key fields extracted for each listing! As you can see it relies heavily on using selectors to pinpoint very specific DOM elements and extract just the needed text.

    Full Code

    For reference, here is the full runnable scraper code:

    # Load necessary libraries
    library(rvest)
    library(stringr)
    
    # Define the URL of the Realtor.com search page
    url <- "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA"
    
    # Define a User-Agent header
    user_agent <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    
    # Send a GET request to the URL with the User-Agent header
    page <- read_html(url, user_agent(user_agent))
    
    # Check if the request was successful (status code 200)
    if (http_status(page)$status_code == 200) {
      # Find all the listing blocks using the provided class name
      listing_blocks <- html_nodes(page, "div.BasePropertyCard_propertyCardWrap__J0xUj")
    
      # Loop through each listing block and extract information
      for (listing_block in listing_blocks) {
        # Extract the broker information
        broker_info <- html_node(listing_block, "div.BrokerTitle_brokerTitle__ZkbBW")
        broker_name <- str_trim(html_text(html_node(broker_info, "span.BrokerTitle_titleText__20u1P")))
    
        # Extract the status (e.g., For Sale)
        status <- str_trim(html_text(html_node(listing_block, "div.message")))
    
        # Extract the price
        price <- str_trim(html_text(html_node(listing_block, "div.card-price")))
    
        # Extract other details like beds, baths, sqft, and lot size
        beds_element <- html_node(listing_block, "li[data-testid='property-meta-beds']")
        baths_element <- html_node(listing_block, "li[data-testid='property-meta-baths']")
        sqft_element <- html_node(listing_block, "li[data-testid='property-meta-sqft']")
        lot_size_element <- html_node(listing_block, "li[data-testid='property-meta-lot-size']")
    
        # Check if the elements exist before extracting their text
        beds <- ifelse(!is.null(beds_element), str_trim(html_text(beds_element)), "N/A")
        baths <- ifelse(!is.null(baths_element), str_trim(html_text(baths_element)), "N/A")
        sqft <- ifelse(!is.null(sqft_element), str_trim(html_text(sqft_element)), "N/A")
        lot_size <- ifelse(!is.null(lot_size_element), str_trim(html_text(lot_size_element)), "N/A")
    
        # Extract the address
        address <- str_trim(html_text(html_node(listing_block, "div.card-address")))
    
        # Print the extracted information
        cat("Broker:", broker_name, "\n")
        cat("Status:", status, "\n")
        cat("Price:", price, "\n")
        cat("Beds:", beds, "\n")
        cat("Baths:", baths, "\n")
        cat("Sqft:", sqft, "\n")
        cat("Lot Size:", lot_size, "\n")
        cat("Address:", address, "\n")
        cat("-" * 50, "\n")  # Separating listings
      }
    } else {
      cat("Failed to retrieve the page. Status code:", http_status(page)$status_code, "\n")
    }

  • Tired of getting blocked while scraping the web?

    ProxiesAPI handles headless browsers and rotates proxies for you.
    Get access to 1,000 free API credits, no credit card required!