Scraping Real Estate Listings From Realtor with Ruby

Jan 9, 2024 · 5 min read

Web scraping can be a valuable skill for extracting publicly available data from websites. This article will explain how to use Ruby and the Nokogiri and HTTParty gems to scrape real estate listing data from Realtor.com.

This is the listings page we are talking about…

Installation

Before running the code, you'll need to install Ruby, Nokogiri, and HTTParty:

# Install Ruby
$ sudo apt install ruby-full

# Install bundler
$ gem install bundler

# Install Nokogiri
$ gem install nokogiri

# Install HTTParty
$ gem install httparty

When installed correctly, you should see version information when checking:

$ ruby -v
ruby 2.7.1p83

$ gem list | grep nokogiri
nokogiri (1.13.10)

$ gem list | grep httparty
httparty (0.20.0)

Now we're ready to run the scraper!

Understanding the Script Flow

Let's break down what the script is doing at a high level:

  1. Require the nokogiri and httparty libraries we installed
  2. Define the target URL to scrape (https://www.realtor.com/...)
  3. Set a user agent header to look like a real web browser
  4. Make an HTTP GET request to fetch the page content
  5. Check if the request succeeded (status code 200)
  6. If successful, parse the HTML content using Nokogiri
  7. Find all listing blocks by their CSS class name
  8. Loop through each listing block
  9. Extract data like price, beds, broker name etc. by CSS selector
  10. Print out the extracted data

The key thing to understand is that first we fetch the page HTML, then we use CSS selectors to pinpoint specific data elements in that HTML and extract their text.

Next let's see exactly how the data extraction works.

Extracting Data from Listing Blocks

Inspecting the element

When we inspect element in Chrome we can see that each of the listing blocks is wrapped in a div with a class value as shown below…

The most complex part of this scraper is extracting multiple data fields from each listing block.

Here is the loop that processes each listing:

listing_blocks.each do |listing_block|

  # Extract the broker information
  broker_info = listing_block.css(".BrokerTitle_brokerTitle__ZkbBW")
  broker_name = broker_info.css(".BrokerTitle_titleText__20u1P").text.strip()

  # Extract price
  price = listing_block.css(".card-price").text.strip()

  # Extract beds, baths
  beds_element = listing_block.css("[data-testid='property-meta-beds']")
  baths_element = listing_block.css("[data-testid='property-meta-baths']")

  # Extract address
  address = listing_block.css(".card-address").text.strip()
end

Let's look at how the broker name field is extracted:

broker_info = listing_block.css(".BrokerTitle_brokerTitle__ZkbBW")
broker_name = broker_info.css(".BrokerTitle_titleText__20u1P").text.strip()

First, we use the CSS selector .BrokerTitle_brokerTitle__ZkbBW to find the block containing broker info. Then we drill down one level further with .BrokerTitle_titleText__20u1P to pinpoint just the broker name text element. Finally, we call .text to extract the text content and .strip() to clean surrounding whitespace.

The price extraction follows the same approach:

price = listing_block.css(".card-price").text.strip()

We use the .card-price CSS class to target the price text specifically.

One tricky part is handling fields like beds and baths that may be missing:

beds_element = listing_block.css("[data-testid='property-meta-beds']")

beds = beds_element.text.strip() unless beds_element.empty?

First we try to select the beds element. Then we check if that returned no results by calling .empty? Before extracting text, otherwise we'd get an error.

Full Code Example

Below is the full runnable script to scrape Realtor listings:

require 'nokogiri'
require 'httparty'

# Define the URL of the Realtor.com search page
url = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA"

# Define a User-Agent header
headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"  # Replace with your User-Agent string
}

# Send a GET request to the URL with the User-Agent header
response = HTTParty.get(url, headers: headers)

# Check if the request was successful (status code 200)
if response.code == 200
  # Parse the HTML content of the page using Nokogiri
  doc = Nokogiri::HTML(response.body)

  # Find all the listing blocks using the provided class name
  listing_blocks = doc.css(".BasePropertyCard_propertyCardWrap__J0xUj")

  # Loop through each listing block and extract information
  listing_blocks.each do |listing_block|
    # Extract the broker information
    broker_info = listing_block.css(".BrokerTitle_brokerTitle__ZkbBW")
    broker_name = broker_info.css(".BrokerTitle_titleText__20u1P").text.strip()

    # Extract the status (e.g., For Sale)
    status = listing_block.css(".message").text.strip()

    # Extract the price
    price = listing_block.css(".card-price").text.strip()

    # Extract other details like beds, baths, sqft, and lot size
    beds_element = listing_block.css("[data-testid='property-meta-beds']")
    baths_element = listing_block.css("[data-testid='property-meta-baths']")
    sqft_element = listing_block.css("[data-testid='property-meta-sqft']")
    lot_size_element = listing_block.css("[data-testid='property-meta-lot-size']")

    # Check if the elements exist before extracting their text
    beds = beds_element.text.strip() unless beds_element.empty?
    baths = baths_element.text.strip() unless baths_element.empty?
    sqft = sqft_element.text.strip() unless sqft_element.empty?
    lot_size = lot_size_element.text.strip() unless lot_size_element.empty?

    # Extract the address
    address = listing_block.css(".card-address").text.strip()

    # Print the extracted information
    puts "Broker: #{broker_name}"
    puts "Status: #{status}"
    puts "Price: #{price}"
    puts "Beds: #{beds || 'N/A'}"
    puts "Baths: #{baths || 'N/A'}"
    puts "Sqft: #{sqft || 'N/A'}"
    puts "Lot Size: #{lot_size || 'N/A'}"
    puts "Address: #{address}"
    puts "-" * 50  # Separating listings
  end
else
  puts "Failed to retrieve the page. Status code: #{response.code}"
end

The key things to remember are:

  • Use CSS selectors to target specific text elements
  • Handle missing data gracefully with checks
  • Don't modify the literal class/ID strings as they depend on the live site markup
  • With some debugging and tweaking, you can adapt this approach to build scrapers extracting all sorts of data!

    Browse by tags:

    Browse by language:

    Tired of getting blocked while scraping the web?

    ProxiesAPI handles headless browsers and rotates proxies for you.
    Get access to 1,000 free API credits, no credit card required!