Scraping Real Estate Listings From Realtor in Perl

This article will provide a step-by-step walkthrough of code to scrape real estate listings from Realtor.com. The goal is to explain the key concepts and code structure to help beginners understand how to extract data from websites using web scraping.

This is the listings page we are talking about…

Introduction

Below is full Perl code to scrape details about property listings from Realtor.com for the San Francisco area. This includes information like the listing broker, price, number of beds/baths, square footage, etc.

We'll break down exactly how this works, focusing especially on the XPath selectors used to extract the data. This article won't get into debates around the ethics of web scraping - the purpose is solely to understand the code.

Required Modules

The code starts by importing a few key modules:

use LWP::UserAgent;
use HTML::TreeBuilder::XPath;

Briefly:

LWP::UserAgent - Sends HTTP requests and handles responses

HTML::TreeBuilder::XPath - Parses HTML content and allows data selection with XPath

No need to fully understand these now, just know they provide critical web scraping functionality.

Define URL and User Agent

Next we set up the URL to scrape and configure a User-Agent header:

my $url = "<https://www.realtor.com/realestateandhomes-search/San-Francisco_CA>";

my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0..."); // Full browser user agent

This URL points to Realtor.com results for San Francisco. The User-Agent makes our requests look like they came from a real browser.

Send Request and Parse Response

We send a GET request to fetch the page content:

my $response = $ua->get($url);

Then we check that it succeeded with a 200 status code before parsing the HTML:

if ($response->is_success) {

  my $tree = HTML::TreeBuilder::XPath->new_from_content($response->content);

  // ... extract data here ...

}

The $tree will allow us to query the document using XPath selectors.

Extracting Listing Data with XPath

Now we arrive at the key part - using XPath expressions to extract details about each property listing on the results page.

Inspecting the element

When we inspect element in Chrome we can see that each of the listing blocks is wrapped in a div with a class value as shown below…

The listings are contained in div blocks with a class name like:

<div class="BasePropertyCard_propertyCardWrap__J0xUj">

We grab all of them with this XPath selector:

my @listing_blocks = $tree->findnodes('//div[contains(@class,"BasePropertyCard_propertyCardWrap__J0xUj")]');

Let's break this down:

//div - Match any

elements anywhere in the document

contains(@class, "BasePropertyCard_propertyCardWrap__J0xUj") - Filter to only those divs where the class contains this literal name

We'll loop through each listing block and extract details using more XPath queries...

Broker Name

The broker name is stored nested like:

<div class="BrokerTitle_brokerTitle__ZkbBW">
  <span class="BrokerTitle_titleText__20u1P">Keller Williams</span>
</div>

We handle this with:

my $broker_info = $listing_block->findvalue('.//div[contains(@class,"BrokerTitle_brokerTitle__ZkbBW")]');

my ($broker_name) = $broker_info =~ /<span[^>]+class="BrokerTitle_titleText__20u1P"[^>]*>(.+?)<\\/span>/;

Breaking this down:

findvalue() - Returns a string containing raw HTML

//div[contains(@class,"BrokerTitle_brokerTitle__ZkbBW")] - Get broker div

Regular expression extracts content between tags

Listing Status

The status text (e.g. "For Sale") resides in:

<div class="message">For Sale</div>

We grab it through:

my $status = $listing_block->findvalue('.//div[contains(@class,"message")]');

Match on div with class message

Price

Similarly, the price element looks like:

<div class="card-price">$699,000</div>

Extract with:

my $price = $listing_block->findvalue('.//div[contains(@class,"card-price")]');

And so on for other fields like beds, baths, square footage, etc. Each has a div or li containing the data we want. We won't walk through every one, but they follow a similar pattern.

Output Results

Finally, a few print statements neatly output all the extracted details:

print("Broker: $broker\\_name\\n");
print("Status: $status\\n");
print("Price: $price\\n");
// ...

And that's the key functionality covered! The full code can be found in the next section.

While long, this ultimately shows how we can traverse HTML structures and pull data with XPath selectors. With some practice, you'll get the hang of common patterns.

Full Code

Here is the complete code example for reference:

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;

# Define the URL of the Realtor.com search page
my $url = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA";

# Define a User-Agent header
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36");

# Send a GET request to the URL with the User-Agent header
my $response = $ua->get($url);

# Check if the request was successful (status code 200)
if ($response->is_success) {
    # Parse the HTML content of the page using HTML::TreeBuilder::XPath
    my $tree = HTML::TreeBuilder::XPath->new_from_content($response->content);

    # Find all the listing blocks using the provided class name
    my @listing_blocks = $tree->findnodes('//div[contains(@class,"BasePropertyCard_propertyCardWrap__J0xUj")]');

    # Loop through each listing block and extract information
    foreach my $listing_block (@listing_blocks) {
        # Extract the broker information
        my $broker_info = $listing_block->findvalue('.//div[contains(@class,"BrokerTitle_brokerTitle__ZkbBW")]');
        my ($broker_name) = $broker_info =~ /<span[^>]*class="BrokerTitle_titleText__20u1P"[^>]*>(.*?)<\/span>/;

        # Extract the status (e.g., For Sale)
        my $status = $listing_block->findvalue('.//div[contains(@class,"message")]');

        # Extract the price
        my $price = $listing_block->findvalue('.//div[contains(@class,"card-price")]');

        # Extract other details like beds, baths, sqft, and lot size
        my $beds_element = $listing_block->findvalue('.//li[@data-testid="property-meta-beds"]');
        my $baths_element = $listing_block->findvalue('.//li[@data-testid="property-meta-baths"]');
        my $sqft_element = $listing_block->findvalue('.//li[@data-testid="property-meta-sqft"]');
        my $lot_size_element = $listing_block->findvalue('.//li[@data-testid="property-meta-lot-size"]');

        # Check if the elements exist before extracting their text
        my $beds = $beds_element || "N/A";
        my $baths = $baths_element || "N/A";
        my $sqft = $sqft_element || "N/A";
        my $lot_size = $lot_size_element || "N/A";

        # Extract the address
        my $address = $listing_block->findvalue('.//div[contains(@class,"card-address")]');

        # Print the extracted information
        print("Broker: $broker_name\n");
        print("Status: $status\n");
        print("Price: $price\n");
        print("Beds: $beds\n");
        print("Baths: $baths\n");
        print("Sqft: $sqft\n");
        print("Lot Size: $lot_size\n");
        print("Address: $address\n");
        print("-" x 50 . "\n");  # Separating listings
    }
} else {
    print("Failed to retrieve the page. Status code: " . $response->code . "\n");
}

Scraping Real Estate Listings From Realtor in Perl

Introduction

Required Modules

Define URL and User Agent

Send Request and Parse Response

Extracting Listing Data with XPath

Broker Name

Listing Status

Price

Output Results

Full Code

Browse by tags:

Browse by language:

The easiest way to do Web Scraping

Scraping Real Estate Listings From Realtor in Perl

Introduction

Required Modules

Define URL and User Agent

Send Request and Parse Response

Extracting Listing Data with XPath

Broker Name

Listing Status

Price

Output Results

Full Code

The easiest way to do Web Scraping

Don't leave just yet!