Scraping Real Estate Listings From Realtor with PHP

Jan 9, 2024 · 8 min read

In this article, we will learn how to scrape real estate listings from Realtor.com using PHP and cURL. The full code is provided at the end for reference as we go through each section in detail.

We will cover:

  • Installing requirements like PHP and cURL
  • Defining the target URL
  • Setting headers like the user agent
  • Initializing and configuring cURL
  • Sending the request and handling the response
  • Using DOMDocument and XPath to extract data
  • Outputting and formatting the scraped listing details
  • This is the listings page we are talking about…

    Let's get started!

    Installation

    Before running the scraping script, some prerequisites need to be installed:

    Installing/Enabling cURL

    This script also requires the cURL extension for PHP which allows making HTTP requests in PHP.

    cURL may already be enabled depending on your PHP installation. You can check if it is enabled by running:

    php -m | grep curl
    

    If you see "curl" in the output, then cURL is installed and you should be good to go.

    If cURL is not enabled, follow these instructions to install and enable it for PHP.

    In short, on Ubuntu/Debian:

    sudo apt-get install php-curl
    sudo service apache2 restart
    

    And the cURL extension should now be enabled.

    Okay, with the prerequisites out of the way, let's look at the code!

    Defining the Target URL

    First, we define the root URL of the Realtor.com page we want to scrape:

    $url = "<https://www.realtor.com/realestateandhomes-search/San-Francisco\\_CA>";
    

    This URL targets real estate listings in San Francisco.

    When web scraping, always double check you have the right homepage/search URL to scrape from! This saves trouble later.

    Setting Request Headers with User-Agent

    Many websites block default browser user agents these days to prevent scraping. So it's important to spoof a browser:

    $headers = [
    "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    ];
    

    We define a heades array containing a typical Chrome desktop user agent string. This mimics a real browser visit to avoid blocks.

    There are services where you can generate browser user agents for free. Or grab them from your browsers dev tools Network tab on a site.

    Initializing and Configuring cURL

    With target prepared, we can initialize a cURL session which will handle connecting to the page and transferring data:

    $ch = curl_init();
    

    Next we set a number of cURL options to configure our request:

    curl_setopt($ch, CURLOPT_URL, $url);
    
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    

    Let's understand them:

  • CURLOPT_URL: The target request URL we defined earlier.
  • CURLOPT_HTTPHEADER: Sets custom headers, here our user agent string.
  • CURLOPT_RETURNTRANSFER: Returns output directly instead of printing, to allow catching in a variable.
  • These are the main options needed. There are many more available to configure proxies, authentication, cookies, etc.

    Sending the Request and Processing Output

    With cURL initialized, headers set, and options configured - we send the GET request:

    $response = curl_exec($ch);
    

    This executes the cURL request defined by our handle $ch and returns output in $response.

    We can check if the request succeeded:

    if ($response && curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
       // Request succeeded!
    } else {
       // Request failed
    }
    

    Here we:

    1. Check if $response contains a value
    2. Use curl_getinfo() to get the HTTP status code and compare to 200 OK

    This protects against failed requests down the line.

    Parsing Listings with DOMDocument

    With HTML response stored, we can parse it to extract data. We'll use PHP's DOMDocument and DOMXPath to query elements:

    $dom = new DOMDocument();
    $dom->loadHTML($response);
    
    $xpath = new DOMXPath($dom);
    

    This:

    1. Creates a new DOMDocument
    2. Loads our HTML response to parse
    3. Gets an XPath object to query DOM elements

    XPath allows querying elements by CSS selector like syntax.

    Extracting Listing Data

    Now the fun part - using XPath to extract key real estate listing details!

    Inspecting the element

    When we inspect element in Chrome we can see that each of the listing blocks is wrapped in a div with a class value as shown below…

    First we locate all listing container elements on the page:

    $listing_blocks = $xpath->query('//div[contains(@class, "BasePropertyCard_propertyCardWrap__J0xUj")]');
    

    This queries DIVs with the class BasePropertyCard_propertyCardWrap__J0xUj which contain data for individual properties.

    Note: We check for the class name containing that string since it can be minified or changed compared to inspecting the element in browser dev tools.

    With listing containers selected, we loop through them:

    foreach ($listing_blocks as $listing_block) {
    
      // Extract data from $listing_block
    
    }
    

    And use XPath queries within each listing block to extract details, e.g.:

    Get broker name:

    $broker_info = $xpath->query('.//div[contains(@class, "BrokerTitle_brokerTitle__ZkbBW")]', $listing_block);
    
    $broker_name = $broker_info->textContent;
    

    Get number of beds:

    $beds = $xpath->query('.//li[@data-testid="property-meta-beds"]', $listing_block)->textContent;
    

    And similarly for other fields like status, price, baths, etc. targeted by class/attribute values.

    Note again how we match partial class names and check attributes to create resilient selectors.

    The key is properly crafting XPath queries to extract intended listing data from the DOM.

    Output and Formatting

    Finally, we output and format the scraped details:

    echo "Broker: " . $broker_name . "\\n";
    
    echo "Beds: " . $beds . "\\n";
    
    // And so on...
    
    echo str_repeat("-", 50) . "\\n"; // Separator
    

    Here we:

  • Echo the data fields, concatenating labels
  • Add newlines between fields for readability
  • Print a dashed line to separate individual listings
  • The output can then be processed further based on the use case - maybe store to CSV or JSON, insert to a database, etc.

    And that covers a typical web scraping workflow!

    Full Code

    For reference, here is the full code:

    <?php
    // Define the URL of the Realtor.com search page
    $url = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA";
    
    // Define a User-Agent header
    $headers = [
        "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"  // Replace with your User-Agent string
    ];
    
    // Initialize cURL session
    $ch = curl_init();
    
    // Set cURL options
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    
    // Send a GET request to the URL with the User-Agent header
    $response = curl_exec($ch);
    
    // Check if the request was successful (HTTP status code 200)
    if ($response && curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
        // Parse the HTML content of the page using DOMDocument and DOMXPath
        $dom = new DOMDocument();
        libxml_use_internal_errors(true); // Suppress parsing errors
        $dom->loadHTML($response);
        libxml_clear_errors();
    
        // Create an XPath object
        $xpath = new DOMXPath($dom);
    
        // Find all the listing blocks using the provided class name
        $listing_blocks = $xpath->query('//div[contains(@class, "BasePropertyCard_propertyCardWrap__J0xUj")]');
    
        // Loop through each listing block and extract information
        foreach ($listing_blocks as $listing_block) {
            // Extract the broker information
            $broker_info = $xpath->query('.//div[contains(@class, "BrokerTitle_brokerTitle__ZkbBW")]', $listing_block)->item(0);
            $broker_name = $xpath->query('.//span[contains(@class, "BrokerTitle_titleText__20u1P")]', $broker_info)->item(0)->textContent;
    
            // Extract the status (e.g., For Sale)
            $status = $xpath->query('.//div[contains(@class, "message")]', $listing_block)->item(0)->textContent;
    
            // Extract the price
            $price = $xpath->query('.//div[contains(@class, "card-price")]', $listing_block)->item(0)->textContent;
    
            // Extract other details like beds, baths, sqft, and lot size
            $beds_element = $xpath->query('.//li[@data-testid="property-meta-beds"]', $listing_block)->item(0);
            $baths_element = $xpath->query('.//li[@data-testid="property-meta-baths"]', $listing_block)->item(0);
            $sqft_element = $xpath->query('.//li[@data-testid="property-meta-sqft"]', $listing_block)->item(0);
            $lot_size_element = $xpath->query('.//li[@data-testid="property-meta-lot-size"]', $listing_block)->item(0);
    
            // Check if the elements exist before extracting their text
            $beds = $beds_element ? $beds_element->textContent : "N/A";
            $baths = $baths_element ? $baths_element->textContent : "N/A";
            $sqft = $sqft_element ? $sqft_element->textContent : "N/A";
            $lot_size = $lot_size_element ? $lot_size_element->textContent : "N/A";
    
            // Extract the address
            $address = $xpath->query('.//div[contains(@class, "card-address")]', $listing_block)->item(0)->textContent;
    
            // Print the extracted information
            echo "Broker: " . trim($broker_name) . "\n";
            echo "Status: " . trim($status) . "\n";
            echo "Price: " . trim($price) . "\n";
            echo "Beds: " . trim($beds) . "\n";
            echo "Baths: " . trim($baths) . "\n";
            echo "Sqft: " . trim($sqft) . "\n";
            echo "Lot Size: " . trim($lot_size) . "\n";
            echo "Address: " . trim($address) . "\n";
            echo str_repeat("-", 50) . "\n";  // Separating listings
        }
    
    } else {
        echo "Failed to retrieve the page. Status code: " . curl_getinfo($ch, CURLINFO_HTTP_CODE) . "\n";
    }
    
    // Close the cURL session
    curl_close($ch);
    ?>

    Browse by tags:

    Browse by language:

    Tired of getting blocked while scraping the web?

    ProxiesAPI handles headless browsers and rotates proxies for you.
    Get access to 1,000 free API credits, no credit card required!