Have you ever wanted to analyze real estate listing data from sites like Realtor.com? Web scraping using tools like Jsoup provide a powerful way to automatically extract that data for custom analysis.

This is the listings page we are talking about…

In this comprehensive guide for beginners, you'll learn step-by-step how to use Jsoup to scrape key details from Realtor listings in Kotlin. Follow along to get hands-on practice with core scraping concepts like:

Crafting a GET request and user agent

Selecting HTML elements with CSS selectors

Extracting and transforming text from elements

Dealing with missing data

We'll go extremely in-depth on the most complex part - the CSS selectors - since that's what allows extracting the right data points.

By the end, you'll understand how each piece of this scraper works to extract details like:

Listing broker name

Status (For Sale, etc)

Price

Beds

Baths

Square footage

Lot size

Full address

...and more from any result on a Realtor.com search page!

The code can work as-is to scrape listings for any city, while what you learn will apply to building scrapers for virtually any site.

Let's dig in!

Scraping Overview

First, what exactly happens in this program?

At a high level, it:

Sends a GET request to retrieve the HTML from a Realtor search URL
Parses the HTML
Uses CSS selectors to pinpoint specific elements
Extracts text from those elements 5.Outputs the extracted details

That's web scraping in a nutshell! It allows gathering structured data from sites even if they don't have public APIs.

Now let's break down the process for this specific scraper...

Importing Jsoup

Jsoup handles most of the heavy lifting for us. To start, we import it:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element

This gives access to all the scraping functionality.

Jsoup - Main class for making requests/selecting elements

Document - Represents the entire HTML document

Element - An individual HTML element

We'll see examples later of how these work together.

Defining the Target URL

The first step is choosing what page to scrape. In this case, we want results from a Realtor search:

val url = "<https://www.realtor.com/realestateandhomes-search/San-Francisco_CA>"

This URL will contain many individual property listings we want to extract details from.

Good to Know Tip: You can tweak the location in the URL to scrape other cities!

Creating a User Agent Header

Websites detect programmatic access like scrapers by looking for missing browser characteristics like user agent strings.

To mimic a real browser, we pass a user agent header:

val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"

This spoofed string matches Chrome on Windows 10.

Pro Tip: I recommend services like https://www.whatismybrowser.com to get your actual browser's user agent string for scraping to better blend in!

Fetching the Page HTML

With the target URL and user agent defined, Jsoup can now grab the page HTML:

val doc = Jsoup.connect(url).userAgent(userAgent).get()

Breaking this down:

Jsoup.connect(url) - Configures a connection to fetch the given URL

.userAgent(userAgent) - Sets the user agent string

.get() - Sends the request and retrieves the HTML

The returned doc contains the entire source code of the page. Think of it like an object representation of the raw HTML.

We can check if (doc != null) to confirm it loaded properly before scraping.

Lightbulb Moment: The user agent header tricks Realtor.com into thinking the request came from a real browser!

Selecting Elements to Scrape

Here's where the real magic happens - using CSS selectors to pinpoint parts of the page to extract.

Understanding CSS Selectors

CSS selectors allow targeting HTML elements by their id, class, tag name, attributes, hierarchy in the DOM tree and more.

Some examples:

/* Target element with id property matching "myId" */
#myId

/* Target elements with class name matching "myClass" */
.myClass

/* Target h1 elements */
h1

/* Target elements with attribute name="value" */
[name="value"]

Jsoup translates these selectors into matching Java Element objects from the HTML.

The full syntax and possibilities get complex, but as you'll see, just a few selector types cover most scraping cases!

Selecting Realtor Listing Blocks

Inspecting the element

When we inspect element in Chrome we can see that each of the listing blocks is wrapped in a div with a class value as shown below…

Now back to our program...

The first thing we want is each overall listing block from the page. Realtor conveniently marks these with a class name:

val listingBlocks = doc.select("div.BasePropertyCard\\_propertyCardWrap\\_\\_J0xUj")

Breaking this selector down:

doc - The Document representing the full HTML

select() - Finds Elements matching the CSS selector

"div.BasePropertyCard\\_propertyCardWrap\\_\\_J0xUj" - Selector for the listing block elements

This matches all

tags with that exact class name in the HTML.

Jsoup returns a list of Element objects representing each matching block.

Pro Tip: Most scrapers start by identifying these wrapper containers around the actual target data.

Extracting Listing Details

Within each listing block lies the useful info like price and broker details. To extract those values, the selectors dig deeper into the block's hierarchy:

Broker Info

val brokerInfo = listingBlock.selectFirst("div.BrokerTitle\\_brokerTitle\\_\\_ZkbBW")

val brokerName = brokerInfo?.selectFirst("span.BrokerTitle\\_titleText\\_\\_20u1P")?.text()?.trim() ?: "N/A"

Here's what's happening:

Get div containing broker info
Select the span element inside for the name text
Extract .text() string value
Trim whitespace with .trim()
Return "N/A" if missing (?. operators)

This drills down multiple levels to get the desired text!

Other Examples

It follows a similar pattern for all fields:

val status = listingBlock.selectFirst("div.message")?.text()?.trim() ?: "N/A"

val price = listingBlock.selectFirst("div.card-price")?.text()?.trim() ?: "N/A"

val bedsElement = listingBlock.selectFirst("li\\[data-testid=property-meta-beds\\]")
// ...get text from bedsElement

Finding the right selectors takes experimentation - viewing the page HTML and trial and error.

Tools like the browser Developer Tools help speed this up too!

But the patterns are consistent:

Match on element id, class names or attributes

Traverse down through divs to reach target text

Extract .text() value

Return default if missing

Practice Tip: Try tweaking the selectors in this code and observe the impact!

Outputting Scraped Data

Finally, with all the key details extracted, we simply print them out:

println("Broker: $brokerName")
println("Status: $status")
println("Price: $price")
// ...

The end result shows all listings, neatly formatted with the scraped information.

And there you have it - from connecting to the page to extracting each field, that's everything this scraper is doing under the hood!

Key Takeways

Let's recap some key lessons around Jsoup and web scraping:

Tools like Jsoup handle much of the low-level work, from sending requests to parsing HTML

User agent headers mimic real browsers

CSS selectors pinpoint elements to extract, often by class/id

Nested selectors dig deeper into the HTML hierarchy

Text and attributes can be extracted from matched Element objects

Missing data must be handled (default values)

The concepts apply the same whether scraping listings, articles or any other site content.

While it takes practice, scraping opens up countless possibilities to utilize web data in your programs!

Full Code

For reference, here is the complete code once more:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element

fun main() {
    // Define the URL of the Realtor.com search page
    val url = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA"

    // Define a User-Agent header
    val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"

    // Send a GET request to the URL with the User-Agent header
    val doc = Jsoup.connect(url).userAgent(userAgent).get()

    // Check if the request was successful (status code 200)
    if (doc != null) {
        // Find all the listing blocks using the provided class name
        val listingBlocks = doc.select("div.BasePropertyCard_propertyCardWrap__J0xUj")

        // Loop through each listing block and extract information
        for (listingBlock in listingBlocks) {
            // Extract the broker information
            val brokerInfo = listingBlock.selectFirst("div.BrokerTitle_brokerTitle__ZkbBW")
            val brokerName = brokerInfo?.selectFirst("span.BrokerTitle_titleText__20u1P")?.text()?.trim() ?: "N/A"

            // Extract the status (e.g., For Sale)
            val status = listingBlock.selectFirst("div.message")?.text()?.trim() ?: "N/A"

            // Extract the price
            val price = listingBlock.selectFirst("div.card-price")?.text()?.trim() ?: "N/A"

            // Extract other details like beds, baths, sqft, and lot size
            val bedsElement = listingBlock.selectFirst("li[data-testid=property-meta-beds]")
            val bathsElement = listingBlock.selectFirst("li[data-testid=property-meta-baths]")
            val sqftElement = listingBlock.selectFirst("li[data-testid=property-meta-sqft]")
            val lotSizeElement = listingBlock.selectFirst("li[data-testid=property-meta-lot-size]")

            // Check if the elements exist before extracting their text
            val beds = bedsElement?.text()?.trim() ?: "N/A"
            val baths = bathsElement?.text()?.trim() ?: "N/A"
            val sqft = sqftElement?.text()?.trim() ?: "N/A"
            val lotSize = lotSizeElement?.text()?.trim() ?: "N/A"

            // Extract the address
            val address = listingBlock.selectFirst("div.card-address")?.text()?.trim() ?: "N/A"

            // Print the extracted information
            println("Broker: $brokerName")
            println("Status: $status")
            println("Price: $price")
            println("Beds: $beds")
            println("Baths: $baths")
            println("Sqft: $sqft")
            println("Lot Size: $lotSize")
            println("Address: $address")
            println("-".repeat(50))  // Separating listings
        }
    } else {
        println("Failed to retrieve the page.")
    }
}

Scraping Real Estate Listings From Realtor in Kotlin

Scraping Overview

Importing Jsoup

Defining the Target URL

Creating a User Agent Header

Fetching the Page HTML

Selecting Elements to Scrape

Understanding CSS Selectors

Selecting Realtor Listing Blocks

Extracting Listing Details

Broker Info

Other Examples

Outputting Scraped Data

Key Takeways

Full Code

Browse by tags:

Browse by language:

The easiest way to do Web Scraping

Scraping Real Estate Listings From Realtor in Kotlin

Scraping Overview

Importing Jsoup

Defining the Target URL

Creating a User Agent Header

Fetching the Page HTML

Selecting Elements to Scrape

Understanding CSS Selectors

Selecting Realtor Listing Blocks

Extracting Listing Details

Broker Info

Other Examples

Outputting Scraped Data

Key Takeways

Full Code

The easiest way to do Web Scraping

Don't leave just yet!