Scraping Booking.com Property Listings in Scala in 2023

Oct 15, 2023 · 4 min read

In this article, we will learn how to scrape property listings from Booking.com using Scala. We will use sttp to fetch the HTML and jsoup, a Java HTML parser that is easy to call from Scala, to parse it and extract details such as property name, location and rating.

Prerequisites

To follow along, you will need:

  • JDK 8+ installed
  • SBT build tool
  • Basic Scala and HTML knowledge

Adding Dependencies

We will use sttp for sending HTTP requests and jsoup for parsing HTML.

Add them to build.sbt:

    libraryDependencies ++= Seq(
      "com.softwaremill.sttp.client3" %% "core" % "3.9.0",
      "org.jsoup" % "jsoup" % "1.16.1"
    )

jsoup is a Java library, so it takes a single % rather than %%. sbt downloads both packages the next time the project is built.

Importing Libraries

Import the sttp client API:

    import sttp.client3._


Defining the URL

Define the target URL to scrape:

    val url = "https://www.booking.com/searchresults.en-gb.html?ss=New+York&checkin=2023-03-01&checkout=2023-03-05&group_adults=2"
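
Hard-coding the query string makes it awkward to change the city or the dates. As a minimal sketch, the URL can be assembled from its parts. The SearchUrl object and its build method are hypothetical names introduced here, and the parameter names (ss, checkin, checkout, group_adults) are simply copied from the example URL, not taken from any documented API.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical helper: assembles a search URL from its query parameters.
object SearchUrl {
  private val base = "https://www.booking.com/searchresults.en-gb.html"

  def build(city: String, checkin: String, checkout: String, adults: Int): String = {
    // Parameter names are copied from the example URL (an assumption, not a documented API)
    val params = Seq(
      "ss" -> city,
      "checkin" -> checkin,
      "checkout" -> checkout,
      "group_adults" -> adults.toString
    )
    // URLEncoder form-encodes each value, so a space becomes "+"
    val query = params
      .map { case (key, value) => s"$key=${URLEncoder.encode(value, StandardCharsets.UTF_8.name())}" }
      .mkString("&")
    s"$base?$query"
  }
}
```

Calling SearchUrl.build("New York", "2023-03-01", "2023-03-05", 2) produces the same URL as the literal used in this article.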
    

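Setting User Agent

Define a browser-like User-Agent string to send with the request, since some sites may reject requests that carry an HTTP client's default User-Agent:

    val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
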

Fetching the Page

Use sttp to send a request and get the response:

    val backend = HttpURLConnectionBackend()

    val response = basicRequest
      .get(uri"$url")
      .header("User-Agent", userAgent)
      .send(backend)

HttpURLConnectionBackend is sttp's synchronous backend, so send blocks until the response arrives. The request itself carries the target URL and the User-Agent header.

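Parsing the HTML

sttp returns the body as an Either: Right holds the page on success, and Left holds the error body of a non-2xx response. Unwrap it, then load the HTML into a jsoup Document:

    import org.jsoup.Jsoup
    // Fail fast on a non-2xx response instead of parsing an error page
    val html = response.body.getOrElse(sys.error("Request failed"))
    val page = Jsoup.parse(html)

Extracting Cards

Use a CSS attribute selector to get the div elements whose data-testid attribute is "property-card":

    import scala.jdk.CollectionConverters._ // Scala 2.13+, provides .asScala
    val cards = page.select("div[data-testid=property-card]").asScala

This returns the property card divs as a Scala collection of jsoup elements.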

Processing Each Card

Loop through the cards:

    cards.foreach { card =>
      // Extract data from card
    }

Inside the loop, we extract the details from each card.

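Extracting Title

Get the text of the card's h3 element using jsoup's select, which takes a CSS selector:

    val title = card.select("h3").text()

Extracting Location

Get the text of the span marked with data-testid="address":

    val location = card.select("span[data-testid=address]").text()

Extracting Rating

Get the value of the aria-label attribute:

    val rating = card.select("div.e4755bbd60").attr("aria-label")

Here the div is selected by CSS class. Class names like e4755bbd60 look auto-generated and may change, so check them against the live page before relying on them.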
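
The aria-label comes back as text. If a numeric score is needed, one option is to pull the first number out of it. The sketch below assumes the label contains the score as a plain number, for example "Scored 8.5" (that wording is an assumption, so check it against the live markup). RatingParser is a hypothetical helper name.

```scala
// Hypothetical helper: extracts a numeric score from a rating label.
object RatingParser {
  // Matches the first integer or decimal number in the label
  private val Number = """\d+(?:\.\d+)?""".r

  // Returns None when the label is empty or contains no number
  def parse(label: String): Option[Double] =
    Number.findFirstIn(label).map(_.toDouble)
}
```

Returning an Option keeps a card without a rating from crashing the loop; the caller decides how to treat None.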

Extracting Review Count

Get the text of the div that holds the review count, again selected by CSS class with jsoup:

    val reviewCount = card.select("div.abf093bdfe").text()
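
The review count is also plain text. A minimal sketch for turning it into an integer follows, assuming the text looks something like "1,234 reviews" (the exact format is an assumption and may vary by locale). ReviewCountParser is a hypothetical helper name.

```scala
import scala.util.Try

// Hypothetical helper: extracts an integer count from review text.
object ReviewCountParser {
  def parse(text: String): Option[Int] = {
    // Keep only the digits, so "1,234 reviews" becomes "1234"
    val digits = text.filter(_.isDigit)
    // Try guards against values too large for an Int
    if (digits.isEmpty) None else Try(digits.toInt).toOption
  }
}
```

This simple digit filter would merge several numbers in one string into a single value, so it only suits text that contains one count.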
    

Extracting Description

Get the text of the description div, selected by CSS class with jsoup:

    val description = card.select("div.d7449d770c").text()
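
A missing element is easy to mistake for an empty one. As a sketch, assuming the page is parsed with jsoup (org.jsoup:jsoup on the classpath): select(...).text() yields an empty string when nothing matches, whereas selectFirst returns null, which Option can turn into None. SafeExtract is a hypothetical helper name, and the sample markup is made up for the demonstration.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Element

// Hypothetical helper: returns None when the element is absent or has no text.
object SafeExtract {
  def textOf(card: Element, cssQuery: String): Option[String] =
    Option(card.selectFirst(cssQuery)) // null (no match) becomes None
      .map(_.text())
      .filter(_.nonEmpty)
}

object SafeExtractDemo extends App {
  // Made-up sample markup, shaped like a property card
  val html = """<div data-testid="property-card"><h3>Example Hotel</h3></div>"""
  val card = Jsoup.parse(html).selectFirst("div[data-testid=property-card]")

  println(SafeExtract.textOf(card, "h3"))                        // Some(Example Hotel)
  println(SafeExtract.textOf(card, "span[data-testid=address]")) // None
}
```

With Option values, a card that lacks a field can be skipped or logged instead of silently printing a blank line.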
    

Printing the Data

Still inside the loop, print the extracted details:

    println(s"Name: $title")
    println(s"Location: $location")
    println(s"Rating: $rating")
    println(s"Review Count: $reviewCount")
    println(s"Description: $description")
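
Printing is fine for a quick check, but scraped results are usually kept for later use. One possible approach is to collect each card into a small record and render it as a CSV row. Listing and ListingCsv are hypothetical names introduced for this sketch, and all fields are kept as strings, exactly as extracted.

```scala
// Hypothetical record type; the fields mirror the values extracted per card.
final case class Listing(
  title: String,
  location: String,
  rating: String,
  reviewCount: String,
  description: String
)

object ListingCsv {
  // Wrap the field in quotes and double any embedded quotes (the usual CSV convention)
  private def quote(field: String): String =
    "\"" + field.replace("\"", "\"\"") + "\""

  // Render the fields in declaration order, separated by commas
  def row(listing: Listing): String =
    listing.productIterator.map(value => quote(value.toString)).mkString(",")
}
```

Quoting every field keeps commas inside values such as "Manhattan, New York" from being read as column separators.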
    

Full Script

Here is the complete scraping script. It uses sttp client3 for the request and jsoup for parsing, and needs Scala 2.13 or later for scala.jdk.CollectionConverters:

    import org.jsoup.Jsoup
    import scala.jdk.CollectionConverters._
    import sttp.client3._

    object BookingScraper extends App {

      val url = "https://www.booking.com/searchresults.en-gb.html?ss=New+York&checkin=2023-03-01&checkout=2023-03-05&group_adults=2"

      val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"

      // Synchronous backend: send blocks until the response arrives
      val backend = HttpURLConnectionBackend()

      val response = basicRequest
        .get(uri"$url")
        .header("User-Agent", userAgent)
        .send(backend)

      // body is an Either: Left holds the error body of a non-2xx response
      val html = response.body.getOrElse(sys.error("Request failed"))
      val page = Jsoup.parse(html)

      // Each search result is a div marked data-testid="property-card"
      val cards = page.select("div[data-testid=property-card]").asScala

      cards.foreach { card =>
        val title = card.select("h3").text()
        val location = card.select("span[data-testid=address]").text()
        // The class names below look auto-generated; verify them against the live page
        val rating = card.select("div.e4755bbd60").attr("aria-label")
        val reviewCount = card.select("div.abf093bdfe").text()
        val description = card.select("div.d7449d770c").text()

        println(s"Name: $title")
        println(s"Location: $location")
        println(s"Rating: $rating")
        println(s"Review Count: $reviewCount")
        println(s"Description: $description")
      }
    }


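This script fetches Booking.com search results and extracts the key fields from each listing using Scala. The same approach works for other sites that serve their listings as HTML, once the URL and selectors are adjusted to match each site's markup.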

While these examples are great for learning, scraping production sites brings challenges such as CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

With Proxies API combined with the Scala tools used in this article, you can scrape data at scale without getting blocked.

