Scraping Booking.com Property Listings in Perl in 2023

Oct 15, 2023 · 4 min read

In this article, we will learn how to scrape property listings from Booking.com using Perl. We will use LWP::UserAgent to fetch the search results page and Mojo::DOM to parse the HTML and extract details such as property name, location, and rating.

Prerequisites

To follow along, you will need:

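  • Perl 5.16 or newer (recent Mojolicious releases no longer support older versions)
  • Basic Perl and HTML knowledge
  • The LWP::UserAgent, LWP::Protocol::https, and Mojolicious (which provides Mojo::DOM) modules from CPAN

Importing Modules

Import the modules we need: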

    use LWP::UserAgent;
    use Mojo::DOM;
    

Defining URL

Define the target URL:

    my $url = 'https://www.booking.com/searchresults.en-gb.html?ss=New+York&checkin=2023-03-01&checkout=2023-03-05&group_adults=2';
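Hard-coding the query string makes it awkward to change the destination or dates. A minimal sketch of building the same URL from named parameters, assuming the URI module (installed alongside LWP) is available; the parameter names are copied from the URL above, and nothing here confirms which parameters Booking.com actually requires:

```perl
use strict;
use warnings;
use URI;

# Base search page from the article; the parameters below mirror its query string.
my $uri = URI->new('https://www.booking.com/searchresults.en-gb.html');

# query_form escapes reserved characters in each value and writes spaces as '+'.
$uri->query_form(
    ss           => 'New York',
    checkin      => '2023-03-01',
    checkout     => '2023-03-05',
    group_adults => 2,
);

my $url = $uri->as_string;
print "$url\n";
```

Passing the parameters as a list rather than a hash reference keeps them in the order written.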
    

Setting User Agent

Set a browser-like user agent string:

    my $user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36';
    

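Fetching HTML Page

Use LWP::UserAgent to make the request:

    my $ua = LWP::UserAgent->new(agent => $user_agent);

    my $response = $ua->get($url);
    die 'Request failed: ' . $response->status_line unless $response->is_success;
    my $html = $response->decoded_content;
    binmode(STDOUT, ':encoding(UTF-8)');  # text is now characters, so encode output

We pass the user agent string through the agent option (the constructor has no headers option), fetch the page, and stop if the request fails. decoded_content returns the body as decoded characters, which is what Mojo::DOM expects, so STDOUT is set to encode as UTF-8 for printing later.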

Parsing HTML

Parse the HTML using Mojo::DOM:

    my $dom = Mojo::DOM->new($html);
    

Extracting Cards

Select the div elements whose data-testid attribute is property-card:

    my @cards = $dom->find('div[data-testid=property-card]')->each;

This returns one Mojo::DOM node per property card.

Processing Each Card

Loop through the cards:

    foreach my $card (@cards) {

      # Extract data from $card

    }

Inside the loop we can extract details from each $card node.

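Extracting Title

Get the h3 text. The title usually sits inside elements nested in the h3, so use all_text, which includes text from descendants; text returns only the element's own text:

    my $title = $card->at('h3')->all_text;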
    

Extracting Location

Get the address span text:

    my $location = $card->at('span[data-testid=address]')->text;
    

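Extracting Rating

Get the aria-label attribute value:

    my $rating = $card->at('div.e4755bbd60')->attr('aria-label');

This filters by class. Class names such as e4755bbd60 appear to be auto-generated and can change without notice, so check them against the current page source before running the script.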

Extracting Review Count

Get the div text:

    my $review_count = $card->at('div.abf093bdfe')->text;
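The rating and review count come back as display strings rather than numbers. A minimal sketch of pulling the numeric parts out with regular expressions; the helper names parse_score and parse_count are hypothetical, and the sample strings are assumptions about the page's wording, not captured output:

```perl
use strict;
use warnings;

# Return the first decimal number in a label such as "Scored 8.5", or undef.
sub parse_score {
    my ($label) = @_;
    return undef unless defined $label;
    my ($score) = $label =~ /(\d+(?:\.\d+)?)/;
    return $score;
}

# Return the first integer in text such as "1,234 reviews", or undef.
sub parse_count {
    my ($text) = @_;
    return undef unless defined $text;
    my ($digits) = $text =~ /(\d[\d,]*)/;
    return undef unless defined $digits;
    $digits =~ tr/,//d;    # drop thousands separators
    return $digits + 0;
}

# Assumed sample strings, for illustration only.
printf "%s / %s\n", parse_score('Scored 8.5'), parse_count('1,234 reviews');
```

Both helpers return undef when no number is present, so the caller can tell a missing value apart from zero.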
    

Extracting Description

Get the description div text:

    my $description = $card->at('div.d7449d770c')->text;
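Each at call returns undef when its selector matches nothing, and calling a method on undef ends the script, so one card without a rating or description stops the whole run. A minimal sketch of a guard, assuming a hypothetical helper named text_or and made-up markup standing in for a card that has no address:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: trimmed text of the first match inside $node, or a fallback.
sub text_or {
    my ($node, $selector, $fallback) = @_;
    my $el = $node->at($selector);
    return $fallback unless $el;
    my $text = $el->all_text;
    $text =~ s/^\s+|\s+$//g;
    return length($text) ? $text : $fallback;
}

# Made-up markup, not copied from Booking.com.
my $card = Mojo::DOM->new(
    '<div data-testid="property-card"><h3><a><span>Hotel Example</span></a></h3></div>'
)->at('div[data-testid=property-card]');

my $title    = text_or($card, 'h3', 'N/A');
my $location = text_or($card, 'span[data-testid=address]', 'N/A');
print "Title: $title\nLocation: $location\n";
```

The same check applies to attr: test the result of at before asking it for an attribute.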
    

Printing Output

Print the extracted data:

    print "Title: $title\n";
    print "Location: $location\n";
    print "Rating: $rating\n";
    print "Review Count: $review_count\n";
    print "Description: $description\n";
    

Full Script

Here is the complete Perl scraping script:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use Mojo::DOM;

    my $url = 'https://www.booking.com/searchresults.en-gb.html?ss=New+York&checkin=2023-03-01&checkout=2023-03-05&group_adults=2';

    my $user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36';

    # The agent option sets the User-Agent header; new() has no headers option
    my $ua = LWP::UserAgent->new(agent => $user_agent);

    my $response = $ua->get($url);
    die 'Request failed: ' . $response->status_line unless $response->is_success;
    my $html = $response->decoded_content;

    # The parsed text is decoded characters, so encode output as UTF-8
    binmode(STDOUT, ':encoding(UTF-8)');

    my $dom = Mojo::DOM->new($html);

    my @cards = $dom->find('div[data-testid=property-card]')->each;

    foreach my $card (@cards) {

      # all_text includes text from elements nested inside the h3
      my $title = $card->at('h3')->all_text;
      my $location = $card->at('span[data-testid=address]')->text;
      my $rating = $card->at('div.e4755bbd60')->attr('aria-label');
      my $review_count = $card->at('div.abf093bdfe')->text;
      my $description = $card->at('div.d7449d770c')->text;

      print "Title: $title\n";
      print "Location: $location\n";
      print "Rating: $rating\n";
      print "Review Count: $review_count\n";
      print "Description: $description\n";

    }
    

This extracts key data from Booking.com listings using Perl. The same approach works for other sites that deliver their listings as server-rendered HTML; only the selectors change.
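Printed lines are hard to feed into other tools. A minimal sketch of emitting the results as JSON instead, using the core JSON::PP module; the helper name listings_to_json and the placeholder rows are assumptions for illustration, and in the scraper each hash would be filled inside the card loop:

```perl
use strict;
use warnings;
use JSON::PP;

# Serialize an array of listing hashes as pretty-printed JSON with sorted keys.
sub listings_to_json {
    my ($listings) = @_;
    return JSON::PP->new->utf8->canonical->pretty->encode($listings);
}

# Placeholder rows, not real scraped data.
my @listings = (
    { title => 'Hotel Example', location => 'Manhattan, New York', rating => 'Scored 8.5' },
    { title => 'Sample Inn',    location => 'Brooklyn, New York',  rating => undef },
);

my $json = listings_to_json(\@listings);
print $json;
```

Because the utf8 option makes encode return UTF-8 bytes, this output should go to a handle without an encoding layer to avoid double encoding.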

While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

With Proxies API combined with Perl modules like Mojo::DOM, you can scrape data at scale with a lower risk of being blocked.
