Scraping All Images from a Website with Perl

Dec 13, 2023 · 7 min read

This guide walks through a Perl script that scrapes image URLs and other data from a Wikimedia Commons page. We will extract the names, groups, local names, and image URLs for all dog breeds listed on the page.

This is the page we are talking about: https://commons.wikimedia.org/wiki/List_of_dog_breeds

Modules Used

The script uses the following modules which may need to be installed:

use LWP::UserAgent;
use HTML::TreeBuilder;

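To install these, run the command below. LWP::Protocol::https is included because the page is fetched over HTTPS and LWP needs that module for https:// URLs:

cpan LWP::UserAgent LWP::Protocol::https HTML::TreeBuilder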

Define URL and User Agent

First we define the URL of the Wikimedia Commons page we want to scrape:

my $url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';

Next we create a User Agent header to mimic a browser request:

my $ua = LWP::UserAgent->new(
  agent => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
);

Send Request and Parse HTML

We send a GET request for the URL and check if it succeeded:

my $response = $ua->get($url);

if ($response->is_success) {

  # Parse the decoded HTML; parse_content() also signals the end of the document
  my $tree = HTML::TreeBuilder->new;
  $tree->parse_content($response->decoded_content);

  # ... rest of code
}

If successful, we use HTML::TreeBuilder to parse the HTML content into an object structure we can traverse.
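
As a minimal sketch of the parsing step in isolation, the snippet below feeds a small hand-written HTML string to HTML::TreeBuilder. The markup is invented for illustration and stands in for the response body. It uses parse_content(), which parses the whole string and closes the document in one call; with the lower-level parse() you would also need to call eof() afterwards.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Stand-in for the downloaded page; this markup is invented for the example.
my $html = '<html><body><h1>Dog breeds</h1><p>Example page</p></body></html>';

# Build the tree from the complete document in a single call.
my $tree = HTML::TreeBuilder->new;
$tree->parse_content($html);

# look_down returns the first matching element in scalar context, or undef.
my $heading      = $tree->look_down(_tag => 'h1');
my $heading_text = $heading ? $heading->as_text : '';
print "Heading: $heading_text\n";

# Release the tree explicitly once we are done with it.
$tree->delete;
```

Checking that look_down returned something before calling as_text avoids a fatal "can't call method on an undefined value" error when the page layout differs from what we expect.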

Extract Data from the Table

Inspecting the page

If you open the page with Chrome's Inspect tool, you can see that the data is in a table element with the classes wikitable and sortable.

We find this table element:

my $table = $tree->look_down(_tag => 'table', class => 'wikitable sortable');
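
Note that class => 'wikitable sortable' compares against the whole attribute string, so the lookup misses a table whose class attribute lists the names in another order or carries extra ones. One possible alternative, sketched here on invented HTML, is to pass a code reference that checks the individual class names, and to stop early if no table is found.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented markup: the first table has an extra class name.
my $html = <<'HTML';
<table class="wikitable sortable jquery-tablesorter"><tr><th>Name</th></tr></table>
<table class="infobox"><tr><th>Other</th></tr></table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# The code reference receives each candidate element as its first argument.
my $table = $tree->look_down(
    _tag => 'table',
    sub {
        my %classes = map { $_ => 1 } split ' ', ($_[0]->attr('class') // '');
        return $classes{wikitable} && $classes{sortable};
    },
);

die "No wikitable found\n" unless $table;

my $class_attr = $table->attr('class');
print "Matched table with class: $class_attr\n";
```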

We define arrays to store the scraped data fields:

my @names;
my @groups;
my @local_names;
my @photographs;

And create a folder to save images:

mkdir('dog_images') unless -d 'dog_images';

Understanding the Selectors

The most complex part is extracting the data within each row of the table. This is done by the selector code:

my @rows = $table->look_down(_tag => 'tr');
shift @rows; # skip header row

for my $row (@rows) {

  my @columns = $row->look_down(_tag => qr/^(td|th)$/);

  if (@columns == 4) {

    # Extract data from each column
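    # Fall back to the cell text if the first column has no link
    my $link = $columns[0]->look_down(_tag => 'a');
    my $name = $link ? $link->as_text : $columns[0]->as_text;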
    my $group = $columns[1]->as_text;

    my $span_tag = $columns[2]->look_down(_tag => 'span');
    my $local_name = $span_tag ? $span_tag->as_text : '';

    my $img_tag = $columns[3]->look_down(_tag => 'img');
    my $photograph = $img_tag ? $img_tag->attr('src') : '';

    # Download images
    if ($photograph) {
      # image download code (see the full script below)
    }

    # Store data
    push @names, $name;
    push @groups, $group;
    push @local_names, $local_name;
    push @photographs, $photograph;

  }
}

This code loops through each row, gets the columns, and extracts data from the columns:

Name Column

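The name is the text of the <a> tag inside the first column, found with look_down(_tag => 'a') and read with as_text.

Group Column

The group name is simply the text content of the second column, so calling as_text on the cell is enough.

Local Name Column

The third column may contain a <span> tag holding the local name. We check whether it exists and fall back to an empty string if it does not.

Image Column

We check whether there is an <img> tag inside the fourth column.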

If found, we extract the src attribute which contains the image URL.
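
The src value is not always a complete URL: it can be relative or protocol-relative (starting with //). A hedged sketch of how it could be turned into something downloadable, using the URI module that is installed alongside LWP, is shown here. The src value and breed name are invented for illustration.

```perl
use strict;
use warnings;
use URI;

# Page the <img> tag was found on (from the article).
my $page_url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';

# Hypothetical src value in protocol-relative form.
my $src = '//upload.example.org/images/thumb/example-hound.jpg';

# new_abs() resolves a relative reference against a base URL.
my $image_url = URI->new_abs($src, $page_url)->as_string;

# Build a file name that is safe on most file systems by collapsing
# anything other than word characters, dots, and hyphens to "_".
my $breed = 'Example/Hound (UK)';    # invented name
(my $safe_name = $breed) =~ s/[^\w.-]+/_/g;

# Reuse the extension from the URL path, defaulting to jpg.
my ($ext) = URI->new($image_url)->path =~ /\.([A-Za-z0-9]+)$/;
my $filename = sprintf 'dog_images/%s.%s', $safe_name, lc($ext // 'jpg');

print "$image_url\n$filename\n";
```

Sanitising the name matters because a breed name containing a slash would otherwise be treated as a directory separator when the file is opened.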

The key things to understand are:

  • look_down() searches elements recursively for matching selectors
  • as_text returns the text content of an element
  • attr() gets the attribute value from a tag

This allows us to traverse the HTML structure and extract precisely the data we want.

The rest of the code downloads images and stores the scraped data into the arrays.
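
To see look_down(), as_text, and attr() working together without hitting the network, here is a small self-contained sketch. The table is invented and only mimics the four-column shape described above. Unlike the article's script, it collects one hash per row rather than four parallel arrays, and skips the header by matching only td cells; both are choices made for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented table in the same four-column shape as the one in the article.
my $html = <<'HTML';
<table class="wikitable sortable">
  <tr><th>Name</th><th>Group</th><th>Local name</th><th>Photograph</th></tr>
  <tr>
    <td><a href="/wiki/Example_Hound">Example Hound</a></td>
    <td>Hounds</td>
    <td><span lang="de">Beispielhund</span></td>
    <td><img src="//upload.example.org/hound.jpg"></td>
  </tr>
  <tr>
    <td><a href="/wiki/Sample_Terrier">Sample Terrier</a></td>
    <td>Terriers</td>
    <td></td>
    <td></td>
  </tr>
</table>
HTML

my $tree  = HTML::TreeBuilder->new_from_content($html);
my $table = $tree->look_down(_tag => 'table') or die "table not found\n";

my @breeds;
for my $row ($table->look_down(_tag => 'tr')) {
    my @cells = $row->look_down(_tag => 'td');
    next unless @cells == 4;    # the header row has only <th> cells

    # Each lookup may return undef, so test before calling methods on it.
    my $link = $cells[0]->look_down(_tag => 'a');
    my $span = $cells[2]->look_down(_tag => 'span');
    my $img  = $cells[3]->look_down(_tag => 'img');

    push @breeds, {
        name       => $link ? $link->as_text : $cells[0]->as_text,
        group      => $cells[1]->as_text,
        local_name => $span ? $span->as_text : '',
        photograph => $img  ? $img->attr('src') : '',
    };
}

printf "%s | %s | %s | %s\n", @{$_}{qw(name group local_name photograph)}
    for @breeds;
```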

Output Data

Finally, the data can be printed out by looping over the four arrays in parallel, one record per breed.
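
To make the printing step concrete, here is a small sketch that uses invented sample values in place of the arrays filled by the scraping loop. The labels match the ones used in the full script.

```perl
use strict;
use warnings;

# Sample values standing in for the scraped arrays (invented data).
my @names       = ('Example Hound', 'Sample Terrier');
my @groups      = ('Hounds', 'Terriers');
my @local_names = ('Beispielhund', '');
my @photographs = ('//upload.example.org/hound.jpg', '');

# The arrays are parallel: index $i refers to the same breed in each one.
my @report;
for my $i (0 .. $#names) {
    push @report, join "\n",
        "Name: $names[$i]",
        "FCI Group: $groups[$i]",
        "Local Name: $local_names[$i]",
        "Photograph: $photographs[$i]";
}

# Separate the records with a blank line.
print join("\n\n", @report), "\n";
```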

So in summary, this script:

1. Fetches the web page HTML
2. Parses it into a traversable structure
3. Uses selectors to extract specific data
4. Downloads images
5. Stores and prints the scraped data

Full Code

Here is the complete runnable script:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TreeBuilder;
    
    # URL of the Wikimedia Commons page
    my $url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';
    
    # Define a user-agent header to simulate a browser request
    my $ua = LWP::UserAgent->new(
        agent => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    );
    
    # Send an HTTP GET request to the URL with the headers
    my $response = $ua->get($url);
    
    # Check if the request was successful (status code 200)
    if ($response->is_success) {
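        # decoded_content returns character data, so print it back out as UTF-8
        my $content = $response->decoded_content;
        binmode(STDOUT, ':encoding(UTF-8)');
    
        # Parse the HTML content of the page (parse_content also ends the document)
        my $tree = HTML::TreeBuilder->new;
        $tree->parse_content($content);
    
        # Find the table with class 'wikitable sortable'
        my $table = $tree->look_down(_tag => 'table', class => 'wikitable sortable')
            or die "Could not find the 'wikitable sortable' table on the page";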
    
        # Initialize arrays to store the data
        my @names;
        my @groups;
        my @local_names;
        my @photographs;
    
        # Create a directory to save the images
        mkdir('dog_images') unless -d 'dog_images';
    
        # Iterate through rows in the table (skip the header row)
        my @rows = $table->look_down(_tag => 'tr');
        shift @rows;  # Skip the header row
    
        for my $row (@rows) {
            my @columns = $row->look_down(_tag => qr/^(td|th)$/);
            if (@columns == 4) {
                # Extract data from each column (fall back to the cell text if there is no link)
                my $link = $columns[0]->look_down(_tag => 'a');
                my $name = $link ? $link->as_text : $columns[0]->as_text;
                my $group = $columns[1]->as_text;
    
                # Check if the third column contains a span element
                my $span_tag = $columns[2]->look_down(_tag => 'span');
                my $local_name = $span_tag ? $span_tag->as_text : '';
    
                # Check for the existence of an image tag within the fourth column
                my $img_tag = $columns[3]->look_down(_tag => 'img');
                my $photograph = $img_tag ? $img_tag->attr('src') : '';
    
                # Download the image and save it to the folder
                if ($photograph) {
                    # The src may be protocol-relative (//host/path), so add a scheme
                    my $image_url = $photograph =~ m{^//} ? "https:$photograph" : $photograph;
                    # Replace characters that are unsafe in file names
                    (my $safe_name = $name) =~ s/[^\w.-]+/_/g;
                    my $image_filename = "dog_images/$safe_name.jpg";
                    my $img_response = $ua->get($image_url);
                    if ($img_response->is_success) {
                        open(my $img_file, '>:raw', $image_filename) or die "Cannot open $image_filename: $!";
                        print $img_file $img_response->content;
                        close($img_file);
                    }
                }
    
                # Push data into respective arrays
                push @names, $name;
                push @groups, $group;
                push @local_names, $local_name;
                push @photographs, $photograph;
            }
        }
    
        # Print or process the extracted data as needed
        for my $i (0..$#names) {
            print "Name: $names[$i]\n";
            print "FCI Group: $groups[$i]\n";
            print "Local Name: $local_names[$i]\n";
            print "Photograph: $photographs[$i]\n";
            print "\n";
        }
    }
    else {
        die "Failed to retrieve the web page. Status code: " . $response->code;
    }

In more advanced implementations, you will even need to rotate the User-Agent string so the website can't tell it's the same browser.
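
A rough sketch of what such rotation could look like with LWP::UserAgent follows. The first string is the one used in this article; the other two are illustrative values rather than a vetted list, and rotate_user_agent is a helper name made up for this example.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Candidate User-Agent strings (illustrative, not a vetted list).
my @user_agents = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
);

my $ua = LWP::UserAgent->new(timeout => 30);

# Hypothetical helper: pick a random string and set it on the agent.
sub rotate_user_agent {
    my ($agent) = @_;
    my $choice = $user_agents[ int rand @user_agents ];
    $agent->agent($choice);
    return $choice;
}

# Call this before each request; $ua->get(...) then sends the chosen header.
my $chosen = rotate_user_agent($ua);
print "Next request will use: $chosen\n";
```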

Going a step further, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA-solving technology

With these, hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
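
For the Perl script in this guide, the same call could be sketched as follows. The endpoint and the key and url parameter names are taken from the curl example above; API_KEY is a placeholder, and nothing beyond what that example shows is assumed about the API.

```perl
use strict;
use warnings;
use URI;

# Placeholder key and the page we want fetched through the proxy.
my $api_key    = 'API_KEY';
my $target_url = 'https://example.com';

# query_form() percent-encodes the values, so the target URL is passed safely.
my $api_url = URI->new('http://api.proxiesapi.com/');
$api_url->query_form(key => $api_key, url => $target_url);

print "$api_url\n";
# The resulting URL can then be fetched like any other, e.g. $ua->get($api_url).
```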

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
