Scraping All The Images From a Website in PHP

Dec 13, 2023 · 9 min read

The goal of the PHP script we will be discussing is to scrape all images available on a Wikipedia page that lists dog breeds. Specifically, it extracts the name, group, local name, and image URL for each breed listed on the page.

This is the page we are talking about…

Importing Required Libraries

We first need to include the PHP library we will use to parse HTML (the HTTP requests themselves are handled by PHP's built-in file_get_contents()):

require 'simple_html_dom.php';

The simple_html_dom library makes it easy to parse and traverse HTML documents. We will use it later to parse the content of the Wikipedia page.

Defining the Target URL

Next, we store the URL of the Wikipedia page in a variable:

$url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';

This is the page that contains the data we want to scrape.

Setting a User Agent

Websites can identify requests coming from scripts vs browsers. To mimic a browser request, we need to define a user agent header:

$options = [
  'http' => [
    'method' => 'GET',
    'header' => 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
  ]
];

This makes the request appear like it's coming from a Chrome browser running on Windows.

Sending HTTP Request

Now we can send a GET request to fetch the content of the target URL:

$context = stream_context_create($options);

$response = file_get_contents($url, false, $context);

The stream context applies the headers we defined earlier.

We can check the return value to verify that the request succeeded:

if ($response !== false) {

  // Request succeeded

} else {

  // Request failed

}

file_get_contents() returns false on failure, so a non-false return value means the page was fetched successfully.
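If you also want the actual HTTP status code, PHP fills the $http_response_header variable with the raw response headers after a file_get_contents() call; a minimal sketch of reading it:

// The first entry of $http_response_header is the status line,
// e.g. "HTTP/1.1 200 OK".
if ($response !== false && isset($http_response_header[0])) {
    preg_match('/\s(\d{3})\s/', $http_response_header[0], $m);
    $status_code = isset($m[1]) ? (int) $m[1] : 0;

    if ($status_code !== 200) {
        echo "Unexpected HTTP status code: $status_code\n";
    }
}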

Parsing the HTML

Since the request succeeded, we have the HTML content of the Wikipedia page saved in the $response variable.

We can parse this using simple_html_dom's str_get_html() function:

$html = str_get_html($response);

This will convert the HTML into a special object that we can traverse using DOM selectors.
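As a quick illustration of how this object can be queried (the selector below is just an example, not part of the scraper):

// Grab the first link on the page and print its text and href attribute.
$first_link = $html->find('a', 0);

if ($first_link) {
    echo $first_link->plaintext . "\n";
    echo $first_link->href . "\n";
}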

Identifying Key Elements

Inspecting the page with the Chrome inspect tool, you can see that the data is in a table element with the classes wikitable and sortable.

Our goal is to extract the data from a specific table on the page. We need to first find this table element:

$table = $html->find('table.wikitable.sortable', 0);

This finds the table with classes wikitable and sortable, and returns the first match.
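It is also worth guarding against the table not being found at all, for example if the page layout changes; a small defensive sketch:

// find() returns null when nothing matches the selector.
if (!$table) {
    die("Could not find the breeds table on the page.\n");
}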

Initializing Data Arrays

Let's initialize some empty arrays to store the data we extract:

$names = [];

$groups = [];

$local_names = [];

$photographs = [];

Creating Image Directory

Since we want to download the images of each dog breed, let's create a folder called dog_images to save them:

if (!is_dir('dog_images')) {

  mkdir('dog_images');

}

This will create the folder if it doesn't already exist.

Extracting Data from Table Rows

Now we can loop through each row inside the table we located earlier:

foreach ($table->find('tr') as $row) {

  // Extract data from each row

}

Traversing Row Cells

Inside the loop, we first need to grab the cells in each row:

$columns = $row->find('td, th');

We check if there are exactly 4 cells, since rows with fewer are not data rows:

if (count($columns) == 4) {

  // Extract data from cells

}
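
Depending on the table, the header row can also contain exactly four cells (as th elements rather than td). If that turns out to be the case, here is one way to skip it explicitly, offered only as a sketch:

// Assumption: only the header row starts with a <th> cell.
if (count($columns) == 4 && $columns[0]->tag != 'th') {

  // Extract data from cells

}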

Now we can extract the data we need from the cells.

Understanding Selectors and Data Extraction

The most complex part of web scraping is identifying the correct HTML elements to extract the data you need.

This requires traversing the DOM structure and targeting elements using CSS selectors or other methods exposed by the HTML parsing library.

Let's break down how data is extracted from each cell in this script:

Name Column

The name is wrapped in an anchor tag inside the first cell:

<td>
  <a href="/dog/affenpinscher">Affenpinscher</a>
</td>

We use DOM traversal to find the anchor tag and get its plain text:

$name = trim($columns[0]->find('a', 0)->plaintext);

Breaking this down:

  • $columns[0] - Get the first cell
  • find('a', 0) - Find the first anchor tag inside the cell
  • plaintext - Extract the text content of the element
  • trim() - Remove whitespace

This stores the value "Affenpinscher" in the $name variable.
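One thing to be aware of: find('a', 0) returns null when a cell has no link, and calling plaintext on null raises an error. A slightly more defensive variant, offered only as a suggestion:

// Fall back to the cell's own text if there is no anchor tag.
$name_link = $columns[0]->find('a', 0);

$name = $name_link ? trim($name_link->plaintext) : trim($columns[0]->plaintext);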

Group Column

The group name is directly inside the second cell:

<td>FCI Group 2, Section 1</td>

So we can directly extract the cell's text content:

$group = trim($columns[1]->plaintext);

This stores "FCI Group 2, Section 1" in $group.

Local Name Column

Some rows contain a <span> inside the third cell:

<td><span>Mops</span></td>

We check if this span exists before getting its text:

$span_tag = $columns[2]->find('span', 0);

$local_name = $span_tag ? trim($span_tag->plaintext) : '';

If the span exists, we store its text in $local_name; otherwise we set it to an empty string.

Image URL Column

The last cell contains the image we want to download. We check if there is an <img> tag:

$img_tag = $columns[3]->find('img', 0);

$photograph = $img_tag ? $img_tag->src : '';

If found, we get the image source URL from the src attribute; otherwise we set it to an empty string.

As you can see, accurately locating the data relies heavily on analyzing the HTML structure and using the correct selectors and traversal methods.

The literal strings inside the selectors are kept exactly as they are because they correspond directly to elements on the page. Changing them would break the data extraction!

Downloading and Saving Images

If an image URL is found, we download and save it:

if ($photograph) {
    $image_url = $photograph;

    // Download image
    $image_data = file_get_contents($image_url);

    // Save to folder
    if ($image_data !== false) {
        $image_filename = 'dog_images/' . $name . '.jpg';
        file_put_contents($image_filename, $image_data);
    }
}

We use unique filenames like "dog_images/affenpinscher.jpg" to prevent conflicts.
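Since breed names can contain spaces or punctuation that are awkward in filenames, you might also normalize the name before building the path; a small sketch (the $safe_name variable is purely illustrative):

// Lowercase the name and replace anything that is not a letter or digit
// with an underscore before using it as a filename.
$safe_name = strtolower(preg_replace('/[^A-Za-z0-9]+/', '_', $name));

$image_filename = 'dog_images/' . $safe_name . '.jpg';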

Appending Extracted Data

After extraction, we append the data from each row to our arrays:

$names[] = $name;
$groups[] = $group;
$local_names[] = $local_name;
$photographs[] = $photograph;

This builds up the arrays containing all the scraped data.
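As a design alternative (not what this script does), the same data could be collected as one array of records, which keeps each breed's fields together:

// One associative array per breed instead of four parallel arrays.
$breeds[] = [
    'name'       => $name,
    'group'      => $group,
    'local_name' => $local_name,
    'photograph' => $photograph,
];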

Processing the Scraped Data

Finally, we can work with the data in the arrays:

for ($i = 0; $i < count($names); $i++) {
    echo "Name: " . $names[$i] . "\n";
    echo "FCI Group: " . $groups[$i] . "\n";
    echo "Local Name: " . $local_names[$i] . "\n";
    echo "Photograph: " . $photographs[$i] . "\n\n";
}

We may also write it to a file, database, etc. for future use.
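For example, here is a minimal sketch of writing the arrays to a CSV file (the breeds.csv filename is just an assumption):

// Write one header row plus one row per breed to breeds.csv.
$fp = fopen('breeds.csv', 'w');
fputcsv($fp, ['Name', 'FCI Group', 'Local Name', 'Photograph']);

for ($i = 0; $i < count($names); $i++) {
    fputcsv($fp, [$names[$i], $groups[$i], $local_names[$i], $photographs[$i]]);
}

fclose($fp);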

Full Code

Here is the complete code again for reference:

    <?php
    // Include the required PHP libraries
    require 'simple_html_dom.php';
    
    // URL of the Wikipedia page
    $url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';
    
    // Define a user-agent header to simulate a browser request
    $options = [
        'http' => [
            'method' => 'GET',
            'header' => 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
        ]
    ];
    
    $context = stream_context_create($options);
    
    // Send an HTTP GET request to the URL with the headers
    $response = file_get_contents($url, false, $context);
    
    // Check if the request was successful (HTTP status code 200)
    if ($response !== false) {
        // Parse the HTML content of the page
        $html = str_get_html($response);
    
        // Find the table with class 'wikitable sortable'
        $table = $html->find('table.wikitable.sortable', 0);
    
        // Initialize arrays to store the data
        $names = [];
        $groups = [];
        $local_names = [];
        $photographs = [];
    
        // Create a directory to save the images
        if (!is_dir('dog_images')) {
            mkdir('dog_images');
        }
    
        // Iterate through rows in the table (skip the header row)
        foreach ($table->find('tr') as $row) {
            $columns = $row->find('td, th');
            if (count($columns) == 4) {
                // Extract data from each column
                $name = trim($columns[0]->find('a', 0)->plaintext);
                $group = trim($columns[1]->plaintext);
    
                // Check if the third column contains a span element
                $span_tag = $columns[2]->find('span', 0);
                $local_name = $span_tag ? trim($span_tag->plaintext) : '';
    
                // Check for the existence of an image tag within the fourth column
                $img_tag = $columns[3]->find('img', 0);
                $photograph = $img_tag ? $img_tag->src : '';
    
                // Download the image and save it to the folder
                if ($photograph) {
                    $image_url = $photograph;
                    $image_data = file_get_contents($image_url);
                    if ($image_data !== false) {
                        $image_filename = 'dog_images/' . $name . '.jpg';
                        file_put_contents($image_filename, $image_data);
                    }
                }
    
                // Append data to respective arrays
                $names[] = $name;
                $groups[] = $group;
                $local_names[] = $local_name;
                $photographs[] = $photograph;
            }
        }
    
        // Print or process the extracted data as needed
        for ($i = 0; $i < count($names); $i++) {
            echo "Name: " . $names[$i] . "\n";
            echo "FCI Group: " . $groups[$i] . "\n";
            echo "Local Name: " . $local_names[$i] . "\n";
            echo "Photograph: " . $photographs[$i] . "\n\n";
        }
    
    } else {
        echo "Failed to retrieve the web page.\n";
    }
    ?>

Tricks and Tips

Here are some handy tricks for web scraping:

  • Use browser DevTools to inspect elements
  • View page source to understand HTML structure
  • Handle HTTP errors correctly with status codes
  • Set delays between requests to avoid overload
  • Use CSS selectors for accuracy in targeting elements
  • Normalize inconsistent data after scraping
  • In more advanced implementations you will even need to rotate the User-Agent string so the website can't tell it's the same browser (see the sketch after this list)!
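
Here is a small sketch of two of those ideas, pausing between requests and rotating the User-Agent string; the strings and the one-second delay are only examples:

// A couple of example User-Agent strings to rotate through.
$user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15',
];

foreach ($photographs as $image_url) {
    if (!$image_url) {
        continue;
    }

    // Pick a random User-Agent for this request.
    $context = stream_context_create([
        'http' => ['header' => 'User-Agent: ' . $user_agents[array_rand($user_agents)]]
    ]);

    $image_data = file_get_contents($image_url, false, $context);

    // Wait a second between downloads to avoid overloading the server.
    sleep(1);
}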

Once you get a little more advanced, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by using our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly:

  • With millions of high-speed rotating proxies located all over the world,
  • With our automatic IP rotation,
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions), and
  • With our automatic CAPTCHA solving technology,

hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API like the one below, from any programming language:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
