Downloading Images from a Website with PHP and DOM

Oct 15, 2023 · 5 min read

In this article, we will learn how to use PHP and the DOM extension to download all the images from a Wikipedia page.

---

Overview

The goal is to extract the names, breed groups, local names, and image URLs for all dog breeds listed on this Wikipedia page. We will store the image URLs, download the images and save them to a local folder.

Here are the key steps we will cover:

  1. Include required modules
  2. Send HTTP request to fetch the Wikipedia page
  3. Parse the page HTML using DOMDocument
  4. Find the table with dog breed data
  5. Iterate through the table rows
  6. Extract data from each column
  7. Download images and save locally
  8. Print/process extracted data

Let's go through each of these steps in detail.

Includes

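We begin by making sure the required extension is available:

// The DOM extension ships with PHP and is enabled by default; there is no file to include
if (!extension_loaded('dom')) {
  exit("The DOM extension is not enabled in this PHP installation\n");
}

The DOM extension allows accessing and modifying XML and HTML documents.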

Send HTTP Request

To download the web page containing the dog breed table, we need to send an HTTP GET request:

// URL of page to scrape
$url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';

// Create a stream context to set a custom user agent
$context = stream_context_create(array(
  'http' => array(
     'header' => "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
  )
));

// Send HTTP request
$response = file_get_contents($url, false, $context);

We provide a user-agent header to mimic a browser request. The file_get_contents() function returns the page content as a string, or false if the request fails.
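Because file_get_contents() signals failure by returning false, it can help to wrap the call so a failed request stops the script early. The following is a minimal sketch, not part of the original script: fetchPage() is a hypothetical helper, and the user-agent string and 10-second timeout are assumed values.

```php
<?php
// Hypothetical helper: fetch a URL (or local path) and fail loudly
// instead of handing false to the HTML parser later on.
function fetchPage(string $url, $context = null): string
{
    // @ hides the PHP warning; the failure is reported via the exception
    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        throw new RuntimeException("Could not fetch {$url}");
    }
    return $body;
}

// Assumed values: a short user agent and a 10-second timeout
$context = stream_context_create([
    'http' => [
        'header'  => "User-Agent: Mozilla/5.0 (compatible; ExampleScraper/1.0)",
        'timeout' => 10,
    ],
]);
```

With this in place, the request becomes $response = fetchPage($url, $context); wrapped in a try/catch block if the script should continue after a failure.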

Parse HTML with DOMDocument

To extract data from the page, we need to parse the HTML content. We can use DOMDocument for this:

// Load HTML into DOMDocument
$doc = new DOMDocument();
$doc->loadHTML($response);

// Initialize DOM XPath
$xpath = new DOMXPath($doc);

We load the HTML into a DOMDocument and initialize DOMXPath to query it.
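Real-world pages often contain markup that libxml reports as warnings while parsing. A common way to keep those warnings out of the output is libxml_use_internal_errors(). The sketch below uses a small stand-in HTML string rather than the real page; whether a given tag triggers a warning depends on the libxml version installed.

```php
<?php
// Stand-in markup; <section> is an HTML5 tag that older libxml versions
// may report as unknown
$html = '<html><body><section><p>Hello</p></section></body></html>';

// Collect parser warnings internally instead of printing them
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Read the collected warnings, then clear them and restore the old setting
$warnings = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The document is still usable even if warnings were recorded
$xpath = new DOMXPath($doc);
$text = $xpath->query('//p')->item(0)->textContent;

echo count($warnings) . " parser warning(s), text: {$text}\n";
```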

Find Breed Table

The dog breed data we want is contained in a table with the classes wikitable and sortable. We can use an XPath query to find it:

// Find table by class name
$table = $xpath->query('//table[contains(@class, "wikitable") and contains(@class, "sortable")]');

This returns a DOMNodeList; its first item is the <table> node.
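Since the query returns a list that may be empty, it is worth checking the result before using item(0). Here is a small sketch on stand-in markup with two tables; the variable names $tables and $breedTable are illustrative only.

```php
<?php
// Stand-in markup: one table that matches and one that does not
$html = '<table class="wikitable sortable"><tr><th>Breed</th></tr></table>'
      . '<table class="infobox"><tr><td>x</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$tables = $xpath->query('//table[contains(@class, "wikitable") and contains(@class, "sortable")]');

// query() returns false for an invalid expression and an empty list for no match
if ($tables === false || $tables->length === 0) {
    exit("No matching table found; the page layout may have changed\n");
}

$breedTable = $tables->item(0);
echo $breedTable->getAttribute('class'), "\n";
```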

Iterate Through Rows

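Now we loop through the rows, skipping the header row:

// Loop through every row (<tr>) of the table
foreach ($xpath->query('.//tr', $table->item(0)) as $row) {

  // Header rows hold only <th> cells; require the four <td> columns we read
  if ($xpath->query('./td', $row)->length >= 4) {

    // extract data from columns
    ...

  }

}

We select only <tr> elements, so text nodes are never visited, and rows without data cells (such as the header) are skipped.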
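One detail is worth checking when walking a table: rendered wiki tables often wrap their rows in a <tbody> element, in which case the direct children of <table> are not the rows themselves. The sketch below, on a stand-in table with placeholder values, compares the direct children with the result of an XPath query for descendant rows.

```php
<?php
// Stand-in table with placeholder values, shaped like a rendered wiki table
$html = '<table class="wikitable sortable"><tbody>'
      . '<tr><th>Breed</th><th>Group</th></tr>'
      . '<tr><td>Breed A</td><td>Group X</td></tr>'
      . '<tr><td>Breed B</td><td>Group Y</td></tr>'
      . '</tbody></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$table = $xpath->query('//table')->item(0);

// Direct element children of <table>: the <tbody> wrapper, not the rows
$children = [];
foreach ($table->childNodes as $child) {
    if ($child->nodeType === XML_ELEMENT_NODE) {
        $children[] = $child->nodeName;
    }
}

// Descendant <tr> elements that contain at least one <td> (skips the header)
$dataRows = [];
foreach ($xpath->query('.//tr[td]', $table) as $row) {
    $dataRows[] = trim($xpath->query('./td', $row)->item(0)->textContent);
}

echo implode(',', $children), ' | ', implode(',', $dataRows), "\n";
// prints "tbody | Breed A,Breed B"
```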

Extract Column Data

Inside the loop, we extract the data from each column:

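// Find all data cells in the row (elements only, no whitespace text nodes)
$cells = $xpath->query('./td', $row);

// Extract info from cells
$name = trim($cells->item(0)->textContent);
$group = trim($cells->item(1)->textContent);

$localNameNode = $cells->item(2)->firstChild;
$localName = $localNameNode ? trim($localNameNode->textContent) : '';

// The <img> is usually nested inside a link, so search the whole cell
$imgNode = $xpath->query('.//img', $cells->item(3))->item(0);
$photograph = $imgNode ? $imgNode->getAttribute('src') : '';

// Wikimedia image URLs are typically protocol-relative ("//upload...")
if (strpos($photograph, '//') === 0) {
  $photograph = 'https:' . $photograph;
}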

We use textContent to get text and getAttribute() to get image src.
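The per-row extraction can also be packaged as a function, which makes it easy to try on a single row. This is a sketch under assumptions: extractRow() is a hypothetical helper, the four-column layout follows the description above, and the sample row uses placeholder values and a placeholder image URL.

```php
<?php
// Hypothetical helper: turn one <tr> into an associative array, or return
// null when the row does not have the four expected <td> cells.
function extractRow(DOMXPath $xpath, DOMNode $row): ?array
{
    $cells = $xpath->query('./td', $row);
    if ($cells->length < 4) {
        return null;
    }

    // Search the whole image cell, since <img> is usually nested in a link
    $img = $xpath->query('.//img', $cells->item(3))->item(0);

    return [
        'name'       => trim($cells->item(0)->textContent),
        'group'      => trim($cells->item(1)->textContent),
        'localName'  => trim($cells->item(2)->textContent),
        'photograph' => $img ? $img->getAttribute('src') : '',
    ];
}

// Placeholder row; the image URL is not a real file
$html = '<table><tr><td><a href="#">Breed A</a></td><td>Group X</td>'
      . '<td>Local A</td>'
      . '<td><span><a href="#"><img src="//upload.example.org/a.jpg"></a></span></td>'
      . '</tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$result = extractRow($xpath, $xpath->query('//tr')->item(0));
print_r($result);
```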

Download Images

To download the images:

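if ($photograph) {

  // Download image (file_get_contents() returns false on failure)
  $image = file_get_contents($photograph, false, $context);

  // Save image to file, creating the folder on first use
  if ($image !== false) {
    if (!is_dir('dog_images')) {
      mkdir('dog_images', 0777, true);
    }
    $imagePath = "dog_images/{$name}.jpg";
    file_put_contents($imagePath, $image);
  }

}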

We reuse the stream context to download the image and save it to a file.
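Two practical issues can come up here: image src values on Wikimedia pages are typically protocol-relative (they start with "//"), and a breed name may contain characters that are not safe in a file name. The helpers below are a hypothetical sketch of one way to handle both; the function names and the fallback to a .jpg extension are assumptions, not part of the original script.

```php
<?php
// Hypothetical helper: make a protocol-relative URL ("//host/path") absolute
function absoluteImageUrl(string $src): string
{
    return strncmp($src, '//', 2) === 0 ? 'https:' . $src : $src;
}

// Hypothetical helper: build a file name from the breed name and image URL
function imageFileName(string $name, string $url): string
{
    // Replace every run of characters other than letters, digits, "_" and "-"
    $base = trim(preg_replace('/[^A-Za-z0-9_-]+/', '_', $name), '_');

    // Take the extension from the URL path; fall back to jpg (an assumption)
    $path = (string) parse_url($url, PHP_URL_PATH);
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return ($base !== '' ? $base : 'image') . '.' . ($ext !== '' ? $ext : 'jpg');
}

echo absoluteImageUrl('//upload.example.org/a/b.png'), "\n";
// prints "https://upload.example.org/a/b.png"
echo imageFileName('St. Bernard / Alpine', 'https://example.org/p/Dog.PNG'), "\n";
// prints "St_Bernard_Alpine.png"
```

Note that this simple pattern also replaces accented letters, so names that differ only in such characters would map to the same file.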

Store Extracted Data

Finally, we store the extracted data in arrays:

$names[] = $name;
$groups[] = $group;
$localNames[] = $localName;
$photographs[] = $photograph;

The arrays can then be processed or printed.
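As one example of processing, the four parallel arrays can be combined into one record per breed and printed as JSON. The values below are placeholders standing in for what the loop would collect.

```php
<?php
// Placeholder values standing in for the data collected by the loop
$names       = ['Breed A', 'Breed B'];
$groups      = ['Group X', 'Group Y'];
$localNames  = ['Local A', ''];
$photographs = ['https://upload.example.org/a.jpg', ''];

// Combine the parallel arrays into one associative array per breed
$records = [];
foreach ($names as $i => $name) {
    $records[] = [
        'name'       => $name,
        'group'      => $groups[$i],
        'localName'  => $localNames[$i],
        'photograph' => $photographs[$i],
    ];
}

// Pretty-print as JSON, leaving the slashes in URLs unescaped
$json = json_encode($records, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
echo $json, "\n";
```

This relies on every row appending to all four arrays, so that index $i refers to the same breed in each of them.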

And that's it! Here is the full code:

// Full code

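// The DOM extension ships with PHP and is enabled by default; there is no file to include
if (!extension_loaded('dom')) {
  exit("The DOM extension is not enabled in this PHP installation\n");
}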

// URL to scrape
$url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';

// User agent header
$context = stream_context_create(array(
  'http' => array(
     'header' => "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
  )
));

// Get page HTML
$response = file_get_contents($url, false, $context);

// Load HTML into DOMDocument
$doc = new DOMDocument();
$doc->loadHTML($response);

// Initialize DOM XPath
$xpath = new DOMXPath($doc);

// Find table by class name
$table = $xpath->query('//table[contains(@class, "wikitable") and contains(@class, "sortable")]');

// Initialize arrays
$names = [];
$groups = [];
$localNames = [];
$photographs = [];

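// Loop through every row (<tr>) of the table
foreach ($xpath->query('.//tr', $table->item(0)) as $row) {

  // Get the data cells; header rows contain only <th> cells and are skipped
  $cells = $xpath->query('./td', $row);
  if ($cells->length >= 4) {

    // Extract column data
    $name = trim($cells->item(0)->textContent);
    $group = trim($cells->item(1)->textContent);

    $localNameNode = $cells->item(2)->firstChild;
    $localName = $localNameNode ? trim($localNameNode->textContent) : '';

    // The <img> is usually nested inside a link, so search the whole cell
    $imgNode = $xpath->query('.//img', $cells->item(3))->item(0);
    $photograph = $imgNode ? $imgNode->getAttribute('src') : '';

    // Wikimedia image URLs are typically protocol-relative ("//upload...")
    if (strpos($photograph, '//') === 0) {
      $photograph = 'https:' . $photograph;
    }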

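    // Download and save image
    if ($photograph) {

      $image = file_get_contents($photograph, false, $context);

      // Skip failed downloads; create the folder on first use
      if ($image !== false) {
        if (!is_dir('dog_images')) {
          mkdir('dog_images', 0777, true);
        }
        $imagePath = "dog_images/{$name}.jpg";
        file_put_contents($imagePath, $image);
      }

    }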

    // Store data
    $names[] = $name;
    $groups[] = $group;
    $localNames[] = $localName;
    $photographs[] = $photograph;

  }

}

This provides a complete PHP solution to scrape data and images from HTML tables. The same approach can be applied to extract data from many different websites.

While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

This allows scraping at scale without headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

With the power of Proxies API combined with PHP's DOM extension, you can scrape data at scale without getting blocked.
