Downloading Images from a Website with Python and BeautifulSoup

Oct 15, 2023 · 5 min read

In this article, we will learn how to use Python and the BeautifulSoup module to download all the images from a Wikipedia page.

---

Overview

The goal is to extract the names, breed groups, local names, and image URLs for all dog breeds listed on this Wikipedia page. We will store the image URLs, download the images and save them to a local folder.

Here are the key steps we will cover:

  1. Import required modules
  2. Send HTTP request to fetch the Wikipedia page
  3. Parse the page HTML using BeautifulSoup
  4. Find the table with dog breed data
  5. Iterate through the table rows
  6. Extract data from each column
  7. Download images and save locally
  8. Print/process extracted data

Let's go through each of these steps in detail.

Imports

We begin by importing the required modules:

import os
import requests
from bs4 import BeautifulSoup
  • os - provides functions to interact with the file system
  • requests - sends HTTP requests to URLs
  • BeautifulSoup - parses HTML and XML documents

Make sure you have these modules installed (e.g. pip install requests beautifulsoup4) before running the code.

Send HTTP Request

To download the web page containing the dog breed table, we need to send an HTTP GET request:

url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

response = requests.get(url, headers=headers)

We provide a User-Agent header to mimic a browser request. The requests.get() method returns a Response object containing the page content and other metadata.

Parse HTML with BeautifulSoup

To extract data from the page, we need to parse the HTML content. BeautifulSoup makes this easy:

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

A status_code of 200 means the request was successful. We pass response.text, which contains the raw HTML, to the BeautifulSoup constructor.
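Since BeautifulSoup parses any HTML string, this step can be tried offline with a small stand-in for response.text (the markup below is hypothetical, just to illustrate the constructor):

```python
from bs4 import BeautifulSoup

# A small HTML string stands in for response.text (hypothetical data).
html = "<html><body><h1>List of dog breeds</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.text)  # List of dog breeds
```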

Find Breed Table

The dog breed data we want is contained in a table with the classes wikitable sortable. We can filter on the class attribute to find it:

table = soup.find('table', {'class': 'wikitable sortable'})

This returns a BeautifulSoup Tag object for the <table> element.
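The same table can also be located with a CSS selector via select_one(); both calls resolve to the same Tag in the parsed tree. A minimal sketch using hypothetical stand-in markup:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the Wikipedia page (hypothetical markup).
html = '<table class="wikitable sortable"><tr><th>Breed</th></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# Attribute filter and CSS selector find the same element.
table = soup.find('table', {'class': 'wikitable sortable'})
same_table = soup.select_one('table.wikitable.sortable')

print(table is same_table)  # True
```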

Iterate Through Rows

Now we loop through the rows, skipping the header row:

for row in table.find_all('tr')[1:]:
    # extract data from columns
    ...

The find_all() method returns all rows as a list. We slice from index 1 to skip the header row.
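The slicing behavior is easy to verify on a toy table (hypothetical markup):

```python
from bs4 import BeautifulSoup

# Toy table with one header row and two data rows (hypothetical markup).
html = """
<table>
  <tr><th>Breed</th></tr>
  <tr><td>Akita</td></tr>
  <tr><td>Beagle</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = table.find_all("tr")[1:]  # skip the header row
print([row.td.text for row in rows])  # ['Akita', 'Beagle']
```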

Extract Column Data

Inside the loop, we extract the data from each column:

columns = row.find_all(['td', 'th'])

name = columns[0].find('a').text.strip()
group = columns[1].text.strip()

span_tag = columns[2].find('span')
local_name = span_tag.text.strip() if span_tag else ''

img_tag = columns[3].find('img')
photograph = img_tag['src'] if img_tag else ''

We find all the <td> and <th> elements and get the text from them. For the image URL, we check if an <img> tag exists and get its src attribute.

Download Images

To download the images:

if photograph:
    # Wikipedia image src values are often protocol-relative ("//..."),
    # so add a scheme before requesting the URL.
    if photograph.startswith('//'):
        photograph = 'https:' + photograph

    response = requests.get(photograph, headers=headers)

    if response.status_code == 200:
        image_filename = os.path.join('dog_images', f'{name}.jpg')
        with open(image_filename, 'wb') as img_file:
            img_file.write(response.content)

We send another GET request, this time to the image URL. If it succeeds, we construct a file path from the breed name and write the image bytes to that file.
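One refinement worth considering: the code above hard-codes a .jpg extension, but many Wikipedia images are PNG or SVG. A small helper (hypothetical, not part of the original script) can derive the extension from the URL path instead:

```python
import os
from urllib.parse import urlparse

def image_path(name, src, folder='dog_images'):
    # Take the real extension from the URL path; fall back to .jpg
    # when the path has none.
    ext = os.path.splitext(urlparse(src).path)[1] or '.jpg'
    return os.path.join(folder, f'{name}{ext}')

print(image_path('Akita', '//upload.wikimedia.org/thumb/Akita_inu.png'))
```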

Store Extracted Data

Finally, we store the extracted data in lists (initialized as empty lists before the loop):

names.append(name)
groups.append(group)
local_names.append(local_name)
photographs.append(photograph)

These lists can then be processed or printed as needed.
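Once the loop finishes, the parallel lists can be combined row by row, for example to write a CSV. A sketch using sample values standing in for the scraped data:

```python
import csv
import io

# Sample values standing in for the scraped lists (hypothetical data).
names = ['Affenpinscher']
groups = ['Toy']
local_names = ['Affenpinscher']
photographs = ['https://upload.wikimedia.org/example.jpg']

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['name', 'group', 'local_name', 'photograph'])
writer.writerows(zip(names, groups, local_names, photographs))

print(buf.getvalue())
```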

And that's it! Here is the full code:

# Full code

import os
import requests
from bs4 import BeautifulSoup

url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', {'class': 'wikitable sortable'})

    names = []
    groups = []
    local_names = []
    photographs = []

    os.makedirs('dog_images', exist_ok=True)

    for row in table.find_all('tr')[1:]:
        columns = row.find_all(['td', 'th'])

        name = columns[0].find('a').text.strip()
        group = columns[1].text.strip()

        span_tag = columns[2].find('span')
        local_name = span_tag.text.strip() if span_tag else ''

        img_tag = columns[3].find('img')
        photograph = img_tag['src'] if img_tag else ''

        if photograph:
            # Image src values are often protocol-relative ("//...")
            if photograph.startswith('//'):
                photograph = 'https:' + photograph

            img_response = requests.get(photograph, headers=headers)

            if img_response.status_code == 200:
                image_filename = os.path.join('dog_images', f'{name}.jpg')
                with open(image_filename, 'wb') as img_file:
                    img_file.write(img_response.content)

        names.append(name)
        groups.append(group)
        local_names.append(local_name)
        photographs.append(photograph)

This gives you a complete template for extracting data from HTML tables into structured lists and downloading the associated images. The same technique can be applied to many other websites.

While these examples are great for learning, scraping production sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

With the power of Proxies API combined with Python libraries like Beautiful Soup, you can scrape data at scale without getting blocked.
