Scraping All Images from a Website with R

The first step is to load the R libraries that we will need to perform the web scraping:

library(rvest)
library(httr)
library(stringr)

The key libraries are:

rvest: For parsing and extracting data from HTML and XML

httr: For sending HTTP requests to web pages

stringr: For handling strings

Defining the URL and Headers

Next we need to specify the URL of the web page that contains the images we want to scrape:

url <- 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'

We are scraping images of dog breeds from a Wikimedia Commons page.


When scraping web pages, it is good practice to define a custom user agent header. This helps simulate a real browser request so the server will respond properly:

headers <- c(
  `User-Agent` = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
)

Here we are setting a Chrome browser user agent.

Sending the HTTP Request

To download the web page content, we can send an HTTP GET request using the httr package:

response <- httr::GET(url, httr::add_headers(headers))

This will fetch the contents of the specified url and store the response in the response object.
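As a hedged variation, the same request can be wrapped in a small helper that also sets a timeout and passes the headers through the `.headers` argument of `add_headers()`. The `fetch_page` name and the 10-second timeout are illustrative assumptions, not part of the original script.

```r
library(httr)

# Hypothetical helper: GET a URL with custom headers and a timeout.
fetch_page <- function(url, headers, seconds = 10) {
  httr::GET(
    url,
    httr::add_headers(.headers = headers),  # named character vector of headers
    httr::timeout(seconds)                  # give up if the server is too slow
  )
}

headers <- c(`User-Agent` = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)')

# The network call is left commented so the sketch can be sourced offline:
# response <- fetch_page('https://commons.wikimedia.org/wiki/List_of_dog_breeds', headers)
```

The timeout keeps a slow server from stalling the whole script.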

Checking the Response Status

It's good practice to check that the request succeeded before trying to parse the response. We can check the status code:

if (httr::status_code(response) == 200) {

  # Request succeeded logic

} else {

  # Failed request handling

}

A status code of 200 means the request was successful. Codes in the 4xx and 5xx ranges indicate a client or server error.
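To make the status ranges concrete, here is a minimal sketch of a classifier for HTTP status codes. The `status_category` function is a hypothetical helper written for this illustration; httr itself offers `http_error(response)`, which is TRUE for codes of 400 and above.

```r
# Hypothetical helper: map an HTTP status code to a coarse category.
status_category <- function(code) {
  if (code >= 200 && code < 300) {
    'success'
  } else if (code >= 300 && code < 400) {
    'redirect'
  } else if (code >= 400 && code < 500) {
    'client error'
  } else if (code >= 500 && code < 600) {
    'server error'
  } else {
    'other'
  }
}

status_category(200)
status_category(404)
```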

Parsing the HTML

Since the request succeeded, we can parse the HTML content using rvest:

page <- read_html(httr::content(response, "text"))

The page object now contains the parsed HTML document.
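The parsing step can be tried offline on a small HTML string. The markup below is a made-up stand-in, not the real page, and the selectors are chosen only for this sketch.

```r
library(rvest)

# A tiny stand-in document (an assumption for illustration)
html <- '<html><body><h1>Dogs</h1><p class="intro">Hello</p></body></html>'
page <- read_html(html)

# Pull text out of the parsed document with CSS selectors
heading <- page %>% html_node('h1') %>% html_text()
intro   <- page %>% html_node('p.intro') %>% html_text()
```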

Finding the Data Table

Inspecting the page

Using the Chrome Inspect tool, you can see that the data is in a table element with the classes wikitable and sortable.

We can use XPath to find that table element:

table <- page %>%
  html_nodes(xpath = '//*[@class="wikitable sortable"]') %>%
  html_table()

Let's break this down:

html_nodes() finds all nodes matching the XPath selector

//*[@class="wikitable sortable"] selects elements whose class attribute is exactly "wikitable sortable"

html_table() converts each matched HTML table into a data frame and returns them as a list

The table data is now available as the first element of that list, table[[1]].
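The behaviour of html_table() is easy to check on a toy table. This sketch uses made-up markup with the same class attribute. It shows that the result is a list of data frames and that the header row becomes the column names rather than a data row.

```r
library(rvest)

# Made-up table standing in for the real page
html <- '<table class="wikitable sortable">
  <tr><th>Name</th><th>Group</th></tr>
  <tr><td>Akita</td><td>5</td></tr>
  <tr><td>Beagle</td><td>6</td></tr>
</table>'
page <- read_html(html)

tables <- page %>%
  html_nodes(xpath = '//*[@class="wikitable sortable"]') %>%
  html_table()

breeds <- tables[[1]]  # html_table() on a node set returns a list
n_rows <- nrow(breeds) # counts data rows only
col_names <- names(breeds)
```

A CSS selector such as 'table.wikitable.sortable' is a possible alternative that does not depend on the order of the class names.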

Initializing Data Storage

As we scrape data from the table, we need variables to accumulate the results:

names <- character()
groups <- character()
local_names <- character()
photographs <- character()

Empty vectors are created to store the dog name, breed group, local names, and image URLs as we extract them.
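Growing vectors with c() works for a table of this size. As an alternative sketch, the vectors can be allocated once when the number of rows is known and then combined into a data frame. The `records` list below is hypothetical sample data, not scraped output.

```r
# Hypothetical sample records standing in for scraped rows
records <- list(
  list(name = 'Akita', group = '5'),
  list(name = 'Beagle', group = '6')
)

# Allocate the result vectors once, then fill them by index
names  <- character(length(records))
groups <- character(length(records))
for (i in seq_along(records)) {
  names[i]  <- records[[i]]$name
  groups[i] <- records[[i]]$group
}

# Combine the columns into one data frame
breeds <- data.frame(name = names, group = groups, stringsAsFactors = FALSE)
```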

Iterating Through the Table Rows

To scrape the data from each row, we can iterate through the table:

for (i in seq_len(nrow(table[[1]]))) {

  row <- table[[1]][i, ]

  # Extract data for each dog breed

}

html_table() has already turned the header row into column names, so every row of the data frame is a breed. The loop visits each one in turn and stores the current row in row.
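A small aside on index-based loops in R, offered as an illustration rather than as part of the original script: a range written as 2:n counts downward when n is smaller than 2, while seq_len(n) yields an empty sequence when n is 0. The empty data frame below is a made-up stand-in for a table with no rows.

```r
# Made-up empty table
empty <- data.frame(name = character())

# 2:0 is the descending sequence 2, 1, 0, so a loop over it would run three times
colon_iterations <- length(2:nrow(empty))

# seq_len(0) is empty, so a loop over it would not run at all
seq_len_iterations <- length(seq_len(nrow(empty)))
```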

Extracting Data from Each Column

Now here is the most complex part - extracting each data field from the table columns:

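# Column 1: Name
name <- row[[1]]

# Column 2: Group
group <- row[[2]]

# html_table() keeps only text, so fetch the <tr> element behind this row
# (i + 1 because the first <tr> holds the column headers)
row_node <- html_nodes(page, xpath = '(//*[@class="wikitable sortable"])[1]//tr')[[i + 1]]

# Check column 3 for a <span> tag
span_tag <- html_nodes(row_node, xpath = './*[3]//span')
local_name <- if (length(span_tag) > 0) html_text(span_tag[[1]]) else ''

# Check column 4 for an <img>
img_tag <- html_nodes(row_node, xpath = './*[4]//img')
photograph <- if (length(img_tag) > 0) html_attr(img_tag[[1]], 'src', default = '') else ''

As you can see, the text columns and the markup columns need different handling. Let's break it down:

Name Column:

The name is directly in the text of column 1. We grab it with:

name <- row[[1]]

Group Column:

The group is also basic text, extracted by:

group <- row[[2]]

Local Name Column:

html_table() discards markup, so for the remaining columns we go back to the <tr> element behind the current row. The index is i + 1 because the first <tr> holds the column headers (this assumes a single header row):

row_node <- html_nodes(page, xpath = '(//*[@class="wikitable sortable"])[1]//tr')[[i + 1]]

We then check whether the third cell contains a <span> tag:

span_tag <- html_nodes(row_node, xpath = './*[3]//span')

If found, we extract its text:

local_name <- if (length(span_tag) > 0) html_text(span_tag[[1]]) else ''

Photograph Column:

Finally, for the photo we check if an image tag exists in the fourth cell:

img_tag <- html_nodes(row_node, xpath = './*[4]//img')

If yes, we grab its source URL attribute:

photograph <- if (length(img_tag) > 0) html_attr(img_tag[[1]], 'src', default = '') else ''

This logic handles rows where the span or the image is missing by falling back to an empty string.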
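The span and image checks can be exercised offline on a toy table. In this sketch the markup, the `extract_row` helper, and the example.org image address are all assumptions made for illustration. One row carries both a span and an image, and the other carries neither.

```r
library(rvest)

# Made-up table: row 2 has a span and an image, row 3 has neither
html <- '<table class="wikitable sortable">
  <tr><th>Name</th><th>Group</th><th>Local name</th><th>Photograph</th></tr>
  <tr><td>Akita</td><td>5</td><td><span>Akita Inu</span></td>
      <td><img src="//example.org/akita.jpg"></td></tr>
  <tr><td>Mystery</td><td>?</td><td></td><td></td></tr>
</table>'
page <- read_html(html)
rows <- html_nodes(page, 'tr')

# Hypothetical helper: read the span text and image source from one <tr>
extract_row <- function(row_node) {
  span_tag <- html_nodes(row_node, xpath = './*[3]//span')
  img_tag  <- html_nodes(row_node, xpath = './*[4]//img')
  list(
    local_name = if (length(span_tag) > 0) html_text(span_tag[[1]]) else '',
    photograph = if (length(img_tag) > 0) html_attr(img_tag[[1]], 'src') else ''
  )
}

with_media    <- extract_row(rows[[2]])
without_media <- extract_row(rows[[3]])
```

Working from the <tr> element rather than from the data frame is what makes the tags visible, since a data frame cell holds only text.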

Downloading and Saving Images

With the image URLs extracted, we can now download and save the photos:

if (photograph != '') {

  # Download image
  # Save to file

}

The code checks that we have a valid photograph URL before proceeding.

The complete download code is omitted here for brevity; it appears in the Full Code section.
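Two details are worth handling before saving a file, sketched here under stated assumptions. Image src values may be protocol-relative (starting with //), and breed names may contain characters that are unsafe in file names. The `image_path` helper, the fallback extension, and the example.org address are hypothetical.

```r
library(stringr)

# Hypothetical helper: build a safe local path for a breed image.
image_path <- function(name, image_url, dir = 'dog_images') {
  # Replace runs of characters that are unsafe in file names
  safe_name <- str_replace_all(name, '[^A-Za-z0-9_-]+', '_')
  ext <- tools::file_ext(image_url)
  if (!nzchar(ext)) ext <- 'jpg'  # fallback extension is an assumption
  file.path(dir, paste0(safe_name, '.', ext))
}

# Resolve a protocol-relative src value against the page URL
page_url  <- 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
image_url <- xml2::url_absolute('//example.org/thumb/akita.png', page_url)

path <- image_path('Akita / Akita Inu', image_url)

# Network step, left commented so the sketch runs offline:
# dir.create('dog_images', showWarnings = FALSE)
# resp <- httr::GET(image_url)
# if (httr::status_code(resp) == 200) writeBin(httr::content(resp, 'raw'), path)
```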

Printing the Extracted Data

Finally, to print out the scraped data:

for (i in seq_along(names)) {

  cat("Name:", names[i], "\n")
  cat("FCI Group:", groups[i], "\n")
  cat("Local Name:", local_names[i], "\n")
  cat("Photograph:", photographs[i], "\n")

  cat("\n")

}

This iterates through each record and prints the extracted fields.
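If the records are needed beyond the console, one option is to collect them in a data frame and write a CSV file. This is a sketch with made-up values and a temporary file path, not a step from the original script.

```r
# Made-up records standing in for the scraped vectors
breeds <- data.frame(
  name = c('Akita', 'Beagle'),
  group = c('5', '6'),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV file and read it back as text columns
out_file <- tempfile(fileext = '.csv')
write.csv(breeds, out_file, row.names = FALSE)
round_trip <- read.csv(out_file, colClasses = 'character')
```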

Handling Errors

The code also contains logic to handle errors:

} else {

  cat("Failed to retrieve the web page. Status code:", httr::status_code(response), "\n")

}

If the HTTP request failed, it prints an error message with the status code.
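The status check covers responses the server actually sent. A request can also fail before any response arrives, for example on a DNS error or a timeout, in which case httr::GET() raises an R error. A minimal sketch of guarding against that with tryCatch() follows. The `safe_get` helper and its `getter` argument are hypothetical, added so the behaviour can be tried without a network.

```r
# Hypothetical wrapper: return NULL instead of stopping when a request fails.
safe_get <- function(url, ..., getter = httr::GET) {
  tryCatch(
    getter(url, ...),
    error = function(e) {
      message('Request failed: ', conditionMessage(e))
      NULL
    }
  )
}

# Simulated failure and success, using stand-in getter functions
failed <- safe_get('https://example.com', getter = function(url, ...) stop('boom'))
worked <- safe_get('https://example.com', getter = function(url, ...) 'ok')
```

In a real run the default getter, httr::GET, would be used and a NULL result would signal that the page could not be fetched.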

Full Code

# Load the required libraries
library(rvest)
library(httr)
library(stringr)

# URL of the Wikimedia Commons page
url <- 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'

# Define a user-agent header to simulate a browser request
headers <- c(
  `User-Agent` = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
)

# Send an HTTP GET request to the URL with the headers
response <- httr::GET(url, httr::add_headers(headers))

# Check if the request was successful (status code 200)
if (httr::status_code(response) == 200) {
  # Parse the HTML content of the page
  page <- read_html(httr::content(response, "text", encoding = "UTF-8"))

  # Find the table with class 'wikitable sortable'
  table <- page %>%
    html_nodes(xpath = '//*[@class="wikitable sortable"]') %>%
    html_table()

  # html_table() keeps only text, so also collect the <tr> elements of that table
  row_nodes <- html_nodes(page, xpath = '(//*[@class="wikitable sortable"])[1]//tr')

  # Initialize vectors to store the data
  names <- character()
  groups <- character()
  local_names <- character()
  photographs <- character()

  # Create a folder to save the images
  dir.create('dog_images', showWarnings = FALSE)

  # Iterate through the data rows (html_table() already used the header row as column names)
  for (i in seq_len(nrow(table[[1]]))) {
    row <- table[[1]][i, ]
    row_node <- row_nodes[[i + 1]]  # the first <tr> is the header row

    # Extract the plain-text columns
    name <- as.character(row[[1]])
    group <- as.character(row[[2]])

    # Check if the third column contains a span element
    span_tag <- html_nodes(row_node, xpath = './*[3]//span')
    local_name <- if (length(span_tag) > 0) html_text(span_tag[[1]]) else ''

    # Check for the existence of an image tag within the fourth column
    img_tag <- html_nodes(row_node, xpath = './*[4]//img')
    photograph <- if (length(img_tag) > 0) html_attr(img_tag[[1]], 'src', default = '') else ''

    # Download the image and save it to the folder
    if (photograph != '') {
      # Resolve protocol-relative or relative src values against the page URL
      image_url <- xml2::url_absolute(photograph, url)
      image_response <- httr::GET(image_url, httr::add_headers(headers))
      if (httr::status_code(image_response) == 200) {
        # Replace characters that are unsafe in file names
        safe_name <- str_replace_all(name, '[^A-Za-z0-9_-]+', '_')
        image_filename <- file.path('dog_images', paste0(safe_name, '.jpg'))
        writeBin(httr::content(image_response, "raw"), image_filename)
      }
    }

    # Append data to the respective vectors
    names <- c(names, name)
    groups <- c(groups, group)
    local_names <- c(local_names, local_name)
    photographs <- c(photographs, photograph)
  }

  # Print or process the extracted data as needed
  for (i in seq_along(names)) {
    cat("Name:", names[i], "\n")
    cat("FCI Group:", groups[i], "\n")
    cat("Local Name:", local_names[i], "\n")
    cat("Photograph:", photographs[i], "\n")
    cat("\n")
  }

} else {
  cat("Failed to retrieve the web page. Status code:", httr::status_code(response), "\n")
}

In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell that the requests come from the same browser.
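A minimal sketch of User-Agent rotation picks one string at random for each request. The pool below is a small hand-written sample and the `random_headers` helper is hypothetical. A real project would maintain a larger, current list.

```r
# Sample pool of User-Agent strings (illustrative, not exhaustive)
user_agents <- c(
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'
)

# Hypothetical helper: build a headers vector with a randomly chosen User-Agent
random_headers <- function(pool = user_agents) {
  c(`User-Agent` = sample(pool, 1))
}

headers <- random_headers()
# response <- httr::GET(url, httr::add_headers(.headers = headers))
```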

Going a step further, the server can simply block your IP address, regardless of any other tricks. This is a bummer, and it is where many web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with the running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it is hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high-speed rotating proxies located all over the world, automatic IP rotation, automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions), and automatic CAPTCHA solving technology, hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
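The same call could be made from R along these lines. This is a sketch that only mirrors the curl command above: the endpoint and the key and url parameters are taken from it, API_KEY is a placeholder, and the request itself is left commented out.

```r
library(httr)

# Build the request URL; modify_url() percent-encodes the query values
proxied_url <- modify_url(
  'http://api.proxiesapi.com/',
  query = list(key = 'API_KEY', url = 'https://example.com')
)

# Decode the query again to confirm the target URL survived the encoding
decoded <- parse_url(proxied_url)$query

# response <- GET(proxied_url)
```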

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
