Here is a step-by-step guide to scraping a website for images using Elixir. This article will explain the code for scraping dog breed information and images from a Wikipedia page, to help beginners understand the key concepts.

This is the page we are talking about…

Overview

The goal of this scraper is to extract dog breed names, details like categories, local names, and images from a Wikipedia page listing hundreds of breeds.

It will:

Retrieve the web page content
Parse the page to extract information
Download all images of dog breeds
Save images and print extracted data

The code uses the Elixir programming language along with several libraries:

:httpc (the HTTP client that ships with Erlang/OTP as part of inets) - for making HTTP requests to get web page content

URI - to parse the page URL

Floki - to parse HTML and extract data

File - to save images and data to files

Retrieving the Web Page

The first step is to retrieve the content of the web page that contains the data we want to scrape.

The get_page/2 function makes an HTTP GET request to the URL using :httpc, the HTTP client that ships with Erlang/OTP:

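defp get_page(url, headers) do
  # :httpc expects the URL and the headers as charlists, not Elixir strings
  headers = for {name, value} <- headers, do: {to_charlist(name), to_charlist(value)}

  # body_format: :binary returns the body as a binary instead of a charlist
  case :httpc.request(:get, {to_charlist(url), headers}, [], body_format: :binary) do
    {:ok, {{_, 200, _}, _, body}} ->
      {:ok, body}
    {:ok, {{_, status_code, _}, _, _}} ->
      {:error, status_code}
    {:error, reason} ->
      {:error, reason}
  end
end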

This makes the request, checks the status code, and if a 200 OK response is received, returns the page body.

The headers contain a user agent string to identify the scraper to the server.
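:httpc takes header names and values as charlists, while headers in Elixir code are usually written as ordinary strings. A minimal sketch of the conversion, assuming a hypothetical helper module named HeaderHelper and a made-up User-Agent value (neither is part of the original scraper):

```elixir
defmodule HeaderHelper do
  # Hypothetical helper: converts {"Name", "value"} string pairs into the
  # charlist pairs that :httpc.request/4 accepts as request headers.
  def to_httpc(headers) do
    Enum.map(headers, fn {name, value} ->
      {String.to_charlist(name), String.to_charlist(value)}
    end)
  end
end

# The User-Agent value here is illustrative only.
headers = [{"User-Agent", "DogBreedsScraper/1.0"}]
IO.inspect(HeaderHelper.to_httpc(headers))
```

Keeping the conversion in one place means the rest of the scraper can keep using plain strings.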

The start function calls this getter, handling any errors:

case get_page(@url, headers) do
  {:ok, body} ->
    # parse page
  {:error, reason} ->
    IO.puts("Failed to retrieve the web page: #{inspect(reason)}")
end

So at this point if successful, the body contains the full HTML of the web page.

Parsing the Page

Inspecting the page

If you open the page with Chrome's Inspect tool, you can see that the data sits in a table element with the classes wikitable and sortable.

Selecting the Table

We use the Floki.find/2 function to locate this table:

table = Floki.find(document, "table.wikitable.sortable")

The table variable now contains the HTML representation of the table we want to scrape data from.
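To see the selector at work without touching the network for the page itself, here is a small sketch that runs it against an inline HTML fragment. The sample markup is invented for illustration, and the script assumes Mix.install/1 can fetch Floki:

```elixir
# Assumes network access so Mix.install/1 can fetch Floki.
Mix.install([{:floki, "~> 0.36"}])

# Invented sample markup: one matching table and one that should be ignored.
html = """
<table class="wikitable sortable">
  <tr><th>Name</th><th>Group</th></tr>
  <tr><td><a href="/wiki/Akita">Akita</a></td><td>5</td></tr>
</table>
<table class="other"><tr><td>ignored</td></tr></table>
"""

{:ok, document} = Floki.parse_document(html)
table = Floki.find(document, "table.wikitable.sortable")

# Only the first table carries both classes, so only its cells are found.
cells = table |> Floki.find("td") |> Enum.map(&Floki.text/1)
IO.inspect(cells)
```

The compound selector "table.wikitable.sortable" matches only elements that carry both classes, which is why the second table is skipped.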

Iterating Through Rows

Inside the table, data is organized in rows, with each row containing information about a specific dog breed. We use a loop to iterate through these rows and extract relevant data:

for row <- tl(Floki.find(table, "tr")) do
  # Extract data from the row
end

The tl/1 function is used to skip the table header row, as it doesn't contain the data we need.
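One thing to keep in mind is that tl/1 raises an ArgumentError when the list is empty, which would happen if the selector matched no rows at all. A small sketch of a more forgiving alternative, using plain strings as stand-ins for the parsed rows:

```elixir
# Plain strings stand in for the parsed <tr> nodes.
rows = ["header", "Akita", "Beagle"]

# tl/1 and Enum.drop/2 agree when at least one row is present.
IO.inspect(tl(rows))
IO.inspect(Enum.drop(rows, 1))

# With no rows at all, Enum.drop/2 returns [] where tl/1 would raise.
no_rows = Enum.filter(rows, fn _ -> false end)
IO.inspect(Enum.drop(no_rows, 1))
```

Whether a missing table should fail loudly or be skipped quietly is a design choice; Enum.drop/2 is the quiet option.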

Extracting Data from Columns

Within each row, data is stored in columns. We use Floki.find/2 to locate and extract data from these columns. Each row contains four columns: Name, FCI Group, Local Name, and Photograph.

columns = Floki.find(row, "td,th")

name = Floki.find(columns |> hd, "a") |> hd |> Floki.text() |> String.trim()
group = columns |> Enum.at(1) |> Floki.text() |> String.trim()
# Match on [span | _] so a cell with several spans does not crash the case
local_name = case Floki.find(columns |> Enum.at(2), "span") do
  [] -> ""
  [span | _] -> Floki.text(span) |> String.trim()
end

# Floki.attribute/2 returns a list of values, so take the first src if there is one
img_tag = Floki.find(columns |> Enum.at(3), "img")
photograph = case Floki.attribute(img_tag, "src") do
  [] -> ""
  [src | _] -> src
end

Here's what each extraction step does:

name: Extracts the breed's name by locating an anchor tag in the first column and trimming any extra spaces.

group: Extracts the FCI Group from the second column and trims extra spaces.

local_name: Extracts the Local Name from the third column (if available) by targeting a span tag.

photograph: Extracts the Photograph URL from the fourth column by finding an img tag and retrieving its "src" attribute.
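The src value taken from an img tag is not always a complete URL. It may be protocol-relative (starting with "//") or relative to the site root, and an HTTP client needs an absolute URL. A minimal sketch of resolving it against the page URL, assuming a hypothetical helper module named ImageUrl and made-up image paths:

```elixir
defmodule ImageUrl do
  # Hypothetical helper: resolves an image src against the page URL so that
  # protocol-relative ("//host/path") and root-relative ("/path") values
  # become absolute URLs. Absolute values pass through unchanged.
  def absolute(src, page_url) do
    page_url |> URI.merge(src) |> URI.to_string()
  end
end

page = "https://commons.wikimedia.org/wiki/List_of_dog_breeds"

# The image paths below are invented for illustration.
IO.puts(ImageUrl.absolute("//upload.wikimedia.org/thumb/akita.jpg", page))
IO.puts(ImageUrl.absolute("/static/logo.png", page))
```

URI.merge/2 is part of Elixir's standard library, so no extra dependency is needed for this step.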

Downloading Images

After extracting image sources, we can download the actual image data:

defp download_image(photograph, name) do
  case get_image(photograph) do
    {:ok, image_data} ->
      image_filename = "dog_images/#{name}.jpg"
      File.write(image_filename, image_data)
    _ ->
      IO.puts("Failed to download image: #{photograph}")
  end
end

defp get_image(url) do
  # make HTTP request
  case :httpc.request(...) do
    {:ok, {{_, 200, _},_ , body}} ->
      {:ok, body}
    _ ->
      {:error, "Failed to download image"}
  end
end

We reuse :httpc to fetch each image by URL.

If successful, we write the image binary data to a file using the breed's name and the File module.
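Breed names can contain spaces or slashes, and not every image is a JPEG, so it may be worth building the file path a little more carefully. A sketch under those assumptions, using a hypothetical helper module named ImageFile (not part of the original scraper):

```elixir
defmodule ImageFile do
  # Hypothetical helper: builds a file path from a breed name and image URL.
  def path(name, url, dir \\ "dog_images") do
    # Replace runs of anything other than letters, digits, "_" and "-"
    # so a name such as "A/B" cannot point into another directory.
    safe = String.replace(name, ~r/[^\p{L}\p{N}_-]+/u, "_")

    # Keep the extension from the URL path, falling back to ".jpg".
    ext =
      case url |> URI.parse() |> Map.get(:path) |> to_string() |> Path.extname() do
        "" -> ".jpg"
        found -> String.downcase(found)
      end

    Path.join(dir, safe <> ext)
  end
end

IO.puts(ImageFile.path("Bracco Italiano", "https://example.com/img/Bracco.PNG"))
```

The fallback to ".jpg" mirrors the original code, which always uses that extension.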

The save_images/1 function coordinates calling this for every image URL extracted earlier.

Saving and Printing Output

Finally, save_images/1 stores images while print_data/1 prints out all extracted breed data for debugging and verification.

The full code can be seen below, showing how these pieces fit together into a complete scraper:

# When running this file as a script, fetch Floki first
Mix.install([{:floki, "~> 0.36"}])

defmodule DogBreedsScraper do
  @url "https://commons.wikimedia.org/wiki/List_of_dog_breeds"
  @user_agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

  def start do
    # :httpc lives in the :inets application; :ssl is needed for HTTPS URLs
    {:ok, _} = Application.ensure_all_started(:inets)
    {:ok, _} = Application.ensure_all_started(:ssl)

    headers = [{"User-Agent", @user_agent}]

    case get_page(@url, headers) do
      {:ok, body} ->
        case parse_page(body) do
          {:ok, data} ->
            save_images(data)
            print_data(data)
          {:error, reason} ->
            IO.puts("Failed to parse the page: #{reason}")
        end
      {:error, reason} ->
        # reason may be a status code or an error tuple, so inspect it
        IO.puts("Failed to retrieve the web page: #{inspect(reason)}")
    end
  end

  defp get_page(url, headers) do
    # :httpc expects the URL and the headers as charlists, not Elixir strings
    headers = for {name, value} <- headers, do: {to_charlist(name), to_charlist(value)}

    case :httpc.request(:get, {to_charlist(url), headers}, [], body_format: :binary) do
      {:ok, {{_, 200, _}, _, body}} ->
        {:ok, body}
      {:ok, {{_, status_code, _}, _, _}} ->
        {:error, status_code}
      {:error, reason} ->
        {:error, reason}
    end
  end

  defp parse_page(body) do
    case Floki.parse_document(body) do
      {:ok, document} ->
        table = Floki.find(document, "table.wikitable.sortable")

        # Variables rebound inside a loop body do not leak out in Elixir,
        # so the comprehension returns one tuple per complete row instead.
        data =
          for row <- tl(Floki.find(table, "tr")),
              columns = Floki.find(row, "td,th"),
              length(columns) == 4 do
            # Prefer the link text; fall back to the whole cell if there is no link
            name =
              case Floki.find(hd(columns), "a") do
                [link | _] -> link |> Floki.text() |> String.trim()
                [] -> columns |> hd() |> Floki.text() |> String.trim()
              end

            group = columns |> Enum.at(1) |> Floki.text() |> String.trim()

            local_name =
              case Floki.find(Enum.at(columns, 2), "span") do
                [] -> ""
                [span | _] -> span |> Floki.text() |> String.trim()
              end

            # Floki.attribute/2 returns a list; resolve the first src against
            # the page URL because it may be protocol-relative ("//host/...")
            photograph =
              case columns |> Enum.at(3) |> Floki.find("img") |> Floki.attribute("src") do
                [] -> ""
                [src | _] -> @url |> URI.merge(src) |> URI.to_string()
              end

            {name, group, local_name, photograph}
          end

        {:ok, data}
      _ ->
        {:error, "Failed to parse the page"}
    end
  end

  defp download_image(photograph, name) do
    case get_image(photograph) do
      {:ok, image_data} ->
        # Slashes in a breed name would otherwise be read as directories
        image_filename = "dog_images/#{String.replace(name, "/", "-")}.jpg"
        File.write(image_filename, image_data)
      _ ->
        IO.puts("Failed to download image: #{photograph}")
    end
  end

  defp get_image(url) do
    case get_page(url, [{"User-Agent", @user_agent}]) do
      {:ok, body} ->
        {:ok, body}
      _ ->
        {:error, "Failed to download image"}
    end
  end

  defp save_images(data) do
    File.mkdir_p!("dog_images")

    # Rows without a photograph are skipped
    for {name, _group, _local_name, photograph} <- data, photograph != "" do
      download_image(photograph, name)
    end
  end

  defp print_data(data) do
    Enum.each(data, fn {name, group, local_name, photograph} ->
      IO.puts("Name: #{name}")
      IO.puts("FCI Group: #{group}")
      IO.puts("Local Name: #{local_name}")
      IO.puts("Photograph: #{photograph}")
      IO.puts("")
    end)
  end
end

# Start the scraping process
DogBreedsScraper.start()

In more advanced implementations you will even need to rotate the User-Agent string so the website can't tell it's the same browser!
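A minimal sketch of what that rotation could look like, assuming a hypothetical module named UserAgents with a small hard-coded pool. The strings are illustrative only; a real pool would need current browser values:

```elixir
defmodule UserAgents do
  # Illustrative values only; swap in up-to-date browser strings.
  @pool [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15"
  ]

  def pool, do: @pool

  # Returns a header list with one User-Agent picked at random per call.
  def random_headers do
    [{"User-Agent", Enum.random(@pool)}]
  end
end

IO.inspect(UserAgents.random_headers())
```

Calling random_headers/0 before each request gives every request its own pick from the pool.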

If we get a little bit more advanced, you will realize that the server can simply block your IP ignoring all your other tricks. This is a bummer and this is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high speed rotating proxies located all over the world,

With our automatic IP rotation

With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)

With our automatic CAPTCHA solving technology,

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
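The same request URL can be built from Elixir. The sketch below takes the endpoint and the key and url parameter names from the curl line above; the ProxyUrl module name is hypothetical, and URL-encoding the target is an assumption based on standard query-string handling:

```elixir
defmodule ProxyUrl do
  @endpoint "http://api.proxiesapi.com/"

  # Builds the request URL from the curl example, URL-encoding the target.
  # A list of tuples keeps the parameters in a fixed order.
  def build(api_key, target_url) do
    @endpoint <> "?" <> URI.encode_query([{"key", api_key}, {"url", target_url}])
  end
end

IO.puts(ProxyUrl.build("API_KEY", "https://example.com"))
```

The resulting string can be passed to the scraper's existing page-fetching function in place of the direct page URL.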

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
