Downloading Images from a Website with VB and HtmlAgilityPack

Oct 15, 2023 · 5 min read

In this article, we will learn how to use Visual Basic and the HtmlAgilityPack library to download all the images from a Wikipedia page.

---

Overview

The goal is to extract the name, breed group, local name, and image URL for every dog breed listed in the table on the List of dog breeds page. We will store this data in lists, download each image, and save it to a local folder.

Here are the key steps we will cover:

  1. Add required references
  2. Send HTTP request to fetch the Wikipedia page
  3. Parse the page HTML using HtmlAgilityPack
  4. Find the table with dog breed data using an XPath expression
  5. Iterate through the table rows
  6. Extract data from each column
  7. Download images and save locally
  8. Print/process extracted data

Let's go through each of these steps in detail.

References

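We need these namespace imports:

    Imports System.Net
    Imports System.IO
    Imports HtmlAgilityPack

  • System.Net - Provides WebClient
  • System.IO - Provides File, Path, and Directory for saving the images
  • HtmlAgilityPack - Parses HTML (install it as a NuGet package)

Send HTTP Request
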
To download the web page:

    Dim webClient As New WebClient()
    webClient.Headers.Add("User-Agent", "VBScraper")

    Dim html As String = webClient.DownloadString(
        "https://commons.wikimedia.org/wiki/List_of_dog_breeds")

We use WebClient and set a custom User-Agent header, because many sites, Wikimedia included, expect clients to identify themselves.
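
The request throws a WebException if the network is unavailable or the server returns an error status. The following is a minimal sketch of guarding against that, assuming a hypothetical helper named FetchHtml that is not part of the original code:

```vbnet
Imports System
Imports System.Net

Module FetchExample
    ' Hypothetical helper: returns the page HTML, or Nothing if the request fails.
    Function FetchHtml(url As String) As String
        Using client As New WebClient()
            client.Headers.Add("User-Agent", "VBScraper")
            Try
                Return client.DownloadString(url)
            Catch ex As WebException
                ' Covers DNS failures, timeouts, and HTTP error responses.
                Console.WriteLine("Request failed: " & ex.Message)
                Return Nothing
            End Try
        End Using
    End Function

    Sub Main()
        Dim html = FetchHtml("https://commons.wikimedia.org/wiki/List_of_dog_breeds")
        If html Is Nothing Then Return
        Console.WriteLine($"Downloaded {html.Length} characters")
    End Sub
End Module
```

Returning Nothing keeps the sketch short; a real scraper might retry or rethrow instead.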

Parse HTML

To parse the HTML:

    Dim htmlDoc As New HtmlDocument()
    htmlDoc.LoadHtml(html)

HtmlDocument holds the parsed document. Its DocumentNode property is the root node that we query in the following steps.

Find Breed Table

We use an XPath expression to find the table element:

    Dim table = htmlDoc.DocumentNode.SelectSingleNode(
        "//table[contains(@class, 'wikitable') and contains(@class, 'sortable')]")

This selects the first table element whose class attribute contains both wikitable and sortable. SelectSingleNode returns Nothing when no element matches.
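
Because the lookup can come back empty, it is worth checking the result before using it. The sketch below runs the same XPath expression against a small made-up HTML string (the sample markup is an assumption, not the real page) and handles the missing-table case:

```vbnet
Imports System
Imports HtmlAgilityPack

Module FindTableExample
    Sub Main()
        ' Tiny stand-in document; the real page is much larger.
        Dim sample As String =
            "<html><body>" &
            "<table class='infobox'><tr><td>ignored</td></tr></table>" &
            "<table class='wikitable sortable'><tr><td>kept</td></tr></table>" &
            "</body></html>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(sample)

        ' Same expression as in the article: both class names must be present.
        Dim table = doc.DocumentNode.SelectSingleNode(
            "//table[contains(@class, 'wikitable') and contains(@class, 'sortable')]")

        If table Is Nothing Then
            Console.WriteLine("No matching table found; the page layout may have changed.")
        Else
            Console.WriteLine(table.InnerText.Trim())
        End If
    End Sub
End Module
```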

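Iterate Through Rows

We can loop through the rows like this:

    For Each row As HtmlNode In table.SelectNodes(".//tr")

      ' Extract data

    Next

We loop through each tr element within the table. The .//tr path matches rows at any depth, so it also works when the rows are wrapped in a tbody element.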

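Extract Column Data

Inside the loop, we extract the column data:

    ' Header rows contain only th cells, so skip any row without a td
    If row.SelectSingleNode("td") Is Nothing Then Continue For

    Dim cells = row.SelectNodes("td|th")
    If cells.Count < 4 Then Continue For

    Dim link = cells(0).SelectSingleNode(".//a")
    Dim name = If(link IsNot Nothing, link.InnerText, cells(0).InnerText).Trim()
    Dim group = cells(1).InnerText.Trim()

    Dim localNameNode = cells(2).SelectSingleNode(".//span")
    Dim localName = If(localNameNode IsNot Nothing, localNameNode.InnerText.Trim(), "")

    Dim img = cells(3).SelectSingleNode(".//img")
    Dim photograph = If(img IsNot Nothing, img.GetAttributeValue("src", ""), "")

We use InnerText for text and GetAttributeValue() for attributes. Note that "td|th" is the XPath union of both cell types; the CSS-style "td, th" is not valid XPath. The .// prefix finds a, span, and img elements even when they are nested inside other tags.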
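
To see the extraction step in isolation, here is a small sketch that runs it on a two-row table. The sample markup, the breed name, and the upload.example.org URL are all made up for illustration; the real page's columns may differ:

```vbnet
Imports System
Imports HtmlAgilityPack

Module ExtractRowsExample
    Sub Main()
        ' Made-up table: one header row (th cells) and one data row (td cells).
        Dim sample As String =
            "<table class='wikitable sortable'><tbody>" &
            "<tr><th>Breed</th><th>Group</th><th>Local name</th><th>Photograph</th></tr>" &
            "<tr><td><a href='/wiki/Example'>Example Hound</a></td><td>Hound</td>" &
            "<td><span>Exemplum</span></td>" &
            "<td><a href='#'><img src='//upload.example.org/hound.jpg'></a></td></tr>" &
            "</tbody></table>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(sample)

        For Each row As HtmlNode In doc.DocumentNode.SelectNodes("//table//tr")
            ' The header row has no td children, so it is skipped here.
            If row.SelectSingleNode("td") Is Nothing Then Continue For

            Dim cells = row.SelectNodes("td|th")
            If cells.Count < 4 Then Continue For

            Dim link = cells(0).SelectSingleNode(".//a")
            Dim name = If(link IsNot Nothing, link.InnerText, cells(0).InnerText).Trim()
            Dim img = cells(3).SelectSingleNode(".//img")
            Dim src = If(img IsNot Nothing, img.GetAttributeValue("src", ""), "")

            Console.WriteLine($"{name} | {cells(1).InnerText.Trim()} | {src}")
        Next
    End Sub
End Module
```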

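Download Images

To download and save images:

    If photograph <> "" Then

      ' src values may be protocol-relative ("//upload..."), so add a scheme
      If photograph.StartsWith("//") Then photograph = "https:" & photograph

      ' Set the header again in case WebClient dropped it after the first request
      webClient.Headers.Set("User-Agent", "VBScraper")
      Dim imageData = webClient.DownloadData(photograph)

      Directory.CreateDirectory("dog_images")
      Dim imagePath = Path.Combine("dog_images", name & ".jpg")
      File.WriteAllBytes(imagePath, imageData)

    End If

We reuse the WebClient to download the image bytes and save them to a file. File, Path, and Directory come from System.IO, and Directory.CreateDirectory does nothing if the folder already exists. The .jpg extension is an assumption, since some images may be in other formats, and a breed name containing characters such as / would need to be sanitized before it can be used as a file name.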
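
Two details in the download step are easy to get wrong: the image URL may lack a scheme, and the breed name may not be a valid file name. The sketch below shows one way to handle both, using two hypothetical helpers (ToAbsoluteUrl and SafeFileName) and a made-up URL:

```vbnet
Imports System
Imports System.IO

Module ImagePathExample
    ' Hypothetical helper: turns a protocol-relative src ("//host/path") into an absolute URL.
    Function ToAbsoluteUrl(src As String) As String
        If src.StartsWith("//", StringComparison.Ordinal) Then Return "https:" & src
        Return src
    End Function

    ' Hypothetical helper: replaces characters that are not allowed in file names.
    Function SafeFileName(name As String) As String
        For Each c As Char In Path.GetInvalidFileNameChars()
            name = name.Replace(c, "_"c)
        Next
        Return name
    End Function

    Sub Main()
        Dim url = ToAbsoluteUrl("//upload.example.org/a/b/Example_Hound.jpg")

        ' Keep the extension from the URL instead of assuming .jpg.
        Dim extension = Path.GetExtension(New Uri(url).AbsolutePath)
        Dim fileName = SafeFileName("Example/Hound") & extension

        Console.WriteLine(url)
        Console.WriteLine(Path.Combine("dog_images", fileName))
    End Sub
End Module
```

The set of invalid characters depends on the operating system, so the sanitized name can differ between Windows and Linux.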

Store Extracted Data

We can store the extracted data in lists, which are declared before the loop (see the full code below):

    names.Add(name)
    groups.Add(group)
    localNames.Add(localName)
    photos.Add(photograph)

The lists can then be processed as needed.
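
Four parallel lists only stay aligned if every row adds to all of them. As an alternative sketch, the same data can live in a single list of records. The DogBreed class and its property names are hypothetical, and the values are made up:

```vbnet
Imports System
Imports System.Collections.Generic

Module BreedRecordExample
    ' Hypothetical record type; the properties mirror the four lists above.
    Class DogBreed
        Public Property Name As String
        Public Property BreedGroup As String
        Public Property LocalName As String
        Public Property PhotoUrl As String
    End Class

    Sub Main()
        Dim breeds As New List(Of DogBreed)

        ' Inside the row loop, one Add call would replace the four list updates.
        breeds.Add(New DogBreed With {
            .Name = "Example Hound",
            .BreedGroup = "Hound",
            .LocalName = "",
            .PhotoUrl = "https://upload.example.org/hound.jpg"
        })

        For Each breed In breeds
            Console.WriteLine($"{breed.Name} ({breed.BreedGroup}): {breed.PhotoUrl}")
        Next
    End Sub
End Module
```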

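And that's it! Here is the full code:

    ' Imports
    Imports System
    Imports System.Collections.Generic
    Imports System.IO
    Imports System.Net
    Imports HtmlAgilityPack

    Module DogBreedScraper

      Sub Main()

        ' Lists to store data
        Dim names As New List(Of String)
        Dim groups As New List(Of String)
        Dim localNames As New List(Of String)
        Dim photos As New List(Of String)

        ' HTTP request
        Dim webClient As New WebClient()
        webClient.Headers.Add("User-Agent", "VBScraper")

        Dim html As String = webClient.DownloadString(
            "https://commons.wikimedia.org/wiki/List_of_dog_breeds"
        )

        ' Parse HTML
        Dim htmlDoc As New HtmlDocument()
        htmlDoc.LoadHtml(html)

        ' Find table
        Dim table = htmlDoc.DocumentNode.SelectSingleNode(
            "//table[contains(@class, 'wikitable') and contains(@class, 'sortable')]"
        )
        If table Is Nothing Then
          Console.WriteLine("Breed table not found")
          Return
        End If

        ' Create the output folder once (no effect if it already exists)
        Directory.CreateDirectory("dog_images")

        ' Iterate rows
        For Each row As HtmlNode In table.SelectNodes(".//tr")

          ' Skip header rows, which contain only th cells
          If row.SelectSingleNode("td") Is Nothing Then Continue For

          ' Get cells
          Dim cells = row.SelectNodes("td|th")
          If cells.Count < 4 Then Continue For

          ' Extract data
          Dim link = cells(0).SelectSingleNode(".//a")
          Dim name = If(link IsNot Nothing, link.InnerText, cells(0).InnerText).Trim()
          Dim group = cells(1).InnerText.Trim()

          Dim localNameNode = cells(2).SelectSingleNode(".//span")
          Dim localName = If(localNameNode IsNot Nothing, localNameNode.InnerText.Trim(), "")

          Dim img = cells(3).SelectSingleNode(".//img")
          Dim photograph = If(img IsNot Nothing, img.GetAttributeValue("src", ""), "")

          ' Download image
          If photograph <> "" Then

            ' src values may be protocol-relative ("//upload..."), so add a scheme
            If photograph.StartsWith("//") Then photograph = "https:" & photograph

            ' Set the header again in case WebClient dropped it after the first request
            webClient.Headers.Set("User-Agent", "VBScraper")
            Dim imageData = webClient.DownloadData(photograph)

            ' The .jpg extension is an assumption about the image format
            Dim imagePath = Path.Combine("dog_images", name & ".jpg")
            File.WriteAllBytes(imagePath, imageData)

          End If

          ' Store data
          names.Add(name)
          groups.Add(group)
          localNames.Add(localName)
          photos.Add(photograph)

        Next

      End Sub

    End Module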

This provides a complete VB.NET solution that uses HtmlAgilityPack to scrape data and images from an HTML table. The same approach applies to many other websites, although the XPath expressions and column positions must be adapted to each page.

While these examples are useful for learning, scraping production sites can run into CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key.

Combined with an HTML parser such as HtmlAgilityPack, Proxies API lets you scrape data at scale with less risk of getting blocked.
