Scraping Multiple Pages in Visual Basic with HtmlAgilityPack and HttpClient

Oct 15, 2023 · 4 min read

Web scraping lets you extract data from websites programmatically. A site's content is often spread across several pages, so you need to scrape all of them to gather complete information. In this article, we'll see how to scrape multiple pages in Visual Basic using HtmlAgilityPack and HttpClient.

Prerequisites

To follow along, you'll need:

  • Basic VB knowledge
  • Visual Studio installed
  • The HtmlAgilityPack NuGet package (HttpClient is part of .NET itself, in the System.Net.Http namespace)

Import Libraries

    We'll need the following imports:

    Imports System.Net.Http
    Imports HtmlAgilityPack
    

    Define Base URL

    We'll scrape a blog - https://copyblogger.com/blog/. The page URLs follow a pattern:

    https://copyblogger.com/blog/
    https://copyblogger.com/blog/page/2/
    https://copyblogger.com/blog/page/3/
    

    Let's define the base URL pattern:

    Dim baseUrl As String = "https://copyblogger.com/blog/page/{0}/"
    

    The {0} allows us to insert the page number.
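
As a minimal sketch of how the placeholder gets filled, the helper below builds the URL for a given page number. BuildPageUrl is a hypothetical name chosen for illustration, not part of any library. It returns the plain blog URL for page 1, since that is the form shown in the URL list above; whether /page/1/ also resolves on the site is not something this sketch assumes.

```vb
Imports System

Module UrlDemo

    Private Const FirstPageUrl As String = "https://copyblogger.com/blog/"
    Private Const PagedUrlPattern As String = "https://copyblogger.com/blog/page/{0}/"

    ' Hypothetical helper: page 1 has no /page/N/ suffix in the URL list above
    Function BuildPageUrl(pageNum As Integer) As String
        If pageNum <= 1 Then Return FirstPageUrl
        ' String.Format replaces {0} with the first argument after the pattern
        Return String.Format(PagedUrlPattern, pageNum)
    End Function

    Sub Main()
        For pageNum As Integer = 1 To 3
            Console.WriteLine(BuildPageUrl(pageNum))
        Next
    End Sub

End Module
```

Running this prints the three URLs from the list above, one per line.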

    Specify Number of Pages

    Next, we'll specify how many pages to scrape. Let's scrape the first 5 pages:

    Dim numPages As Integer = 5
    

    Loop Through Pages

    We can now loop from 1 to numPages and construct the URL for each page:

    For pageNum As Integer = 1 To numPages
    
      ' Construct page URL
      Dim url = String.Format(baseUrl, pageNum)
    
      ' Code to scrape each page
    
    Next
    

    Send Request and Parse HTML

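    Inside the loop, we'll send a GET request and parse the HTML. Because the code uses Await, it must run inside a method marked Async. Create the HttpClient once, before the loop, and reuse it for every page:

    ' Create once, before the For loop
    Dim client As New HttpClient()
    
    ' Inside the loop
    Dim response As HttpResponseMessage = Await client.GetAsync(url)
    
    If response.IsSuccessStatusCode Then
    
      Dim htmlDoc As New HtmlDocument()
      htmlDoc.LoadHtml(Await response.Content.ReadAsStringAsync())
    
    End If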
    

    This gives us an HTML document to extract data from.
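
Requests can also fail outright, for example on a network error or a timeout. The sketch below wraps the request step in a small helper that returns Nothing instead of throwing. FetchHtmlAsync is a hypothetical name, and the 30-second timeout is an assumed value, not something the site requires.

```vb
Imports System
Imports System.Net.Http
Imports System.Threading.Tasks

Module FetchDemo

    ' One shared client for all requests; the timeout value is an assumption
    Private ReadOnly client As New HttpClient() With {.Timeout = TimeSpan.FromSeconds(30)}

    ' Hypothetical helper: returns the page HTML, or Nothing if the request fails
    Async Function FetchHtmlAsync(url As String) As Task(Of String)
        Try
            Using response As HttpResponseMessage = Await client.GetAsync(url)
                If Not response.IsSuccessStatusCode Then
                    Console.WriteLine("Skipping " & url & ": HTTP " & CInt(response.StatusCode).ToString())
                    Return Nothing
                End If
                Return Await response.Content.ReadAsStringAsync()
            End Using
        Catch ex As HttpRequestException
            ' Network-level failure such as DNS errors or a refused connection
            Console.WriteLine("Request failed for " & url & ": " & ex.Message)
            Return Nothing
        Catch ex As TaskCanceledException
            ' HttpClient signals a timeout by cancelling the request task
            Console.WriteLine("Request timed out for " & url)
            Return Nothing
        End Try
    End Function

    Sub Main()
        ' Main is not Async, so it blocks until the request completes
        Dim html = FetchHtmlAsync("https://copyblogger.com/blog/").GetAwaiter().GetResult()
        Console.WriteLine(If(html Is Nothing, "No content", html.Length.ToString() & " characters"))
    End Sub

End Module
```

The returned string can be passed straight to HtmlDocument.LoadHtml, and a Nothing result tells the loop to skip that page.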

    Extract Data

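    Now within the loop we can use XPath queries to extract data from each page. SelectNodes and SelectSingleNode return Nothing when nothing matches, so we check for that before reading values. The article link is stored in a variable named link, because url already holds the page URL in the enclosing loop:

    Dim articles = htmlDoc.DocumentNode.SelectNodes("//article")
    
    If articles IsNot Nothing Then
    
      For Each article As HtmlNode In articles
    
        ' Extract data from article (?. yields Nothing if the node is missing)
        Dim title = article.SelectSingleNode("./h2[@class='entry-title']")?.InnerText
        Dim link = article.SelectSingleNode("./a[@class='entry-title-link']")?.GetAttributeValue("href", "")
        Dim author = article.SelectSingleNode("./div[@class='post-author']/a")?.InnerText
    
        ' Print extracted data
        Console.WriteLine("Title: " & title)
        Console.WriteLine("URL: " & link)
        Console.WriteLine("Author: " & author)
    
      Next
    
    End If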
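
To try the extraction step without hitting the network, here is a small sketch that runs XPath queries against an inline HTML string. The markup is invented for illustration and may not match the real page structure, and TextOrEmpty is a hypothetical helper. It uses .// so the query searches all descendants of each article, and HtmlEntity.DeEntitize to decode entities such as &amp;amp; that InnerText leaves as-is.

```vb
Imports System
Imports HtmlAgilityPack

Module ExtractDemo

    ' Hypothetical helper: trimmed, entity-decoded text, or "" when the node is missing
    Function TextOrEmpty(node As HtmlNode) As String
        If node Is Nothing Then Return ""
        Return HtmlEntity.DeEntitize(node.InnerText).Trim()
    End Function

    Sub Main()
        ' Assumed markup for illustration only; the real page may differ
        Dim html As String =
            "<article><h2 class='entry-title'>" &
            "<a class='entry-title-link' href='/post-1/'>Tips &amp; Tricks</a>" &
            "</h2></article>" &
            "<article><p>No heading here</p></article>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing, not an empty list, when nothing matches
        Dim articles = doc.DocumentNode.SelectNodes("//article")
        If articles Is Nothing Then Return

        For Each article As HtmlNode In articles
            ' .// searches all descendants of the article, not only direct children
            Dim title = TextOrEmpty(article.SelectSingleNode(".//h2[@class='entry-title']"))
            Dim linkNode = article.SelectSingleNode(".//a[@class='entry-title-link']")
            Dim link = If(linkNode Is Nothing, "", linkNode.GetAttributeValue("href", ""))
            Console.WriteLine("Title: '" & title & "', URL: '" & link & "'")
        Next
    End Sub

End Module
```

The second article has no matching heading or link, so it shows how the missing-node path behaves instead of raising a NullReferenceException.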
    

    Full Code

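    Our full code to scrape 5 pages is below. Await is only valid inside an Async method, so the scraping logic lives in an Async Function that Sub Main waits on:

    Imports System.Collections.Generic
    Imports System.Net.Http
    Imports System.Threading.Tasks
    Imports HtmlAgilityPack
    
    Module Scraper
    
        ' Reuse one HttpClient for every request
        Private ReadOnly client As New HttpClient()
    
        Sub Main()
            ' Main is not Async, so it blocks until the scraping task finishes
            ScrapeAsync().GetAwaiter().GetResult()
        End Sub
    
        Private Async Function ScrapeAsync() As Task
    
            Dim baseUrl As String = "https://copyblogger.com/blog/page/{0}/"
            Dim numPages As Integer = 5
    
            For pageNum As Integer = 1 To numPages
    
                Dim url = String.Format(baseUrl, pageNum)
                Dim response As HttpResponseMessage = Await client.GetAsync(url)
    
                ' Skip pages that do not return a success status
                If Not response.IsSuccessStatusCode Then Continue For
    
                Dim htmlDoc As New HtmlDocument()
                htmlDoc.LoadHtml(Await response.Content.ReadAsStringAsync())
    
                ' SelectNodes returns Nothing when no node matches
                Dim articles = htmlDoc.DocumentNode.SelectNodes("//article")
                If articles Is Nothing Then Continue For
    
                For Each article As HtmlNode In articles
    
                    ' ?. yields Nothing instead of throwing when a node is missing
                    Dim title = article.SelectSingleNode("./h2[@class='entry-title']")?.InnerText
                    Dim link = article.SelectSingleNode("./a[@class='entry-title-link']")?.GetAttributeValue("href", "")
                    Dim author = article.SelectSingleNode("./div[@class='post-author']/a")?.InnerText
    
                    Dim categories As New List(Of String)
                    Dim categoryNodes = article.SelectNodes("./div[@class='entry-categories']/a")
                    If categoryNodes IsNot Nothing Then
                        For Each node As HtmlNode In categoryNodes
                            categories.Add(node.InnerText)
                        Next
                    End If
    
                    Console.WriteLine("Title: " & title)
                    Console.WriteLine("URL: " & link)
                    Console.WriteLine("Author: " & author)
                    Console.WriteLine("Categories: " & String.Join(", ", categories))
    
                Next
    
            Next
    
        End Function
    
    End Module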

    This allows us to scrape and extract data from multiple pages sequentially in VB.NET. The code can be extended to scrape any number of pages.
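
One possible extension, sketched below, is to stop when the site runs out of pages instead of hard-coding the count. This assumes the site answers with a non-success status (such as 404) past the last page, which the article does not confirm. CountAvailablePagesAsync is a hypothetical helper, and the one-second pause and the cap of 50 pages are assumed values.

```vb
Imports System
Imports System.Net.Http
Imports System.Threading.Tasks

Module CrawlDemo

    Private ReadOnly client As New HttpClient()

    ' Hypothetical helper: fetches pages until one fails or maxPages is reached
    Async Function CountAvailablePagesAsync(baseUrl As String, maxPages As Integer) As Task(Of Integer)
        Dim fetched As Integer = 0

        For pageNum As Integer = 1 To maxPages
            Using response = Await client.GetAsync(String.Format(baseUrl, pageNum))
                ' Assumption: a non-success status means there are no more pages
                If Not response.IsSuccessStatusCode Then Exit For
                fetched += 1
                ' Parse the response and extract data here
            End Using

            ' Assumed one-second pause so requests are spaced out
            Await Task.Delay(TimeSpan.FromSeconds(1))
        Next

        Return fetched
    End Function

    Sub Main()
        Dim pages = CountAvailablePagesAsync("https://copyblogger.com/blog/page/{0}/", 50).GetAwaiter().GetResult()
        Console.WriteLine("Fetched " & pages.ToString() & " pages")
    End Sub

End Module
```

The maxPages cap keeps the loop bounded even if the site never returns an error status. Network exceptions are not caught here, so a dropped connection would still end the run.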

    Summary

  • Use a base URL pattern with a {0} placeholder
  • Loop through pages with a For loop
  • Construct each page URL
  • Send request and parse HTML
  • Extract data using XPath queries
  • Print or store scraped data

    Web scraping enables collecting large datasets programmatically. With the techniques here, you can scrape and extract information from multiple pages of a website in Visual Basic.

    While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

    Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

    This allows scraping at scale without headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

    With Proxies API combined with .NET libraries like HtmlAgilityPack, you can scrape data at scale with less risk of getting blocked.
