The Complete HTML Agility Pack Cheat Sheet in VB

Oct 31, 2023 · 3 min read

HTML Agility Pack is an HTML parser for .NET. It allows easy manipulation and data extraction from HTML documents.

Getting Started

Install NuGet package:

Install-Package HtmlAgilityPack

Load HTML:

Dim html As String = "<html>...</html>"
Dim doc As HtmlDocument = New HtmlDocument()
doc.LoadHtml(html)

Select nodes:

Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div")

Get text:

Dim text As String = doc.DocumentNode.InnerText
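
The steps above fit together roughly as follows. This is a minimal sketch, not a definitive implementation: the module name GettingStarted and the helper DivTexts are hypothetical names chosen for illustration. It also guards against SelectNodes returning Nothing when no node matches.

```vb
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module GettingStarted
    ' Hypothetical helper: returns the trimmed text of every <div>,
    ' or an empty list when the document contains none.
    Function DivTexts(html As String) As List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim result As New List(Of String)
        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div")
        If nodes Is Nothing Then Return result

        For Each node As HtmlNode In nodes
            result.Add(node.InnerText.Trim())
        Next
        Return result
    End Function

    Sub Main()
        For Each t As String In DivTexts("<div>One</div><div> Two </div>")
            Console.WriteLine(t)
        Next
    End Sub
End Module
```

Checking for Nothing before looping avoids a NullReferenceException on pages that lack the element you expect.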

Selecting Nodes

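By CSS class (there is no built-in CSS selector engine, so express the class test in XPath):

doc.DocumentNode.SelectNodes("//*[contains(@class, 'header')]")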

By XPath:

doc.DocumentNode.SelectNodes("//table")

By tag name (anywhere in the document):

doc.DocumentNode.SelectNodes("//img")

By id:

doc.GetElementbyId("header")

Lazy enumeration (LINQ-friendly):

Dim divs = doc.DocumentNode.Descendants("div")
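
A short sketch of selecting by class, tag name and id in one place. It assumes that class matching has to be written in XPath, since SelectNodes takes XPath expressions. The module name SelectingNodes and the helper CountByClass are hypothetical.

```vb
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module SelectingNodes
    ' Hypothetical helper: counts elements whose class attribute contains
    ' the given class as a whole token ("header" must not match "headers").
    Function CountByClass(doc As HtmlDocument, className As String) As Integer
        Dim xpath As String =
            "//*[contains(concat(' ', normalize-space(@class), ' '), ' " & className & " ')]"
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes(xpath)
        Return If(nodes Is Nothing, 0, nodes.Count)
    End Function

    Sub Main()
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<div id='top' class='header big'><img src='a.png'><img src='b.png'></div>" &
                     "<p class='headers'>x</p>")

        Console.WriteLine(CountByClass(doc, "header"))                  ' class token match
        Console.WriteLine(doc.DocumentNode.Descendants("img").Count())  ' by tag name
        Console.WriteLine(doc.GetElementbyId("top").Name)               ' by id
    End Sub
End Module
```

The padded concat/normalize-space pattern is the usual XPath 1.0 way to match a single class token; a plain contains(@class, 'header') would also match "headers".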

Querying & Extracting

Get attribute:

Dim href As String = node.GetAttributeValue("href", "")

Get text:

Dim text As String = node.InnerText

Get HTML:

Dim html As String = node.OuterHtml

Find parent and ancestors:

Dim parent As HtmlNode = node.ParentNode
Dim ancestors = node.Ancestors()

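Evaluate an XPath expression (for example a count) through a navigator:

Dim linkCount As Double = CDbl(doc.DocumentNode.CreateNavigator().Evaluate("count(//a)"))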
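
The extraction calls above can be combined into a small link extractor. This is a sketch under stated assumptions: the module name ExtractLinks and the helpers Links and CountAnchors are hypothetical, and XPath functions such as count() are assumed to be evaluated through the XPathNavigator returned by CreateNavigator().

```vb
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module ExtractLinks
    ' Hypothetical helper: returns (href, text) pairs for every <a> with an href.
    Function Links(html As String) As List(Of KeyValuePair(Of String, String))
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim result As New List(Of KeyValuePair(Of String, String))
        Dim anchors As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        If anchors Is Nothing Then Return result

        For Each a As HtmlNode In anchors
            Dim href As String = a.GetAttributeValue("href", "")
            ' InnerText keeps entities such as &amp; so decode them for display.
            Dim label As String = HtmlEntity.DeEntitize(a.InnerText).Trim()
            result.Add(New KeyValuePair(Of String, String)(href, label))
        Next
        Return result
    End Function

    ' Hypothetical helper: XPath count() yields a Double boxed as Object.
    Function CountAnchors(html As String) As Integer
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        Return CInt(CDbl(doc.DocumentNode.CreateNavigator().Evaluate("count(//a)")))
    End Function

    Sub Main()
        Dim html As String = "<a href='/a'>Tom &amp; Jerry</a><a>no href</a>"
        For Each pair In Links(html)
            Console.WriteLine(pair.Value & " -> " & pair.Key)
        Next
        Console.WriteLine(CountAnchors(html))
    End Sub
End Module
```

Passing a default to GetAttributeValue means a missing attribute gives an empty string rather than an exception.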

Manipulation

Add node:

doc.DocumentNode.AppendChild(HtmlNode.CreateNode("<p>Hello</p>"))

Update text (InnerText is read-only, so assign encoded text to InnerHtml):

node.InnerHtml = HtmlDocument.HtmlEncode("New text")

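Update HTML (OuterHtml is read-only, so replace the node in its parent):

node.ParentNode.ReplaceChild(HtmlNode.CreateNode("<div>New HTML</div>"), node)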

Remove node:

node.Remove()

Set class (overwrites any existing class value):

node.SetAttributeValue("class", "blue")
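
A small end-to-end edit, as a hedged sketch. It assumes InnerText and OuterHtml expose no setter, so text goes in through InnerHtml and whole elements are swapped with ReplaceChild. The module name EditNodes and the helper Rewrite are hypothetical.

```vb
Imports System
Imports HtmlAgilityPack

Module EditNodes
    ' Hypothetical helper: rewrites the first <p> and replaces the first <span>.
    ' Assumes the input contains at least one <p> and one <span>.
    Function Rewrite(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim p As HtmlNode = doc.DocumentNode.SelectSingleNode("//p")
        ' Encode the text so "<" is stored as an entity, not parsed as markup.
        p.InnerHtml = HtmlDocument.HtmlEncode("1 < 2")
        p.SetAttributeValue("class", "blue")

        Dim span As HtmlNode = doc.DocumentNode.SelectSingleNode("//span")
        ' ReplaceChild takes the new node first, then the node being replaced.
        span.ParentNode.ReplaceChild(HtmlNode.CreateNode("<b>bold</b>"), span)

        Return doc.DocumentNode.OuterHtml
    End Function

    Sub Main()
        Console.WriteLine(Rewrite("<p>old</p><span>x</span>"))
    End Sub
End Module
```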

Parsing HTML

From string:

doc.LoadHtml(htmlString)

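From URL (HtmlDocument.Load expects a file path; use HtmlWeb to fetch a page):

Dim web As New HtmlWeb()
doc = web.Load(url)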

From file:

doc.Load(filename)

Auto-close tags left open at the end of the document:

doc.OptionAutoCloseOnEnd = True
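
The loading options above, sketched with an explicit encoding. The module name LoadSources and the helpers TitleFromStream and TitleFromUrl are hypothetical. TitleFromUrl assumes the page is fetched with HtmlWeb and is not called in Main because it needs network access.

```vb
Imports System
Imports System.IO
Imports System.Text
Imports HtmlAgilityPack

Module LoadSources
    ' Hypothetical helper: parses HTML from any stream with a known encoding.
    Function TitleFromStream(stream As Stream, enc As Encoding) As String
        Dim doc As New HtmlDocument()
        doc.Load(stream, enc)
        Dim title As HtmlNode = doc.DocumentNode.SelectSingleNode("//title")
        Return If(title Is Nothing, "", title.InnerText.Trim())
    End Function

    ' Hypothetical helper: downloads and parses a page in one call.
    Function TitleFromUrl(url As String) As String
        Dim doc As HtmlDocument = New HtmlWeb().Load(url)
        Dim title As HtmlNode = doc.DocumentNode.SelectSingleNode("//title")
        Return If(title Is Nothing, "", title.InnerText.Trim())
    End Function

    Sub Main()
        Dim bytes As Byte() = Encoding.UTF8.GetBytes("<html><head><title> Demo </title></head></html>")
        Using ms As New MemoryStream(bytes)
            Console.WriteLine(TitleFromStream(ms, Encoding.UTF8))
        End Using
    End Sub
End Module
```

Wrapping the stream in Using disposes it once parsing is done.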

Tips

  • Select nodes with XPath (CSS selectors need an add-on package)
  • Use InnerText for text
  • OuterHtml for full HTML
  • AppendChild to add nodes
  • Enable OptionAutoCloseOnEnd
Example

    ' The XML literal is an XElement, so convert it to a String before parsing.
    Dim html As String = <html>
                             <body>
                                 <h1>Title</h1>
                                 <p>Hello World!</p>
                             </body>
                         </html>.ToString()
    
    Dim doc As HtmlDocument = New HtmlDocument()
    doc.LoadHtml(html)
    
    Dim title As String = doc.DocumentNode.SelectSingleNode("//h1").InnerText
    ' Title
    
    Dim text As String = doc.DocumentNode.SelectSingleNode("//p").InnerText
    ' Hello World!
    

Advanced Querying

XPath Axes

  • ancestor:: - selects all ancestors (parent, grandparent, etc)
  • descendant:: - selects all descendants (children, grandchildren, etc)
  • following-sibling:: - selects all siblings after the current node
Query by Node Type

    doc.DocumentNode.SelectNodes("//*[self::p or self::div]")
    

Predicates

    //div[@class='header']
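
The axes and predicates above can be tried from a context node. A minimal sketch, assuming relative XPath expressions are evaluated from the node they are called on. The module name AxesDemo and the helper SiblingParagraphCount are hypothetical.

```vb
Imports System
Imports HtmlAgilityPack

Module AxesDemo
    ' Hypothetical helper: counts the <p> siblings that follow the first <h1>.
    Function SiblingParagraphCount(doc As HtmlDocument) As Integer
        Dim h1 As HtmlNode = doc.DocumentNode.SelectSingleNode("//h1")
        If h1 Is Nothing Then Return 0
        Dim siblings As HtmlNodeCollection = h1.SelectNodes("following-sibling::p")
        Return If(siblings Is Nothing, 0, siblings.Count)
    End Function

    Sub Main()
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<div class='header'><h1>Title</h1><p>First</p><p>Second</p></div>")

        Console.WriteLine(SiblingParagraphCount(doc))

        ' ancestor:: walks upward; the predicate keeps only the matching div.
        Dim h1 As HtmlNode = doc.DocumentNode.SelectSingleNode("//h1")
        Dim container As HtmlNode = h1.SelectSingleNode("ancestor::div[@class='header']")
        Console.WriteLine(container.Name)

        ' A positional predicate picks the second <p> child of the header div.
        Dim second As HtmlNode = doc.DocumentNode.SelectSingleNode("//div[@class='header']/p[2]")
        Console.WriteLine(second.InnerText)
    End Sub
End Module
```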
    

Advanced Manipulation

Insert Nodes

    ' Call InsertBefore/InsertAfter on the parent of the reference node.
    refNode.ParentNode.InsertBefore(newNode, refNode)
    refNode.ParentNode.InsertAfter(newNode, refNode)
    

Clone Nodes

    Dim clone As HtmlNode = node.Clone()
    

Move & Remove Nodes

    node.Remove()
    refNode.ParentNode.InsertBefore(node, refNode)
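
Moving and cloning can be sketched on a small list. The module name MoveNodes and the helper Reorder are hypothetical, and the input is assumed to contain a ul with at least two li children. InsertBefore is called on the parent of the reference node.

```vb
Imports System
Imports HtmlAgilityPack

Module MoveNodes
    ' Hypothetical helper: moves the last <li> to the front of its list,
    ' then appends a copy of it at the end. Returns the list's text.
    Function Reorder(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim list As HtmlNode = doc.DocumentNode.SelectSingleNode("//ul")
        Dim first As HtmlNode = list.SelectSingleNode("li[1]")
        Dim last As HtmlNode = list.SelectSingleNode("li[last()]")

        last.Remove()                   ' detach from the current position
        list.InsertBefore(last, first)  ' re-attach before the first item
        list.AppendChild(last.Clone())  ' a clone is a separate node

        Return list.InnerText
    End Function

    Sub Main()
        Console.WriteLine(Reorder("<ul><li>a</li><li>b</li><li>c</li></ul>"))
    End Sub
End Module
```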
    

Handling Documents

Loading

    doc = New HtmlWeb().Load(url)   ' fetch from a URL
    doc.LoadHtml(htmlString)
    doc.Load(stream)
    doc.Load(textReader)
    

Saving

    doc.Save(filename)
    

Options

    doc.OptionOutputAsXml = True
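
Saving does not have to target a file. A minimal sketch, assuming the Save overload that accepts a TextWriter, which captures the XML-style output in memory. The module name SaveDemo and the helper ToXmlString are hypothetical.

```vb
Imports System
Imports System.IO
Imports HtmlAgilityPack

Module SaveDemo
    ' Hypothetical helper: parses HTML and returns it serialized with
    ' OptionOutputAsXml switched on.
    Function ToXmlString(html As String) As String
        Dim doc As New HtmlDocument()
        doc.OptionOutputAsXml = True
        doc.LoadHtml(html)

        Using writer As New StringWriter()
            doc.Save(writer)
            Return writer.ToString()
        End Using
    End Function

    Sub Main()
        Console.WriteLine(ToXmlString("<p>line one<br>line two</p>"))
    End Sub
End Module
```

Writing to a StringWriter is handy for inspecting what an option changes before committing the output to disk.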
    

Working with Fragments

    doc.LoadHtml(htmlFragment)
    Dim div As HtmlNode = doc.CreateElement("div")
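
A fragment can be extended with nodes created by the document itself. A hedged sketch: the module name FragmentDemo and the helper BuildFragment are hypothetical, and the id value "note" is only an example.

```vb
Imports System
Imports HtmlAgilityPack

Module FragmentDemo
    ' Hypothetical helper: loads a fragment, then appends a new <div>
    ' built with CreateElement and CreateTextNode.
    Function BuildFragment(fragment As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(fragment)   ' no <html> or <body> wrapper is required

        Dim div As HtmlNode = doc.CreateElement("div")
        div.SetAttributeValue("id", "note")
        div.AppendChild(doc.CreateTextNode("Added"))
        doc.DocumentNode.AppendChild(div)

        Return doc.DocumentNode.OuterHtml
    End Function

    Sub Main()
        Console.WriteLine(BuildFragment("<p>Existing</p>"))
    End Sub
End Module
```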
    

Best Practices

  • Reuse HtmlDocument instances if possible for better performance
  • HtmlDocument is not IDisposable; dispose the streams and readers you load it from instead
  • Avoid excessive XPath queries - cache result sets
  • Use for web scraping to avoid overhead of full browser load
Additional Tips

    doc.OptionFixNestedTags = True
    Dim detected As System.Text.Encoding = doc.DetectEncoding(stream)
    ' AngleSharp can be used alongside HTML Agility Pack
    ' Works on .NET Framework and .NET Core
    
