The Ultimate Cheat Sheet for HtmlAgilityPack in CSharp

HtmlAgilityPack is a .NET library for parsing and manipulating HTML documents, including malformed markup. This cheat sheet collects the most common operations for quick reference.

Installation

PM> Install-Package HtmlAgilityPack

Loading HTML

From string:

var doc = new HtmlDocument();
doc.LoadHtml("<html>...</html>");

From file:

doc.Load("page.html");

From stream:

using(var fs = File.OpenRead("page.html")) {
  doc.Load(fs);
}

From web (doc.Load reads local files and streams; HtmlWeb fetches over HTTP):

doc = new HtmlWeb().Load("http://example.com");

Custom options:

doc.OptionFixNestedTags = true; // set options before calling Load or LoadHtml

Helper method:

private static HtmlDocument LoadHtml(string html) {
  var doc = new HtmlDocument();
  doc.LoadHtml(html);
  return doc;
}

Selecting Nodes

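All elements of a type (SelectNodes takes XPath, not CSS selectors; CSS selectors need an add-on package such as Fizzler.Systems.HtmlAgilityPack):

var paras = doc.DocumentNode
              .SelectNodes("//p");

By XPath path:

var items = doc.DocumentNode
             .SelectNodes("//ul/li"); // null if nothing matches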

Get single element by ID:

var content = doc.GetElementbyId("content");

Get elements by tag name:

var divs = doc.DocumentNode.Descendants("div");

Evaluate XPath (for expressions that return a number, string, or boolean):

var xpath = "count(//div/p)";
var count = (double)doc.CreateNavigator().Evaluate(xpath);
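
The XPath selection calls return null rather than an empty collection when nothing matches, which is easy to miss in a quick reference. The following is a minimal sketch, assuming a small inline document; the class name SelectExample and the sample markup are illustrative only.

```csharp
using System;
using HtmlAgilityPack;

class SelectExample
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<ul><li>One</li><li>Two</li></ul>");

        // SelectNodes returns null (not an empty collection) when nothing matches,
        // so guard before iterating.
        var items = doc.DocumentNode.SelectNodes("//ul/li");
        if (items != null)
        {
            foreach (var item in items)
            {
                Console.WriteLine(item.InnerText); // prints "One", then "Two"
            }
        }

        // SelectSingleNode also returns null when there is no match.
        var missing = doc.DocumentNode.SelectSingleNode("//table");
        Console.WriteLine(missing == null ? "no table found" : missing.Name);
    }
}
```

Descendants("li") is an alternative that always returns an enumerable, so it needs no null check.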

Looping Nodes

For each loop:

foreach(var item in items) {
  // ...
}

For loop:

for(int i = 0; i < items.Count; i++) {
  var item = items[i];
}

While loop:

int i = 0;
while(i < nodes.Count) {
  var node = nodes[i++];
  // ...
}

Modifying Nodes

Get attribute value:

var cls = el.GetAttributeValue("class", null);

Set attribute value:

el.SetAttributeValue("class", "blue");

Get inner text:

var text = el.InnerText;

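Set inner text (InnerText is read-only, so replace the children with a text node):

el.RemoveAllChildren();
el.AppendChild(doc.CreateTextNode("Hello World"));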

Get inner HTML:

var html = el.InnerHtml;

Set inner HTML:

el.InnerHtml = "<strong>Hello</strong>";
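
To show how the attribute and content members fit together, here is a small sketch that reads an attribute with a fallback, changes it, and replaces the element's markup. The sample HTML and the class name ModifyExample are assumptions made for illustration.

```csharp
using System;
using HtmlAgilityPack;

class ModifyExample
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id=\"content\"><p class=\"intro\">Hi</p></div>");

        var p = doc.DocumentNode.SelectSingleNode("//p");

        // The second argument is the value returned when the attribute is absent.
        string cls = p.GetAttributeValue("class", "");
        string title = p.GetAttributeValue("title", "(none)");
        Console.WriteLine(cls);   // prints "intro"
        Console.WriteLine(title); // prints "(none)"

        // SetAttributeValue adds the attribute or overwrites an existing one.
        p.SetAttributeValue("class", "blue");

        // Assigning InnerHtml reparses the string and replaces the children.
        p.InnerHtml = "<strong>Hello</strong>";

        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```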

Creating Nodes

Create element:

var el = doc.CreateElement("p");

Create text node:

var text = doc.CreateTextNode("Hello");

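Create a node from HTML:

var node = HtmlNode.CreateNode("<b>Hi!</b>");

Create several nodes from HTML (HtmlDocument has no fragment API, so parse into a scratch document):

var fragDoc = new HtmlDocument();
fragDoc.LoadHtml("<b>Hi!</b> <i>there</i>");
var frag = fragDoc.DocumentNode.ChildNodes;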
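
Created nodes are detached until they are attached to a parent. The sketch below builds a paragraph from an element, a text node, and a node parsed from a string, then attaches it; the markup and the class name CreateExample are illustrative assumptions.

```csharp
using System;
using HtmlAgilityPack;

class CreateExample
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<body></body>");
        var body = doc.DocumentNode.SelectSingleNode("//body");

        // CreateElement and CreateTextNode produce nodes owned by this document.
        var p = doc.CreateElement("p");
        p.SetAttributeValue("class", "note");
        p.AppendChild(doc.CreateTextNode("Hello "));

        // HtmlNode.CreateNode parses a string that contains a single node.
        p.AppendChild(HtmlNode.CreateNode("<b>Hi!</b>"));

        // Nothing shows up in the document until the node is attached.
        body.AppendChild(p);

        Console.WriteLine(body.OuterHtml);
    }
}
```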

Inserting Nodes

Append child element:

parent.AppendChild(el);

Insert before element:

parent.InsertBefore(newEl, el);

Insert after element:

parent.InsertAfter(newEl, el);

Prepend child element:

parent.PrependChild(el);

Insert HTML before an element (HtmlNode has no InsertAdjacentHtml; create the node, then insert it):

el.ParentNode.InsertBefore(HtmlNode.CreateNode("<p>Hello</p>"), el);
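
The insertion methods all take the new node first and, where relevant, the reference child second. A minimal sketch under that assumption, using an illustrative list and the hypothetical class name InsertExample:

```csharp
using System;
using HtmlAgilityPack;

class InsertExample
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<ul><li>B</li></ul>");

        var list = doc.DocumentNode.SelectSingleNode("//ul");
        var b = list.SelectSingleNode("li"); // relative XPath: child <li> of the list

        list.PrependChild(HtmlNode.CreateNode("<li>A</li>"));   // becomes first child
        list.AppendChild(HtmlNode.CreateNode("<li>D</li>"));    // becomes last child
        list.InsertAfter(HtmlNode.CreateNode("<li>C</li>"), b); // placed right after B

        // Prints A, B, C, D on separate lines.
        foreach (var li in list.Elements("li"))
        {
            Console.WriteLine(li.InnerText);
        }
    }
}
```

The reference node passed to InsertBefore or InsertAfter must be a direct child of the node the method is called on.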

Removing Nodes

Remove single element:

parent.RemoveChild(el);

Remove all children:

parent.RemoveAllChildren();

Remove nodes by ID:

doc.DocumentNode.Descendants("p")
   .Where(p => p.Id == "intro")
   .ToList()
   .ForEach(p => p.Remove());

Remove all nodes:

doc.DocumentNode.RemoveAll();
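
A common application of removal is stripping script and style elements before extracting text. The sketch below assumes a small inline document; the class name RemoveExample is illustrative. The matches are copied into a list first because removing nodes while enumerating Descendants() would modify the collection being walked.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class RemoveExample
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><script>var x = 1;</script><p>Keep me</p><style>p{}</style></div>");

        // Snapshot the matches with ToList() before mutating the tree.
        var unwanted = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList();

        foreach (var node in unwanted)
        {
            node.Remove(); // detaches the node from its parent
        }

        Console.WriteLine(doc.DocumentNode.OuterHtml); // prints <div><p>Keep me</p></div>
    }
}
```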

Loading Sub-Documents

Parse HTML fragment:

var frag = HtmlNode.CreateNode("<b>Hi!</b>");

Append parsed fragment:

doc.DocumentNode.AppendChild(frag);

Load partial document (copy a subtree into its own document):

var section = doc.GetElementbyId("content");
var newDoc = new HtmlDocument();
newDoc.LoadHtml(section.OuterHtml);

Namespaces

HtmlAgilityPack has no namespace registration; a prefixed tag such as <h:element> is treated as an element whose name is "h:element".

Get namespaced nodes:

var nodes = doc.DocumentNode
              .Descendants("h:element");

DOM Traversal

Parent node:

var parent = node.ParentNode;

Child nodes:

var children = parent.ChildNodes;

Next sibling:

var nextSibling = node.NextSibling;

Previous sibling:

var prevSibling = node.PreviousSibling;
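
The traversal properties combine naturally into a recursive walk. The following sketch prints the element tree with indentation; the Walk helper, the class name TraverseExample, and the sample markup are assumptions made for illustration.

```csharp
using System;
using HtmlAgilityPack;

class TraverseExample
{
    // Prints each element name indented by its depth, skipping text and comment nodes.
    static void Walk(HtmlNode node, int depth)
    {
        if (node.NodeType == HtmlNodeType.Element)
        {
            Console.WriteLine(new string(' ', depth * 2) + node.Name);
        }

        foreach (var child in node.ChildNodes)
        {
            Walk(child, depth + 1);
        }
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><h1>Title</h1><p>Text <b>bold</b></p></div>");

        // Start at the first top-level node; DocumentNode itself is a "#document" node.
        Walk(doc.DocumentNode.FirstChild, 0);
        // Prints:
        // div
        //   h1
        //   p
        //     b
    }
}
```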

Caching XPath Queries

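Compile once and reuse (a plain string is reparsed on every call):

// Reusable compiled query (requires using System.Xml.XPath);
// assumes the SelectNodes(XPathExpression) overload in recent HtmlAgilityPack releases
private static readonly XPathExpression ParasXpath = XPathExpression.Compile("//p");

var nodes = doc.DocumentNode.SelectNodes(ParasXpath);

// Later...

var moreNodes = doc.DocumentNode.SelectNodes(ParasXpath);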

Validation

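HtmlAgilityPack does not validate against a DTD or XSD; it records the problems it recovered from while parsing.

Parse errors:

doc.LoadHtml(html); // Does not throw on malformed markup
foreach(var error in doc.ParseErrors) {
  Console.WriteLine($"{error.Code} at line {error.Line}: {error.Reason}");
}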

Encoding

Load as UTF-8:

doc.OptionDefaultStreamEncoding = Encoding.UTF8;

Special characters:

doc.DocumentNode.SelectNodes("//p/text()[contains(., 'en dash –')]");
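
When the encoding of the input is known, it can also be passed directly to Load instead of relying on a default. A minimal sketch, assuming UTF-8 bytes held in memory (the sample text and the class name EncodingExample are illustrative); it also shows HtmlEntity.DeEntitize, since InnerText returns entities undecoded.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

class EncodingExample
{
    static void Main()
    {
        // Simulate a UTF-8 encoded page held as raw bytes.
        byte[] bytes = Encoding.UTF8.GetBytes("<p>café – caf&eacute;</p>");

        using (var stream = new MemoryStream(bytes))
        {
            var doc = new HtmlDocument();

            // Pass the encoding explicitly rather than relying on detection.
            doc.Load(stream, Encoding.UTF8);

            string raw = doc.DocumentNode.InnerText;

            // InnerText keeps entities such as &eacute; as written; DeEntitize decodes them.
            string decoded = HtmlEntity.DeEntitize(raw);

            Console.WriteLine(raw);
            Console.WriteLine(decoded);
        }
    }
}
```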

LINQ Integration

LINQ query:

var paras = from p in doc.DocumentNode.Descendants("p")
            where !p.HasClass("intro")
            select p.InnerText;

Extension methods:

doc.DocumentNode.Descendants("p")
   .Where(p => !p.HasClass("intro"))
   .Select(p => p.InnerText);
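
As a fuller illustration of the LINQ style, the sketch below collects the distinct link targets from a document. The sample markup and the class name LinqExample are assumptions; nothing here is specific to a particular site.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class LinqExample
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<a href=\"/a\">A</a>" +
            "<a href=\"/b\">B</a>" +
            "<a href=\"/a\">A again</a>" +
            "<a>no href</a>");

        // Descendants never returns null, so no guard is needed before querying.
        var links = doc.DocumentNode
            .Descendants("a")
            .Select(a => a.GetAttributeValue("href", "")) // "" when href is missing
            .Where(href => href.Length > 0)
            .Distinct()
            .ToList();

        foreach (var link in links)
        {
            Console.WriteLine(link); // prints "/a", then "/b"
        }
    }
}
```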

Real World Use Cases

Web scraping scripts

Parsers, converters, transformers

Automated testing bots

Site scrapers and crawlers

Architect headless sites

Data extraction from reports

PDF generation

Web automation scripts

Comparing HTML documents

Building HTML editors

Feed readers

Web screenshot tools

Archiving sites

Analyzing SEO metadata

Processing HTML datasets

This covers the most common operations for parsing, traversing, and modifying HTML documents with HtmlAgilityPack in C#.
