The Ultimate Cheat Sheet for HtmlAgilityPack in CSharp

Oct 31, 2023 · 4 min read

HtmlAgilityPack is a .NET library for parsing, querying, and modifying HTML documents, including malformed markup. This cheat sheet collects the most common operations as a quick reference.

Installation

PM> Install-Package HtmlAgilityPack

Loading HTML

From string:

var doc = new HtmlDocument();
doc.LoadHtml("<html>...</html>");

From file:

doc.Load("page.html");

From stream:

using(var fs = File.OpenRead("page.html")) {
  doc.Load(fs);
}

From web (HtmlDocument.Load reads files and streams, not URLs; use HtmlWeb):

var web = new HtmlWeb();
doc = web.Load("http://example.com");

Custom options:

doc.OptionFixNestedTags = true;

Helper method:

private static HtmlDocument LoadHtml(string html) {
  var doc = new HtmlDocument();
  doc.LoadHtml(html);
  return doc;
}

Selecting Nodes

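By XPath, all matching tags (HtmlAgilityPack has no built-in CSS selector support; that requires a separate extension package):

var paras = doc.DocumentNode
              .SelectNodes("//p");

By XPath, nested path:

var items = doc.DocumentNode
             .SelectNodes("//ul/li");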

Get single element by ID:

var content = doc.GetElementbyId("content");

Get elements by tag name:

var divs = doc.DocumentNode.Descendants("div");

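Evaluate XPath expression (through an XPathNavigator, for expressions that return a value rather than nodes):

var xpath = "count(//div/p)";
var count = (double)doc.DocumentNode.CreateNavigator().Evaluate(xpath);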
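
One detail worth keeping in mind with the selection calls above: SelectNodes returns null, not an empty collection, when nothing matches, so looping over the result directly can throw. A minimal sketch of a guard follows; the helper name SelectOrEmpty is hypothetical and not part of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SelectExamples
{
    // Hypothetical helper (not part of the library): returns an empty
    // sequence instead of null when the XPath matches nothing.
    public static IEnumerable<HtmlNode> SelectOrEmpty(HtmlNode root, string xpath)
    {
        IEnumerable<HtmlNode> found = root.SelectNodes(xpath);
        return found ?? Enumerable.Empty<HtmlNode>();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<ul><li>One</li><li>Two</li></ul>");

        // Prints "One" then "Two".
        foreach (var li in SelectOrEmpty(doc.DocumentNode, "//ul/li"))
            Console.WriteLine(li.InnerText);

        // There is no <table>, so the raw SelectNodes call would return null.
        Console.WriteLine(SelectOrEmpty(doc.DocumentNode, "//table").Count()); // 0
    }
}
```

The same guard applies to SelectSingleNode, which returns null when there is no match.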

Looping Nodes

For each loop:

foreach(var item in items) {
  // ...
}

For loop:

for(int i = 0; i < items.Count; i++) {
  var item = items[i];
}

While loop:

int i = 0;
while(i < nodes.Count) {
  var node = nodes[i++];
  // ...
}

Modifying Nodes

Get attribute value:

var cls = el.GetAttributeValue("class", null);

Set attribute value:

el.SetAttributeValue("class", "blue");

Get inner text:

var text = el.InnerText;

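Set inner text (InnerText is read-only, so assign HTML-encoded text to InnerHtml):

el.InnerHtml = System.Net.WebUtility.HtmlEncode("Hello World");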

Get inner HTML:

var html = el.InnerHtml;

Set inner HTML:

el.InnerHtml = "<strong>Hello</strong>";
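
To show how the calls above fit together, here is a small self-contained sketch. The markup and the id "content" are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class ModifyExample
{
    // Loads a one-element document, changes an attribute and the inner HTML,
    // and returns the modified element.
    public static HtmlNode Run()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id=\"content\" class=\"red\">Old</div>");

        var el = doc.GetElementbyId("content");
        el.SetAttributeValue("class", "blue");     // overwrite existing attribute
        el.InnerHtml = "<strong>Hello</strong>";   // replace the children
        return el;
    }

    public static void Main()
    {
        var el = Run();
        Console.WriteLine(el.GetAttributeValue("class", "")); // blue
        Console.WriteLine(el.InnerText);                      // Hello
    }
}
```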

Creating Nodes

Create element:

var el = doc.CreateElement("p");

Create text node:

var text = doc.CreateTextNode("Hello");

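Create document fragment (HtmlAgilityPack has no fragment type; a second HtmlDocument can serve as one):

var fragDoc = new HtmlDocument();
fragDoc.LoadHtml("<b>Hi!</b><i>there</i>");
var fragNodes = fragDoc.DocumentNode.ChildNodes;

Create from HTML:

var node = HtmlNode.CreateNode("<b>Hi!</b>");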

Inserting Nodes

Append child element:

parent.AppendChild(el);

Insert before element:

parent.InsertBefore(newEl, el);

Insert after element:

parent.InsertAfter(newEl, el);

Prepend child element:

parent.PrependChild(el);

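Insert adjacent HTML (there is no InsertAdjacentHtml method; create a node and insert it through the parent):

el.ParentNode.InsertBefore(HtmlNode.CreateNode("<p>Hello</p>"), el);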
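
The insertion methods above can be combined as in the following sketch, which builds up a list around an existing item. The markup and the id "menu" are assumptions for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class InsertExample
{
    // Returns the text of each <li> after the insertions, in document order.
    public static string Run()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<ul id=\"menu\"><li>Middle</li></ul>");

        var list = doc.GetElementbyId("menu");
        var middle = list.SelectSingleNode("./li");

        list.PrependChild(HtmlNode.CreateNode("<li>First</li>"));
        list.AppendChild(HtmlNode.CreateNode("<li>Last</li>"));
        // InsertAfter takes the new node first, then the reference node.
        list.InsertAfter(HtmlNode.CreateNode("<li>After middle</li>"), middle);

        return string.Join(", ", list.Elements("li").Select(li => li.InnerText));
    }

    public static void Main()
    {
        Console.WriteLine(Run()); // First, Middle, After middle, Last
    }
}
```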

Removing Nodes

Remove single element:

parent.RemoveChild(el);

Remove all children:

parent.RemoveAllChildren();

Remove nodes by ID:

doc.DocumentNode.Descendants("p")
   .Where(p => p.Id == "intro")
   .ToList()
   .ForEach(p => p.Remove());

Remove all nodes:

doc.DocumentNode.RemoveAll();
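
A common use of removal is stripping script and style elements before extracting text. The sketch below assumes a small made-up document. It copies the matches with ToList() first, because removing nodes while enumerating Descendants() would modify the collection being iterated.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class RemoveExample
{
    // Removes every <script> and <style> element and returns the remaining HTML.
    public static string Run()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><script>alert(1)</script><p>Text</p><style>p{}</style></div>");

        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList(); // snapshot before mutating the tree

        foreach (var node in unwanted)
            node.Remove();

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        Console.WriteLine(Run()); // <div><p>Text</p></div>
    }
}
```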

Loading Sub-Documents

Parse HTML fragment:

var frag = HtmlNode.CreateNode("<b>Hi!</b>");

Append parsed fragment:

doc.DocumentNode.AppendChild(frag);

Load partial document (from the markup of an existing node):

var newDoc = new HtmlDocument();
newDoc.LoadHtml(doc.GetElementbyId("content").OuterHtml);

Namespaces

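HtmlAgilityPack is not namespace-aware: there is no namespace registration, and a prefixed name such as h:element is treated as a plain tag name.

Get prefixed nodes (match on the full name, since a prefix in the XPath itself needs a namespace manager):

var nodes = doc.DocumentNode
              .SelectNodes("//*[name()='h:element']");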

DOM Traversal

Parent node:

var parent = node.ParentNode;

Child nodes:

var children = parent.ChildNodes;

Next sibling:

var nextSibling = node.NextSibling;

Previous sibling:

var prevSibling = node.PreviousSibling;
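
When using the sibling properties, note that whitespace between tags is parsed as text nodes, so NextSibling is often a text node instead of the next element. A minimal sketch of skipping to the next element follows; the helper name NextElement is hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class TraversalExample
{
    // Hypothetical helper: walks forward until it finds an element node,
    // or returns null if there is none.
    public static HtmlNode NextElement(HtmlNode node)
    {
        var next = node.NextSibling;
        while (next != null && next.NodeType != HtmlNodeType.Element)
            next = next.NextSibling;
        return next;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<p id=\"a\">A</p>\n<p id=\"b\">B</p>");

        var first = doc.GetElementbyId("a");
        Console.WriteLine(first.NextSibling.NodeType); // Text (the newline)
        Console.WriteLine(NextElement(first).Id);      // b
    }
}
```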

Caching XPath Queries

Define each query once and reuse it:

// Reusable query
private const string ParasXpath = "//p";

var nodes = doc.DocumentNode.SelectNodes(ParasXpath);

// Later...

var moreNodes = doc.DocumentNode.SelectNodes(ParasXpath);

Validation

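Check parse errors (HtmlAgilityPack does not validate against a DTD or XSD; it records the markup problems it finds while parsing):

doc.LoadHtml(html);
foreach(var error in doc.ParseErrors) {
  Console.WriteLine($"{error.Line}:{error.LinePosition} {error.Reason}");
}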

Encoding

Load as UTF-8:

doc.OptionDefaultStreamEncoding = Encoding.UTF8;

Special characters:

doc.DocumentNode.SelectNodes("//p/text()[contains(., 'en dash –')]");

LINQ Integration

LINQ query:

var paras = from p in doc.DocumentNode.Descendants("p")
            where !p.HasClass("intro")
            select p.InnerText;

Extension methods:

doc.DocumentNode.Descendants("p")
   .Where(p => !p.HasClass("intro"))
   .Select(p => p.InnerText);
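
As a further illustration of the LINQ style, the sketch below collects the href of every link that has one. The markup is made up, and the empty-string default passed to GetAttributeValue is a choice made here so that links without an href can be filtered out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinqExample
{
    // Returns the href value of every <a> element that has a non-empty href.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants("a")
            .Select(a => a.GetAttributeValue("href", ""))
            .Where(href => href.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        var links = ExtractLinks("<a href=\"/a\">A</a><a>No href</a><a href=\"/b\">B</a>");
        Console.WriteLine(string.Join(", ", links)); // /a, /b
    }
}
```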

Real World Use Cases

  • Web scraping scripts
  • Parsers, converters, transformers
  • Automated testing bots
  • Site scrapers and crawlers
  • Architect headless sites
  • Data extraction from reports
  • PDF generation
  • Web automation scripts
  • Comparing HTML documents
  • Building HTML editors
  • Feed readers
  • Web screenshot tools
  • Archiving sites
  • Analyzing SEO metadata
  • Processing HTML datasets

This cheat sheet covers the core operations for parsing, traversing, and modifying HTML documents with HtmlAgilityPack in C#.