The Complete Libxml2 C++ Cheatsheet

Libxml2 is an XML processing library written in C that can be used from both C and C++ applications. It provides DOM, SAX, XMLReader, XPath and XPointer support.

Getting Started

Include:

#include <libxml/parser.h>
#include <libxml/tree.h>

Parse:

xmlDocPtr doc = xmlParseFile("file.xml");

xmlDocPtr doc = xmlParseMemory(xml, size);

Validate:

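xmlSchemaParserCtxtPtr pctxt = xmlSchemaNewParserCtxt("schema.xsd");
xmlSchemaPtr schema = xmlSchemaParse(pctxt);
xmlSchemaFreeParserCtxt(pctxt);
xmlSchemaValidCtxtPtr valid = xmlSchemaNewValidCtxt(schema);
int rc = xmlSchemaValidateDoc(valid, doc);

xmlSchemaValidateDoc() returns 0 if the document is valid, a positive error code if it is not, and -1 on an internal error. Call xmlSchemaSetValidErrors() beforehand to receive error messages through your own callbacks. These functions are declared in <libxml/xmlschemas.h>.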

Cleanup:

xmlFreeDoc(doc);
xmlSchemaFreeValidCtxt(valid);
xmlSchemaFree(schema);

DOM Parsing

Get root element:

xmlNodePtr root = xmlDocGetRootElement(doc);

Iterate children:

for(xmlNodePtr cur = root->children; cur != NULL; cur = cur->next) {
  // process cur node
}

Get first child (may be a text node; xmlFirstElementChild(root) skips to the first element):

xmlNodePtr child = root->children;

Node Types

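xmlNode: Generic node struct (libxml2 is a C library, so there are no classes). Its type field identifies the kind of node.

XML_ELEMENT_NODE: Element nodes (stored as xmlNode).

XML_TEXT_NODE: Text nodes (stored as xmlNode).

xmlAttr: Attribute nodes (type XML_ATTRIBUTE_NODE).

xmlNs: Namespace declarations.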

Node Operations

Add child:

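xmlNodePtr child = xmlNewChild(parent, NULL, BAD_CAST "node", NULL);

Set/get properties:

xmlSetProp(node, BAD_CAST "key", BAD_CAST "value");
xmlChar *value = xmlGetProp(node, BAD_CAST "key"); // release with xmlFree(value)

Set/get content:

xmlNodeSetContent(node, BAD_CAST "text");
xmlChar *text = xmlNodeGetContent(node); // release with xmlFree(text)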

Remove node:

xmlUnlinkNode(node);
xmlFreeNode(node);

XPath Usage

Evaluate xpath:

xmlXPathContextPtr ctxt = xmlXPathNewContext(doc);
xmlXPathObjectPtr result = xmlXPathEvalExpression(BAD_CAST "/root/node", ctxt);

Get node result:

if(result && !xmlXPathNodeSetIsEmpty(result->nodesetval)) {
  xmlNodePtr node = result->nodesetval->nodeTab[0];
}

Get string result:

if(result->type == XPATH_STRING) {
  xmlChar *str = result->stringval;
}

Cleanup:

xmlXPathFreeObject(result);
xmlXPathFreeContext(ctxt);

SAX Parsing

Create handler struct:

xmlSAXHandler sax;
memset(&sax, 0, sizeof(sax)); // callbacks left NULL are skipped

Set handlers:

sax.startDocument = &startDocHandler;
sax.endElement = &endElementHandler;

Parse:

xmlSAXUserParseFile(&sax, NULL, "file.xml"); // second argument is user data passed to every handler

Tips

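Use xmlFree() to release strings returned by libxml2 (xmlGetProp(), xmlNodeGetContent()); use xmlFreeNode() for nodes and xmlFreeDoc() for documents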

Check return values for errors

Validate with schemas before processing

Mind encoding when outputting XML

Avoid XPath injection from user input
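On the XPath injection tip: XPath 1.0 has no escape sequences inside string literals, so user input cannot simply be backslash-escaped. One common workaround, sketched here with a hypothetical helper named xpathLiteral, is to pick the quote character the value does not contain and fall back to concat() when it contains both.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns an XPath 1.0 expression that evaluates to
// exactly `value`, suitable for splicing into a larger expression.
std::string xpathLiteral(const std::string& value) {
    if (value.find('\'') == std::string::npos) return "'" + value + "'";
    if (value.find('"') == std::string::npos) return "\"" + value + "\"";

    // Both quote kinds present: split on single quotes and rebuild with
    // concat('part', "'", 'part', ...).
    std::string out = "concat(";
    std::size_t start = 0;
    while (true) {
        std::size_t pos = value.find('\'', start);
        std::size_t len = (pos == std::string::npos) ? std::string::npos : pos - start;
        out += "'" + value.substr(start, len) + "'";
        if (pos == std::string::npos) break;
        out += ",\"'\",";
        start = pos + 1;
    }
    out += ")";
    return out;
}
```

A query would then be assembled as "/book[author=" + xpathLiteral(userInput) + "]", so quotes in the input cannot terminate the literal early.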

Examples

Modify XML:

xmlDocPtr doc = xmlParseFile("data.xml");

xmlNodePtr root = xmlDocGetRootElement(doc);

xmlNodePtr node = xmlNewChild(root, NULL, BAD_CAST "newNode", NULL);
xmlSetProp(node, BAD_CAST "key", BAD_CAST "value");

xmlSaveFile("out.xml", doc); // filename first, then the document
xmlFreeDoc(doc);

Extract text:

xmlXPathContextPtr ctxt = xmlXPathNewContext(doc);
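// string() makes the result an XPATH_STRING; text() alone yields a node set
xmlXPathObjectPtr result = xmlXPathEvalExpression(BAD_CAST "string(/root/node)", ctxt);

if(result && result->type == XPATH_STRING) {
  std::cout << (const char *)result->stringval << std::endl;
}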

xmlXPathFreeObject(result);
xmlXPathFreeContext(ctxt);

Namespaces

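Declare namespace on a node:

xmlNsPtr ns = xmlNewNs(node, BAD_CAST "http://ns", BAD_CAST "ns");

Add with namespace:

xmlNewChild(node, ns, BAD_CAST "child", NULL); // the prefix comes from ns, not from the name

Search with namespace:

xmlXPathRegisterNs(ctxt, BAD_CAST "ns", BAD_CAST "http://ns"); // declared in <libxml/xpathInternals.h>
xpath = "/ns:root/ns:node";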

HTML Parsing

Parse HTML:

htmlDocPtr doc = htmlReadFile("file.html", NULL, HTML_PARSE_NOERROR);

Print HTML:

htmlDocDump(stdout, doc);

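Tidy:

libxml2 has no separate tidy function. The HTML parser repairs malformed markup while parsing, so reading a buffer and dumping the result produces cleaned-up HTML (html is a NUL-terminated buffer here):

htmlDocPtr tidy = htmlReadDoc(BAD_CAST html, NULL, "UTF-8", HTML_PARSE_RECOVER | HTML_PARSE_NOERROR);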

Advanced Usage

Custom streams:

xmlParserCtxtPtr ctxt = xmlCreatePushParserCtxt(&sax, NULL, NULL, 0, NULL);

while(moreData) {
  xmlParseChunk(ctxt, data, size, 0);
}

xmlParseChunk(ctxt, NULL, 0, 1); // end
xmlFreeParserCtxt(ctxt);

Custom memory:

xmlMemSetup(free, malloc, realloc, strdup); // substitute your own allocator functions, in this order; call before any other libxml2 function

Debug memory:

xmlMemUsed(); // check used mem

Memory Management

Proper memory management is critical when using libxml2 to avoid leaks.

Free document trees:

xmlFreeDoc(doc);

Frees the entire document tree.

Free nodes:

xmlFreeNode(node);

Frees a node and all of its children. It does not detach the node from its parent, so call xmlUnlinkNode(node) first.

Avoid leaks:

Free document when no longer needed.

Free nodes after removing from tree.

Set nodes to NULL after freeing.

Encoding Handling

Parse with encoding:

doc = xmlReadDoc(BAD_CAST buffer, NULL, "UTF-8", XML_PARSE_NOERROR);

Output encoding:

xmlSaveFormatFileEnc(file, doc, "UTF-8", 1);

Avoid encoding issues:

Explicitly set encoding on parse and output.

Use UTF-8 internally if possible.

Use iconv for conversions.

Advanced XPath

Predicates:

/book[author='James']

Axes:

//ancestor::chapter

Functions:

count(//book)

Namespaces

Register namespace:

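xmlXPathRegisterNs(ctxt, BAD_CAST "ns", BAD_CAST "http://ns");

Use in XPath:

/ns:book/ns:title

Default namespace:

<root xmlns="http://ns">

Unprefixed elements such as <root> now belong to the default namespace. XPath 1.0 has no notion of a default namespace, so register a prefix for that URI and use it in every step of the expression.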

Performance

Parser options:

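xmlReadDoc(BAD_CAST xml, NULL, NULL, XML_PARSE_NONET);

XML_PARSE_NONET forbids network access while parsing. Entity substitution is off by default; despite its name, XML_PARSE_NOENT turns it on, so leave that flag out when parsing untrusted input.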

Reuse contexts:

Avoid creating a new xmlXPathContext for each query.

Cache nodes/results:

Cache costly lookups or searches.

Troubleshooting

HTML parse errors:

Use HTML_PARSE_RECOVER to recover from common HTML errors.

XPath type errors:

Cast string results when needed.

string(//title)

Memory leaks:

Use valgrind, instrumentation, or logging to detect unreleased memory.
