The Complete Libxml2 C++ Cheatsheet

Oct 31, 2023 · 4 min read

Libxml2 is an XML processing library written in C for use in C/C++ applications. It provides DOM, SAX, XMLReader, XPath and XPointer support.

Getting Started

Include:

#include <libxml/parser.h>
#include <libxml/tree.h>

Parse:

xmlDocPtr doc = xmlParseFile("file.xml");

or

xmlDocPtr doc = xmlParseMemory(xml, size);

Validate:

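xmlSchemaParserCtxtPtr pctxt = xmlSchemaNewParserCtxt("schema.xsd");  // needs <libxml/xmlschemas.h>
xmlSchemaPtr schema = xmlSchemaParse(pctxt);
xmlSchemaFreeParserCtxt(pctxt);
xmlSchemaValidCtxtPtr valid = xmlSchemaNewValidCtxt(schema);
int ret = xmlSchemaValidateDoc(valid, doc);  // 0 = valid, > 0 = invalid, -1 = internal error

Check the return value of xmlSchemaValidateDoc(). To receive the error messages, install callbacks with xmlSchemaSetValidErrors().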

Cleanup:

xmlFreeDoc(doc);
xmlSchemaFreeValidCtxt(valid);
xmlSchemaFree(schema);

DOM Parsing

Get root element:

xmlNodePtr root = xmlDocGetRootElement(doc);

Iterate children:

for(xmlNodePtr cur = root->children; cur != NULL; cur = cur->next) {
  // process cur node
}

Get first child:

xmlNodePtr child = root->children;  // may be a text node; use xmlFirstElementChild(root) for the first element

Node Types

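xmlNode: The general node struct, used for elements, text, comments, and most other nodes. Check node->type to tell them apart.

XML_ELEMENT_NODE: node->type value for element nodes.

XML_TEXT_NODE: node->type value for text nodes.

xmlAttr: Attribute nodes (reached through node->properties).

xmlNs: Namespace declarations (reached through node->ns and node->nsDef).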

Node Operations

Add child:

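xmlNodePtr child = xmlNewChild(parent, NULL, BAD_CAST "node", NULL);

Set/get properties:

xmlSetProp(node, BAD_CAST "key", BAD_CAST "value");
xmlChar *value = xmlGetProp(node, BAD_CAST "key");  // a copy; release with xmlFree(value)

Set/get content:

xmlNodeSetContent(node, BAD_CAST "text");
xmlChar *text = xmlNodeGetContent(node);  // a copy; release with xmlFree(text)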

Remove node:

xmlUnlinkNode(node);
xmlFreeNode(node);

XPath Usage

Evaluate xpath:

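xmlXPathContextPtr ctxt = xmlXPathNewContext(doc);  // needs <libxml/xpath.h>
xmlXPathObjectPtr result = xmlXPathEvalExpression(BAD_CAST "/root/node", ctxt);

Get node result:

if(result != NULL && result->nodesetval != NULL && result->nodesetval->nodeNr > 0) {
  xmlNodePtr node = result->nodesetval->nodeTab[0];
}

Get string result (only expressions such as string(/root/node) produce one):

if(result != NULL && result->type == XPATH_STRING) {
  xmlChar *str = result->stringval;
}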

Cleanup:

xmlXPathFreeObject(result);
xmlXPathFreeContext(ctxt);

SAX Parsing

Create handler:

xmlSAXHandler sax;
memset(&sax, 0, sizeof(sax));

Set handlers:

sax.startDocument = &startDocHandler;
sax.endElement = &endElementHandler;

Parse:

xmlSAXUserParseFile(&sax, NULL, "file.xml");  // second argument is user data passed to the handlers

Tips

  • Use xmlFree() to free strings returned by libxml2 (xmlGetProp, xmlNodeGetContent); free nodes with xmlFreeNode()
  • Check return values for errors
  • Validate with schemas before processing
  • Mind encoding when outputting XML
  • Avoid XPath injection from user input

    Examples

    Modify XML:

    xmlDocPtr doc = xmlParseFile("data.xml");
    
    xmlNodePtr root = xmlDocGetRootElement(doc);
    
    xmlNodePtr node = xmlNewChild(root, NULL, BAD_CAST "newNode", NULL);
    xmlSetProp(node, BAD_CAST "key", BAD_CAST "value");
    
    xmlSaveFile("out.xml", doc);  // filename first, then the document
    xmlFreeDoc(doc);
    

    Extract text:

    xmlXPathContextPtr ctxt = xmlXPathNewContext(doc);
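    // text() would return a node set; string() yields an XPATH_STRING result
    xmlXPathObjectPtr result = xmlXPathEvalExpression(BAD_CAST "string(/root/node)", ctxt);
    
    if(result != NULL && result->type == XPATH_STRING) {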
      std::cout << result->stringval << std::endl;
    }
    
    xmlXPathFreeObject(result);
    xmlXPathFreeContext(ctxt);
    

    Namespaces

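    Declare namespace:

    xmlNsPtr ns = xmlNewNs(node, BAD_CAST "http://ns", BAD_CAST "ns");
    

    Add with namespace:

    xmlNewChild(node, ns, BAD_CAST "child", NULL);  // pass the local name; the prefix comes from ns
    

    Search with namespace:

    xmlXPathRegisterNs(ctxt, BAD_CAST "ns", BAD_CAST "http://ns");  // needs <libxml/xpathInternals.h>
    xpath = "/ns:root/ns:node";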
    

    HTML Parsing

    Parse HTML:

    htmlDocPtr doc = htmlReadFile("file.html", NULL, HTML_PARSE_NOERROR);
    

    Print HTML:

    htmlDocDump(stdout, doc);
    

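    Repair malformed HTML:

    // libxml2 has no tidy API; the HTML parser itself closes unbalanced tags.
    // html is a NUL-terminated buffer holding the markup.
    htmlDocPtr fixed = htmlReadDoc(BAD_CAST html, NULL, "UTF-8", HTML_PARSE_RECOVER | HTML_PARSE_NOERROR);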
    

    Advanced Usage

    Custom streams (push parser):

    xmlParserCtxtPtr ctxt = xmlCreatePushParserCtxt(&sax, NULL, NULL, 0, NULL);
    
    while(moreData) {
      xmlParseChunk(ctxt, data, size, 0);
    }
    
    xmlParseChunk(ctxt, NULL, 0, 1); // end
    xmlFreeParserCtxt(ctxt);
    

    Custom memory:

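    xmlMemSetup(free, malloc, realloc, strdup);  // pass your own allocator functions; call before any other libxml2 function
    

    Debug memory:

    xmlMemUsed(); // bytes currently allocated through libxml2's debug allocator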
    

    Memory Management

    Proper memory management is critical when using libxml2 to avoid leaks.

    Free document trees:

    xmlFreeDoc(doc);
    

    Frees the entire document tree.

    Free nodes:

    xmlFreeNode(node);
    

    Frees a specific node and all of its children. It does not detach the node from its parent, so call xmlUnlinkNode(node) first.

    Avoid leaks:

  • Free the document when it is no longer needed.
  • Free nodes after removing them from the tree.
  • Set node pointers to NULL after freeing.

    Encoding Handling

    Parse with encoding:

    doc = xmlReadMemory(buffer, size, NULL, "UTF-8", XML_PARSE_NOERROR);
    

    Output encoding:

    xmlSaveFormatFileEnc(file, doc, "UTF-8", 1);
    

    Avoid encoding issues:

  • Explicitly set encoding on parse and output.
  • Use UTF-8 internally if possible.
  • Use iconv for conversions.

    Advanced XPath

    Predicates:

    /book[author='James']
    

    Axes:

    //title/ancestor::chapter
    

    Functions:

    count(//book)
    

    Namespaces

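    Register namespace:

    xmlXPathRegisterNs(ctxt, BAD_CAST "ns", BAD_CAST "http://ns");
    

    Use in XPath:

    /ns:book/ns:title
    

    Default namespace:

    <root xmlns="http://ns">
    

    Unprefixed elements inside <root> now belong to the default namespace. XPath 1.0 has no notion of a default namespace, so register a prefix for that URI and use it in expressions.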

    Performance

    Parser options:

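    xmlReadFile("file.xml", NULL, XML_PARSE_NONET);
    

    XML_PARSE_NONET forbids network access. XML_PARSE_NOENT does the opposite of what its name suggests: it turns entity substitution on, so leave it out when parsing untrusted input.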

    Reuse contexts:

    Avoid creating a new xmlXPathContext for each query; one context can evaluate many expressions against the same document.

    Cache nodes/results:

    Cache costly lookups or searches.

    Troubleshooting

    HTML parse errors:

    Use HTML_PARSE_RECOVER with the HTML parser to recover from common HTML errors (XML_PARSE_RECOVER is the XML parser's counterpart).

    XPath type errors:

    Wrap node-set expressions in string() when a string result is needed.

    string(//title)
    

    Memory leaks:

    Use valgrind, instrumentation, logging to detect unreleased memory.
