The Ultimate Gumbo C++ Cheatsheet

Gumbo is an HTML5 parsing library written in C (C99) that can be called directly from C++. It parses HTML into a tree structure that you walk to extract content.

Getting Started

Include:

#include "gumbo.h"

Parse:

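GumboOutput* output = gumbo_parse(html);  // html: NUL-terminated UTF-8 const char*

gumbo_parse() returns a tree even for malformed input, and Gumbo has no gumbo_get_error_code(). Parse errors are recorded in output->errors; check output->errors.length.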

Query:

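Get document and root nodes:

GumboNode* doc = output->document;  // document node (parent of <html>)
GumboNode* root = output->root;     // the <html> element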

Cleanup:

gumbo_destroy_output(&kGumboDefaultOptions, output);

DOM Types

GumboNode:

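Tagged struct used for every node (Gumbo is C, so there is no class hierarchy). The type field says which member of the union v (document, element, or text) is valid.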

GumboElement:

Element node, contains tag, attributes, and children.

GumboText:

Text node, contains textual content.

GumboAttribute:

Attribute with name and value.

GumboVector:

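Array of void* with data, length, and capacity fields. Holds child nodes, attributes, and parse errors; cast each entry to the right type.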

Selecting Nodes

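By tag:

Gumbo has no built-in lookup functions; you walk the tree and compare node->v.element.tag yourself. With a user-written helper (FindFirstByTag is hypothetical, not part of Gumbo):

const GumboNode* node = FindFirstByTag(output->root, GUMBO_TAG_DIV);

By id:

Also a tree walk, this time comparing the id attribute returned by gumbo_get_attribute() (FindById is a hypothetical helper):

const GumboNode* node = FindById(output->root, "someId");

Query selector:

Gumbo does not include a CSS selector engine. Third-party wrappers such as gumbo-query add one; with plain Gumbo you filter on the class attribute by hand.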

Children:

GumboVector* children = &node->v.element.children;  // valid only when node->type == GUMBO_NODE_ELEMENT

Iterate children:

for (unsigned int i = 0; i < children->length; ++i) {
  GumboNode* child = static_cast<GumboNode*>(children->data[i]);

  // do something with child
}

Traversing

Parent:

GumboNode* parent = node->parent;

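Next sibling:

GumboNode has no next_sibling or previous_sibling field. Siblings are reached through the parent's children vector and node->index_within_parent (the vector is v.element.children for an element parent, v.document.children for the document node):

GumboVector* siblings = &node->parent->v.element.children;
size_t i = node->index_within_parent;
GumboNode* next = (i + 1 < siblings->length) ? static_cast<GumboNode*>(siblings->data[i + 1]) : NULL;

Previous sibling:

GumboNode* prev = (i > 0) ? static_cast<GumboNode*>(siblings->data[i - 1]) : NULL;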

Manipulating Nodes

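Create, append, insert, remove:

The released gumbo.h (0.10.x) exposes no functions for editing the tree. gumbo_create_element(), gumbo_append_child(), gumbo_insert_before(), and gumbo_remove_from_parent() are not part of that header, and the parsed tree is best treated as read-only. To change a document, copy the parts you need into your own data structure and edit the copy, or use a library with a mutable DOM.

Inner HTML:

There is no inner-HTML getter or setter. gumbo_tag_from_original_text() does something else: it narrows a GumboStringPiece holding an original tag such as "<div class=x>" down to the tag name. To recover an element's source text, use the original_tag and original_end_tag string pieces, which point into the input buffer:

GumboStringPiece open = node->v.element.original_tag;       // e.g. "<div class=x>"
GumboStringPiece close = node->v.element.original_end_tag;  // e.g. "</div>"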

Attributes

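Get attribute:

GumboAttribute* attr = gumbo_get_attribute(&node->v.element.attributes, "id");  // NULL if absent
const char* id = attr ? attr->value : "";

Set or remove attribute:

The released gumbo.h has no gumbo_add_attribute() or gumbo_remove_attribute(). Attribute names and values are strings owned by the parse tree and freed by gumbo_destroy_output(), so assigning your own pointer to attr->value risks an invalid free. Copy the values you need and change the copies.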

Text Nodes

Extract text:

std::string text = textNode->v.text.text;  // valid for text, whitespace, CDATA, and comment nodes

Create text node:

Not available: the released gumbo.h has no function for creating nodes.

Outputting HTML

To HTML:

Gumbo does not include a serializer; gumbo_normalize_html() and gumbo_stringify() are not part of gumbo.h. To emit HTML, walk the tree and print tags, attributes, and text yourself.

Check errors:

unsigned int error_count = output->errors.length;  // 0 means no parse errors were recorded

Parsing Options

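Fragment parsing:

There is no gumbo_parse_fragment(). In Gumbo 0.10 and later, fragments are parsed by setting fragment_context and calling gumbo_parse_with_options():

GumboOptions options = kGumboDefaultOptions;
options.fragment_context = GUMBO_TAG_DIV;  // parse as if the markup sat inside a <div>

With options:

GumboOutput* output = gumbo_parse_with_options(&options, html, html_length);

See GumboOptions in gumbo.h for all options (tab_stop, stop_on_first_error, max_errors, and the allocator hooks).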

Memory Management

Ownership:

GumboNode* pointers are owned by GumboOutput.

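Allocator:

Provide a custom allocator as a pair of C function pointers (MyAlloc and MyFree are your own functions):

options.allocator = MyAlloc;    // void* MyAlloc(void* userdata, size_t size)
options.deallocator = MyFree;   // void MyFree(void* userdata, void* ptr)

Cleanup:

gumbo_destroy_output(&options, output);

Frees the whole tree. Pass the same options used for parsing so that the matching deallocator runs.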

Error Handling

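Error count:

if (output->errors.length == 0) {
  // no parse errors
}

Gumbo has no gumbo_get_error_code() and no GUMBO_OK constant. It recovers from malformed input as the HTML5 spec prescribes, so a tree is returned even when errors are recorded. The GumboError struct is declared in the library's internal error.h rather than in gumbo.h.

Limiting errors:

options.stop_on_first_error = true;  // halt at the first parse error
options.max_errors = 10;             // cap on recorded errors; -1 means no limit

gumbo.h defines no GUMBO_ENABLE_ERROR_MESSAGES macro.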

Tips

Check node->type before reading node->v.element or node->v.text

Free memory with gumbo_destroy_output()

Handle text nodes separately from elements

Check output->errors.length when a tree looks wrong

Cache parsed documents for performance

Never free or replace strings owned by the tree (tag names, attribute values, text)

Examples

Parse and print HTML:

GumboOutput* output = gumbo_parse(html);

// Gumbo has no serializer; Serialize() stands for a tree walk you write yourself.
std::cout << Serialize(output->root);

gumbo_destroy_output(&kGumboDefaultOptions, output);

Extract text:

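// FindFirstByTag is a user-written tree walk (hypothetical helper, not part of Gumbo).
const GumboNode* body = FindFirstByTag(output->root, GUMBO_TAG_BODY);
const GumboVector* children = &body->v.element.children;

// Nodes have no next_sibling field, so iterate the children vector by index.
for (unsigned int i = 0; i < children->length; ++i) {
  const GumboNode* child = static_cast<const GumboNode*>(children->data[i]);

  if (child->type == GUMBO_NODE_TEXT) {
    std::string text = child->v.text.text;  // direct text children of <body> only
    std::cout << text;
  }

}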

Change links:

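// FindFirstByTag is a user-written tree walk (hypothetical helper, not part of Gumbo).
const GumboNode* body = FindFirstByTag(output->root, GUMBO_TAG_BODY);
const GumboVector* children = &body->v.element.children;

// The tree owns href->value and frees it in gumbo_destroy_output(), so the
// new target is recorded in a separate map instead of being assigned into the tree.
std::map<std::string, std::string> rewritten;  // old href -> new href

for (unsigned int i = 0; i < children->length; ++i) {
  const GumboNode* child = static_cast<const GumboNode*>(children->data[i]);

  if (child->type != GUMBO_NODE_ELEMENT) {
    continue;
  }

  GumboAttribute* href = gumbo_get_attribute(&child->v.element.attributes, "href");

  if (href) {
    rewritten[href->value] = "new_link.html";
  }

}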

Advanced Usage

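Custom memory allocator:

GumboAllocator is not a class. The allocator is a pair of plain functions stored in GumboOptions:

void* MyAlloc(void* userdata, size_t size) { return malloc(size); }
void MyFree(void* userdata, void* ptr) { free(ptr); }

GumboOptions options = kGumboDefaultOptions;
options.allocator = MyAlloc;
options.deallocator = MyFree;

Custom tag callbacks:

Gumbo is a tree-building parser, not an event (SAX-style) parser: GumboOptions has no tag_handler field and there is no GumboTagHandler class. To react to particular tags, parse first and then visit the tree, branching on node->v.element.tag.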

## Real-World Use Cases

**Web scraping:**

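```cpp
// Parse page (html is a NUL-terminated const char*)
GumboOutput* output = gumbo_parse(html);

// Find all links: visit every element and keep the href of each <a>.
// scraped_links is the caller's std::vector<std::string>.
std::vector<const GumboNode*> pending = {output->root};

while (!pending.empty()) {
  const GumboNode* node = pending.back();
  pending.pop_back();

  if (node->type != GUMBO_NODE_ELEMENT) {
    continue;
  }

  GumboAttribute* href =
      gumbo_get_attribute(&node->v.element.attributes, "href");

  if (node->v.element.tag == GUMBO_TAG_A && href) {
    // Save link for later scraping (copied, so it outlives the tree)
    scraped_links.push_back(href->value);
  }

  // Queue the children; links come out in stack order, not document order.
  const GumboVector* children = &node->v.element.children;
  for (unsigned int i = 0; i < children->length; ++i) {
    pending.push_back(static_cast<const GumboNode*>(children->data[i]));
  }
}

// Cleanup
gumbo_destroy_output(&kGumboDefaultOptions, output);
```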

Modifying HTML:

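GumboOutput* output = gumbo_parse(html);

// FindById and Serialize are user-written helpers (hypothetical, not part of Gumbo).
// Changing the tag enum is safe because no tree-owned memory is replaced.
GumboNode* node = FindById(output->root, "content");
if (node) {
  node->v.element.tag = GUMBO_TAG_SECTION;  // <div> becomes <section> when re-serialized
}

std::string modified_html = Serialize(output->root);

gumbo_destroy_output(&kGumboDefaultOptions, output);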

Building search index:

// Parse document
GumboOutput* output = gumbo_parse(html);

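// Extract text from nodes (GetText is a user-written recursive walk,
// and index below is the application's own object; neither is part of Gumbo)
std::string text = GetText(output->root);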

// Save text to index
index.AddDocument(url, text);

// Cleanup
gumbo_destroy_output(&kGumboDefaultOptions, output);

Performance and Memory Usage

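Reuse GumboOutput:

Gumbo has no function for reparsing into an existing GumboOutput; gumbo_parse_with_reused_output() is not part of gumbo.h. Each parse builds a new tree, so destroy the old one when you are done with it:

GumboOutput* output = gumbo_parse(html);

// Read from the DOM...

gumbo_destroy_output(&kGumboDefaultOptions, output);
output = gumbo_parse(new_html);  // new_html: the next document to parse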

Cache parsed documents:

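// Cache mapping URLs to GumboOutput. Entries are never evicted here, so
// call gumbo_destroy_output() on each one before the cache goes away.
std::unordered_map<std::string, GumboOutput*> cache;

GumboOutput* Parse(const std::string& url) {
  auto it = cache.find(url);
  if (it != cache.end()) {
    return it->second;
  }

  // LoadHTML is the application's own fetch function, assumed to return std::string.
  // original_tag and original_text fields point into this buffer, so only the
  // copied fields (text, attribute values) stay valid once it is destroyed.
  const std::string html = LoadHTML(url);
  GumboOutput* output = gumbo_parse(html.c_str());
  cache[url] = output;
  return output;
}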

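Custom allocator:

// The allocator is a pair of C function pointers, not a class to derive from.
// MyAlloc, MyFree, and my_arena are your own functions and state.
options.allocator = MyAlloc;      // void* MyAlloc(void* userdata, size_t size)
options.deallocator = MyFree;     // void MyFree(void* userdata, void* ptr)
options.userdata = &my_arena;     // handed back as the first argument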

Advanced Callbacks

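Tag callbacks:

Gumbo does not call back during parsing; GumboTagHandler and options.tag_handler do not exist. The same effect comes from visiting the finished tree:

void VisitLinks(const GumboNode* node) {
  if (node->type != GUMBO_NODE_ELEMENT) {
    return;
  }
  if (node->v.element.tag == GUMBO_TAG_A) {
    // Extract link
  }
  const GumboVector* children = &node->v.element.children;
  for (unsigned int i = 0; i < children->length; ++i) {
    VisitLinks(static_cast<const GumboNode*>(children->data[i]));
  }
}

Attribute callbacks:

There is no options.attribute_handler either. Call a function like this on each attribute while walking the tree (needs <cstring>):

void ExtractImages(const GumboAttribute* attr) {
  // name and value are C strings, so compare with strcmp/strstr rather than ==
  if (strcmp(attr->name, "src") == 0 && strstr(attr->value, ".jpg") != NULL) {
    // Save image
  }
}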
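Matching image URLs by substring is easy to get wrong: comparing C strings with == compares pointers, and a find() result is a position rather than a yes/no answer. A stricter check, sketched here with the standard library only, looks at the attribute name and the file extension. IsJpegSource and EndsWithNoCase are hypothetical helper names; pass attr->name and attr->value from a GumboAttribute.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

// Case-insensitive test for whether `text` ends with `suffix`.
static bool EndsWithNoCase(const std::string& text, const std::string& suffix) {
  if (suffix.size() > text.size()) return false;
  const std::size_t offset = text.size() - suffix.size();
  for (std::size_t i = 0; i < suffix.size(); ++i) {
    const int a = std::tolower(static_cast<unsigned char>(text[offset + i]));
    const int b = std::tolower(static_cast<unsigned char>(suffix[i]));
    if (a != b) return false;
  }
  return true;
}

// True when the attribute is "src" and its value names a .jpg or .jpeg
// file, ignoring any query string or fragment.
bool IsJpegSource(const char* name, const char* value) {
  if (std::strcmp(name, "src") != 0) return false;
  std::string path(value);
  path = path.substr(0, path.find_first_of("?#"));
  return EndsWithNoCase(path, ".jpg") || EndsWithNoCase(path, ".jpeg");
}
```

The extension is only a hint, since servers can deliver any content type under any URL.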

Common Pitfalls

Memory leaks:

Remember to call gumbo_destroy_output() after parsing.

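Invalid HTML:

Gumbo still returns a tree for malformed HTML. Inspect output->errors if you need to know that error recovery took place.

Pointer errors:

Nodes are owned by GumboOutput. Don't free or delete them individually, and don't use them after gumbo_destroy_output().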

Troubleshooting

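Crashing:

Check node->type before touching the v union; reading v.element on a text node is a common cause.

gumbo_parse() does not signal errors through its return value; look at output->errors.length instead.

Use a memory checker like valgrind.

Unexpected output:

The parser repairs malformed markup as the HTML5 spec requires (for example by adding missing <html>, <head>, and <body> elements), so the tree can differ from the source text.

Whitespace between tags can show up as GUMBO_NODE_WHITESPACE children; skip them when counting or indexing elements.

Errors:

Parse errors are listed in output->errors (a GumboVector).

The GumboError struct is declared in Gumbo's internal error.h rather than in gumbo.h.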

FAQ

Q: Why not just use libxml2?

A: Gumbo is focused just on HTML while libxml2 supports XML. Gumbo may be easier to use for some HTML tasks.

Q: Is Gumbo thread-safe?

A: No, you need to synchronize multi-threaded access to GumboOutput.

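Q: What browsers does Gumbo support?

A: Gumbo is a standalone parser and does not run inside or depend on any browser. It implements the HTML5 parsing algorithm, so the tree it builds is intended to match what spec-conformant browsers build from the same markup.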

Additional Resources

Gumbo documentation

HTML parser comparison

Setting up Gumbo on Windows
