The Ultimate Gumbo C++ Cheatsheet

Oct 31, 2023 · 6 min read

Gumbo is an HTML5 parsing library written in pure C99 that can be called directly from C++. It parses HTML into a tree structure that you can traverse to extract data.

Getting Started

Include:

#include "gumbo.h"

Parse:

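GumboOutput* output = gumbo_parse(html);  // html: NUL-terminated UTF-8 const char*

gumbo_parse() returns a tree even for malformed input. Any parse errors are collected in the output->errors vector. Keep the html buffer alive until the tree is destroyed, because parts of the tree point into it.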

Query:

Get the root <html> element (the document node itself is output->document):

GumboNode* doc = output->root;

Cleanup:

gumbo_destroy_output(&kGumboDefaultOptions, output);

DOM Types

GumboNode:

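Tagged struct used for every node. Check node->type, then read the matching member of the union node->v (document, element, or text).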

GumboElement:

Element node, contains tag, attributes, and children.

GumboText:

Text node, contains textual content.

GumboAttribute:

Attribute with name and value.

GumboVector:

Array of void* pointers (data, length, capacity) used for child nodes, attributes, and errors.

Selecting Nodes

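Gumbo has no built-in lookup functions (no get-by-tag, get-by-id, or query selector). To find a node, walk the tree and test each element yourself. CSS selectors require a third-party wrapper such as gumbo-query.

By tag:

if (node->type == GUMBO_NODE_ELEMENT && node->v.element.tag == GUMBO_TAG_DIV) { /* match */ }

By id:

GumboAttribute* id = gumbo_get_attribute(&node->v.element.attributes, "id");
if (id && strcmp(id->value, "someId") == 0) { /* match */ }

By class:

GumboAttribute* cls = gumbo_get_attribute(&node->v.element.attributes, "class");
if (cls && strstr(cls->value, "someClass")) { /* substring match only */ }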
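Since Gumbo only hands back the parsed tree, a lookup by tag is usually written as a recursive depth-first walk. The sketch below shows the shape of that walk. The Node, Vector, NodeType, and Tag types are simplified stand-ins invented here so the snippet compiles without the library; they mirror the GumboNode fields (type, v.element.tag, v.element.children) that the real code would read.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins mirroring the gumbo.h fields used below.
enum NodeType { NODE_ELEMENT, NODE_TEXT };
enum Tag { TAG_HTML, TAG_BODY, TAG_DIV, TAG_P };

struct Vector {
  void** data;          // like GumboVector::data
  unsigned int length;  // like GumboVector::length
};

struct Node {
  NodeType type;
  struct {
    struct {
      Vector children;
      Tag tag;
    } element;
  } v;
};

// Depth-first search: returns the first element with the given tag,
// or nullptr if the subtree contains none.
const Node* FindByTag(const Node* node, Tag tag) {
  assert(node != nullptr);
  if (node->type != NODE_ELEMENT) {
    return nullptr;  // text nodes have no tag and no children
  }
  if (node->v.element.tag == tag) {
    return node;
  }
  const Vector& children = node->v.element.children;
  for (unsigned int i = 0; i < children.length; ++i) {
    const Node* hit = FindByTag(static_cast<const Node*>(children.data[i]), tag);
    if (hit != nullptr) {
      return hit;
    }
  }
  return nullptr;
}
```

With the real library, the same function would take a const GumboNode*, compare against GUMBO_NODE_ELEMENT and a GumboTag value, and a lookup by id would swap the tag comparison for a check on the attribute returned by gumbo_get_attribute().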

Children:

GumboVector* children = &node->v.element.children;

Iterate children:

for (unsigned int i = 0; i < children->length; ++i) {  // length is unsigned
  GumboNode* child = static_cast<GumboNode*>(children->data[i]);

  // do something with child
}

Traversing

Parent:

GumboNode* parent = node->parent;

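Next sibling:

Nodes have no sibling pointers. Use index_within_parent to look in the parent's children (this assumes the parent is an element):

GumboVector* siblings = &node->parent->v.element.children;
size_t i = node->index_within_parent;
GumboNode* next = (i + 1 < siblings->length) ? static_cast<GumboNode*>(siblings->data[i + 1]) : NULL;

Previous sibling:

GumboNode* prev = (i > 0) ? static_cast<GumboNode*>(siblings->data[i - 1]) : NULL;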
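Sibling navigation in Gumbo goes through the parent: each node stores its parent and its index_within_parent, and the neighbours are the adjacent entries in the parent's child vector. The sketch below wraps that in two helpers with bounds checks. Node, Vector, and NodeType are hypothetical stand-ins defined here so the snippet compiles on its own; the field names follow gumbo.h.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins mirroring the gumbo.h fields used below.
enum NodeType { NODE_ELEMENT, NODE_TEXT };

struct Vector {
  void** data;
  unsigned int length;
};

struct Node {
  NodeType type;
  Node* parent;                     // nullptr at the top of the tree
  std::size_t index_within_parent;  // position in parent's children
  struct {
    struct {
      Vector children;
    } element;
  } v;
};

// Returns the node after `node` in its parent's children, or nullptr.
const Node* NextSibling(const Node* node) {
  assert(node != nullptr);
  if (node->parent == nullptr) {
    return nullptr;
  }
  const Vector& siblings = node->parent->v.element.children;
  std::size_t next = node->index_within_parent + 1;
  if (next >= siblings.length) {
    return nullptr;  // node is the last child
  }
  return static_cast<const Node*>(siblings.data[next]);
}

// Returns the node before `node` in its parent's children, or nullptr.
const Node* PreviousSibling(const Node* node) {
  assert(node != nullptr);
  if (node->parent == nullptr || node->index_within_parent == 0) {
    return nullptr;  // no parent, or node is the first child
  }
  const Vector& siblings = node->parent->v.element.children;
  return static_cast<const Node*>(siblings.data[node->index_within_parent - 1]);
}
```

In a real Gumbo tree the parent of the root element is the document node, whose children live in v.document.children, so production code would branch on the parent's type before picking the vector.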

Manipulating Nodes

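Gumbo's public API is read-only. gumbo.h declares no functions for creating, appending, inserting, or removing nodes, and the parser owns every pointer in the tree, so there is no supported way to edit the DOM in place.

To produce modified HTML, walk the tree and apply your changes while writing out a new document, or copy the data you need into your own structures.

Original source text:

Each element records the source text of its start and end tags:

GumboStringPiece start = node->v.element.original_tag;    // start tag as written in the source
GumboStringPiece end = node->v.element.original_end_tag;  // end tag as written in the source
std::string tag_text(start.data, start.length);

gumbo_tag_from_original_text(&start) trims the piece down to just the tag name.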

Attributes

Get attribute:

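GumboAttribute* attr = gumbo_get_attribute(&node->v.element.attributes, "id");

Returns NULL if the attribute is absent. Read attr->name and attr->value as C strings.

Set or remove attribute:

gumbo.h has no functions for adding or removing attributes. The strings in a GumboAttribute are owned and later freed by the parser, so do not point attr->value at your own data; keep changes in your own structures instead.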
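Attribute names and values in Gumbo are plain C strings (const char*), so comparing them with == compares pointers rather than text. A minimal sketch of content-based matching follows; EndsWith and IsJpegSource are hypothetical helper names chosen for illustration, and they take the two strings directly so the snippet does not depend on the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// True if `value` ends with `suffix` (byte-wise, case-sensitive).
bool EndsWith(const char* value, const char* suffix) {
  assert(value != nullptr && suffix != nullptr);
  std::size_t value_len = std::strlen(value);
  std::size_t suffix_len = std::strlen(suffix);
  return value_len >= suffix_len &&
         std::strcmp(value + (value_len - suffix_len), suffix) == 0;
}

// True for an attribute named "src" whose value ends in ".jpg".
// With Gumbo, pass attr->name and attr->value.
bool IsJpegSource(const char* name, const char* value) {
  return std::strcmp(name, "src") == 0 && EndsWith(value, ".jpg");
}
```

A suffix check is used here instead of a substring search so that a value such as "photo.jpg.html" is not treated as an image.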

Text Nodes

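Extract text:

std::string text = textNode->v.text.text;

Valid when textNode->type is GUMBO_NODE_TEXT, GUMBO_NODE_WHITESPACE, GUMBO_NODE_CDATA, or GUMBO_NODE_COMMENT.

Create text node:

Not supported. gumbo.h has no function for creating nodes.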

Outputting HTML

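To HTML:

Gumbo does not include a serializer. Write a recursive function that emits each element's start tag, attributes, children, and end tag; serialize.cc and prettyprint.cc in Gumbo's examples directory are a starting point.

Tag name:

const char* name = gumbo_normalized_tagname(node->v.element.tag);

Check errors:

bool had_errors = output->errors.length > 0;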
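When you write HTML back out yourself, text and attribute values need re-escaping, because the parser stores them with character references already decoded (for example, &amp;amp; in the source becomes a bare & in the tree). The helper below is a minimal sketch of that step; EscapeHtml is a hypothetical name, not a Gumbo function, and it covers only the characters that are significant in text and double-quoted attribute values.

```cpp
#include <string>

// Escapes text for HTML output. Set in_attribute to true when the result
// goes inside a double-quoted attribute value, so quotes are escaped too.
std::string EscapeHtml(const std::string& text, bool in_attribute) {
  std::string out;
  out.reserve(text.size());
  for (char c : text) {
    switch (c) {
      case '&':
        out += "&amp;";
        break;
      case '<':
        out += "&lt;";
        break;
      case '>':
        out += "&gt;";
        break;
      case '"':
        if (in_attribute) {
          out += "&quot;";
        } else {
          out += c;
        }
        break;
      default:
        out += c;
    }
  }
  return out;
}
```

A serializer would call this on node->v.text.text for text nodes and on attr->value for attributes. Contents of script and style elements are normally written out unescaped, so a full serializer would need to special-case them.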

Parsing Options

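Custom options:

GumboOptions options = kGumboDefaultOptions;
options.max_errors = 10;
GumboOutput* output = gumbo_parse_with_options(&options, html, strlen(html));

Fragment parsing (Gumbo 0.10 and later):

options.fragment_context = GUMBO_TAG_DIV;  // parse html as the contents of a <div>

See GumboOptions in gumbo.h for all options.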

Memory Management

Ownership:

GumboNode* pointers are owned by GumboOutput.

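Allocator:

Provide custom allocation functions (plain function pointers, not a class):

options.allocator = MyAlloc;      // void* MyAlloc(void* userdata, size_t size)
options.deallocator = MyFree;     // void MyFree(void* userdata, void* ptr)
options.userdata = &my_state;

Cleanup:

gumbo_destroy_output(&options, output);

Frees all memory. Pass the same options that were used to parse, so the matching deallocator is called.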
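Gumbo's allocator hooks are two C function pointers that each receive a caller-supplied userdata pointer. As a sketch of what can be done with them, the pair below counts allocations while delegating to malloc and free. AllocStats, CountingAlloc, and CountingFree are hypothetical names; only the two function signatures are taken from gumbo.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Counters updated by the hooks below; passed in through userdata.
struct AllocStats {
  std::size_t allocations = 0;
  std::size_t bytes_requested = 0;
  std::size_t frees = 0;
};

// Matches Gumbo's allocator signature: void* (*)(void* userdata, size_t size).
void* CountingAlloc(void* userdata, std::size_t size) {
  assert(userdata != nullptr);
  AllocStats* stats = static_cast<AllocStats*>(userdata);
  ++stats->allocations;
  stats->bytes_requested += size;
  return std::malloc(size);
}

// Matches Gumbo's deallocator signature: void (*)(void* userdata, void* ptr).
void CountingFree(void* userdata, void* ptr) {
  assert(userdata != nullptr);
  if (ptr == nullptr) {
    return;  // nothing to release, so nothing to count
  }
  ++static_cast<AllocStats*>(userdata)->frees;
  std::free(ptr);
}
```

To use them, copy kGumboDefaultOptions, set options.allocator = CountingAlloc, options.deallocator = CountingFree, and options.userdata to the address of an AllocStats, then pass that same options struct to both gumbo_parse_with_options() and gumbo_destroy_output(). If the two counters match after cleanup, every block the parser allocated was released.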

Error Handling

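Parse errors:

if (output->errors.length == 0) {
  // no parse errors
}

The GumboError entries are defined in Gumbo's internal error.h rather than gumbo.h, so the public API only tells you how many errors occurred.

Limit errors:

options.max_errors = 50;             // -1 (default) means no limit
options.stop_on_first_error = true;

Set these before calling gumbo_parse_with_options().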

Tips

  • Check node->type before reading node->v
  • Free memory with gumbo_destroy_output()
  • Handle text nodes separately from elements
  • Inspect output->errors.length when the tree looks wrong
  • Cache parsed documents for performance
  • Keep the input buffer alive for as long as the tree

Examples

    Parse and print the root tag:

    GumboOutput* output = gumbo_parse(html);

    std::cout << gumbo_normalized_tagname(output->root->v.element.tag);  // html

    gumbo_destroy_output(&kGumboDefaultOptions, output);
    

    Extract text:

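    // Print every text node under node, in document order
    void PrintText(const GumboNode* node) {
      if (node->type == GUMBO_NODE_TEXT) {
        std::cout << node->v.text.text;
        return;
      }
      if (node->type != GUMBO_NODE_ELEMENT) {
        return;
      }

      const GumboVector* children = &node->v.element.children;
      for (unsigned int i = 0; i < children->length; ++i) {
        PrintText(static_cast<const GumboNode*>(children->data[i]));
      }
    }

    PrintText(output->root);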
    

    Change links:

    // Gumbo owns and later frees href->value, so assigning your own string
    // to it is unsafe. Compute the replacement and use it when you write
    // the document back out.
    const char* RewrittenHref(const GumboNode* node) {
      if (node->type != GUMBO_NODE_ELEMENT) {
        return NULL;
      }

      GumboAttribute* href = gumbo_get_attribute(&node->v.element.attributes, "href");

      return href ? "new_link.html" : NULL;
    }
    

    Advanced Usage

    Custom memory allocator:

    // GumboOptions takes plain C function pointers, not a class
    void* MyAlloc(void* userdata, size_t size) { return malloc(size); }
    void MyFree(void* userdata, void* ptr) { free(ptr); }

    GumboOptions options = kGumboDefaultOptions;
    options.allocator = MyAlloc;
    options.deallocator = MyFree;


    Custom tag callbacks:

    Gumbo builds the whole tree before returning and has no SAX-style tag handlers. To react to specific tags, walk the finished tree and check node->v.element.tag.
    

    Real-World Use Cases

    Web scraping:

    // Collect the href of every <a> element under node
    void FindLinks(const GumboNode* node, std::vector<std::string>* links) {
      if (node->type != GUMBO_NODE_ELEMENT) {
        return;
      }

      GumboAttribute* href = gumbo_get_attribute(&node->v.element.attributes, "href");
      if (node->v.element.tag == GUMBO_TAG_A && href) {
        // Save link for later scraping
        links->push_back(href->value);
      }

      const GumboVector* children = &node->v.element.children;
      for (unsigned int i = 0; i < children->length; ++i) {
        FindLinks(static_cast<const GumboNode*>(children->data[i]), links);
      }
    }

    // Parse page
    GumboOutput* output = gumbo_parse(html);

    // Find all links
    std::vector<std::string> scraped_links;
    FindLinks(output->root, &scraped_links);

    // Cleanup
    gumbo_destroy_output(&kGumboDefaultOptions, output);

    Modifying HTML:

    // Gumbo provides no editing API or serializer, so apply the change while
    // writing output: emit <section> in place of <div id="content">
    const char* OutputTagName(const GumboNode* node) {
      GumboAttribute* id = gumbo_get_attribute(&node->v.element.attributes, "id");
      if (node->v.element.tag == GUMBO_TAG_DIV && id && strcmp(id->value, "content") == 0) {
        return "section";
      }
      return gumbo_normalized_tagname(node->v.element.tag);
    }
    

    Building search index:

    // Parse document
    GumboOutput* output = gumbo_parse(html);
    
    // Extract text from nodes (GetText is your own recursive helper, not a Gumbo function)
    std::string text = GetText(output->root);
    
    // Save text to index
    index.AddDocument(url, text);
    
    // Cleanup
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    

    Performance and Memory Usage

    Reparse a document:

    GumboOutput* output = gumbo_parse(html);

    // Use the tree...

    // A GumboOutput cannot be reused; destroy it, then parse again
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    output = gumbo_parse(new_html);
    

    Cache parsed documents:

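    // Cache mapping URLs to parsed documents. The source buffer is stored
    // next to the tree because parts of the tree point into it.
    struct CachedDoc {
      std::string html;
      GumboOutput* output;
    };
    std::unordered_map<std::string, CachedDoc> cache;

    GumboOutput* Parse(const std::string& url) {
      auto it = cache.find(url);
      if (it != cache.end()) {
        return it->second.output;
      }

      CachedDoc& doc = cache[url];
      doc.html = LoadHTML(url);  // LoadHTML is your own function
      doc.output = gumbo_parse(doc.html.c_str());
      return doc.output;
    }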
    

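    Custom allocator:

    // For example, route allocations to an arena to cut malloc overhead
    options.allocator = MyAlloc;      // void* MyAlloc(void* userdata, size_t size)
    options.deallocator = MyFree;     // void MyFree(void* userdata, void* ptr)
    options.userdata = &arena;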
    

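    Advanced Callbacks

    Gumbo has no tag or attribute callbacks. Pass your own callback to a tree walk instead.

    Tag callbacks:

    // Call on_element for every element under node
    void Walk(const GumboNode* node, const std::function<void(const GumboNode*)>& on_element) {
      if (node->type != GUMBO_NODE_ELEMENT) {
        return;
      }
      on_element(node);

      const GumboVector* children = &node->v.element.children;
      for (unsigned int i = 0; i < children->length; ++i) {
        Walk(static_cast<const GumboNode*>(children->data[i]), on_element);
      }
    }

    Walk(output->root, [](const GumboNode* node) {
      if (node->v.element.tag == GUMBO_TAG_A) {
        // Extract link
      }
    });


    Attribute callbacks:

    // name and value are C strings, so compare with strcmp/strstr, not ==
    void ExtractImages(const GumboAttribute* attr) {
      if (strcmp(attr->name, "src") == 0 && strstr(attr->value, ".jpg")) {
        // Save image
      }
    }

    Call ExtractImages() on each entry of node->v.element.attributes from inside the walk.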
    

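    Common Pitfalls

    Memory leaks:

    Call gumbo_destroy_output() once you are done with the tree, passing the same options used to parse.

    Invalid HTML:

    Gumbo recovers from malformed HTML the way a browser does, so parsing does not fail. The resulting tree may differ from what the markup suggests.

    Pointer errors:

    Nodes are owned by GumboOutput. Don't free them separately, and keep the input buffer alive as long as the tree.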

    Troubleshooting

    Crashing:

  • Check node->type before reading node->v.element or node->v.text.
  • Make sure the input buffer outlives the GumboOutput.
  • Use a memory checker like valgrind.

    Unexpected output:

  • Check tags and attributes before reading from them.
  • Malformed input is repaired by the parser, so the tree may not mirror the source.

    Errors:

  • Check output->errors.length to see whether the parser reported problems.

    FAQ

    Q: Why not just use libxml2?

    A: Gumbo focuses on HTML5 only, while libxml2 is primarily an XML toolkit that also includes an HTML parser. Gumbo may be easier to use for some HTML tasks.

    Q: Is Gumbo thread-safe?

    A: No, you need to synchronize multi-threaded access to GumboOutput.

    Q: What browsers does Gumbo support?

    A: Gumbo is not tied to any browser. It implements the HTML5 parsing algorithm, so it aims to build the same tree a spec-compliant browser would. See docs for details.

    Additional Resources

  • Gumbo documentation
  • HTML parser comparison
  • Setting up Gumbo on Windows