The Ultimate HTML::TreeBuilder Cheatsheet in Perl

HTML::TreeBuilder is a Perl module that parses HTML into a tree of HTML::Element objects. Once the document is parsed, you can search, modify, and serialize the tree. It is designed for HTML; for XML documents, use an XML parser instead.

Installation

To install HTML::TreeBuilder:

perl -MCPAN -e 'install HTML::TreeBuilder'

Or add it to your project's cpanfile and run cpanm --installdeps . in the project directory:

requires 'HTML::TreeBuilder';

Basic Usage

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

$tree->parse_file("file.html");

my $root = $tree->root;

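This parses the HTML file and stores the document tree in $tree. The tree object is itself the root <html> element, so $tree->root returns the same node and every HTML::Element method can be called directly on $tree.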
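When the markup is already in a string rather than a file, new_from_content can stand in for new plus parse_file. The following is a minimal sketch; the sample markup and variable names are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup, used only for illustration
my $html = '<html><head><title>Demo</title></head>'
         . '<body><p>Hello</p></body></html>';

# new_from_content builds the tree and finishes parsing in one step
my $tree = HTML::TreeBuilder->new_from_content($html);

# The tree object doubles as the root <html> element
my $root_tag = $tree->tag;
my $title    = $tree->look_down(_tag => 'title')->as_text;

print "$root_tag: $title\n";   # prints "html: Demo"
```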

Walking the Tree

To access child nodes:

my @children = $root->content_list;

To get a specific child by index (content_list returns a list, not an array reference, so wrap the call in parentheses):

my $child2 = ($root->content_list)[1];

Loop through children:

foreach my $child ($root->content_list) {
  # do something with $child
}

Navigate to parent:

my $parent = $node->parent;
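One detail worth keeping in mind when walking children: text is stored as plain Perl strings, not as objects, so calling a method on every child can fail. The sketch below, using made-up markup, checks ref before treating a child as an element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup mixing text and an inline element
my $tree = HTML::TreeBuilder->new_from_content(
    '<html><body><p>Hello <b>world</b>!</p></body></html>'
);
my $p = $tree->look_down(_tag => 'p');

my @kinds;
foreach my $child ($p->content_list) {
    if (ref $child) {
        # Element children are HTML::Element objects
        push @kinds, 'element:' . $child->tag;
    }
    else {
        # Text children are plain strings with no methods
        push @kinds, 'text';
    }
}

# Walking back up from the <b> element reaches the <p>
my $bold       = $tree->look_down(_tag => 'b');
my $parent_tag = $bold->parent->tag;

print join(', ', @kinds), "\n";   # prints "text, element:b, text"
```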

Common Node Methods

tag

Get node's tag name:

my $tag = $node->tag;

as_text

Get node's inner text (the text of the node and all its descendants, concatenated):

my $text = $node->as_text;

attr

Get attribute value by name:

my $class = $node->attr('class');

push_content

Add child node:

$parent->push_content($child);

unshift_content

Insert child at beginning:

$parent->unshift_content($newchild);

delete

Remove node:

$node->delete;

replace_with

Replace node with new node:

$oldnode->replace_with($newnode);
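To see how these methods fit together, here is a small sketch that builds a list from scratch with HTML::Element and then edits it. The element names and text are arbitrary examples.

```perl
use strict;
use warnings;
use HTML::Element;

# Build <ul><li>second</li></ul>
my $list   = HTML::Element->new('ul');
my $second = HTML::Element->new('li');
$second->push_content('second');
$list->push_content($second);

# Put a new item in front of the existing one
my $first = HTML::Element->new('li');
$first->push_content('first');
$list->unshift_content($first);

# Swap the original item for a replacement carrying a class attribute
my $replacement = HTML::Element->new('li', class => 'new');
$replacement->push_content('replacement');
$second->replace_with($replacement);
$second->delete;   # the detached node is no longer needed

my @texts = map { $_->as_text } $list->content_list;
print join(', ', @texts), "\n";   # prints "first, replacement"
```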

Searching the Tree

look_down

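Find the first matching node, searching the element and all its descendants (in list context, look_down returns every match):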

my $img = $root->look_down(_tag => 'img');

find_by_tag_name

Find all nodes by tag name:

my @divs = $root->find_by_tag_name('div');

find_by_attribute

Find nodes by attribute value (the attribute name and value are passed as two arguments, not as a hash reference):

my @figs = $root->find_by_attribute('class', 'figure');
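The search methods can be compared side by side on a small document. This is an illustrative sketch with invented markup; it also shows that look_down accepts several criteria at once, including a code reference.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample document
my $html = <<'HTML';
<html><body>
<div class="figure"><img src="a.png" alt="A"></div>
<div class="note"><img src="b.png"></div>
<div class="figure"><p>No image here</p></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# All <div> elements, by tag name
my @divs = $tree->find_by_tag_name('div');

# Elements whose class attribute is exactly "figure"
my @figs = $tree->find_by_attribute('class', 'figure');

# look_down can combine criteria: tag, attribute values, and code refs
my @fig_divs = $tree->look_down(_tag => 'div', class => 'figure');
my @no_alt   = $tree->look_down(
    _tag => 'img',
    sub { !defined $_[0]->attr('alt') },   # images without an alt attribute
);

# In scalar context look_down returns only the first match
my $first_img = $tree->look_down(_tag => 'img');

printf "%d divs, %d figures, %d images without alt\n",
    scalar @divs, scalar @figs, scalar @no_alt;
```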

Modifying the Tree

tag

Change node's tag (calling tag with an argument sets the tag name):

$node->tag('div');

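delete_content and push_content

Set node's text content (there is no single setter, so clear the children and push the new text):

$node->delete_content;
$node->push_content("New text");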

attr

Set attribute value (calling attr with two arguments sets the attribute):

$node->attr('class', 'blue');

push_content

Add child to end (HTML::Element has no append_child method; push_content does this job):

$parent->push_content($child);
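The modifications above can be tried together on one element. The following sketch uses invented markup and relies on the fact that the tag and attr accessors also act as setters when given a value; passing undef as the value to attr deletes the attribute.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup with one paragraph to edit
my $tree = HTML::TreeBuilder->new_from_content(
    '<html><body><p id="intro" class="red">Old text</p></body></html>'
);
my $node = $tree->look_down(id => 'intro');

$node->tag('div');               # rename <p> to <div>
$node->attr('class', 'blue');    # change an attribute value
$node->attr('id', undef);        # undef as the value removes the attribute

# Replace the text: drop the old children, then add the new string
$node->delete_content;
$node->push_content('New text');

print $node->tag, ' ', $node->attr('class'), ': ', $node->as_text, "\n";
# prints "div blue: New text"
```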

Outputting HTML

as_HTML

Serialize tree back to HTML:

print $tree->as_HTML;

as_text

Output text content only:

print $tree->as_text;
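as_HTML also accepts optional arguments: the characters to entity-encode, an indent string, and a hash reference naming the end tags that may be omitted. Called without arguments, it may leave out optional end tags such as </p>. The sketch below, with made-up markup, passes an empty hash reference so that every end tag is written.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample document
my $tree = HTML::TreeBuilder->new_from_content(
    '<html><body><p>Fish &amp; chips</p><p>Second</p></body></html>'
);

# Encode <, > and &, indent with two spaces, and omit no end tags
my $pretty = $tree->as_HTML('<>&', '  ', {});

# as_text returns decoded text from the whole tree
my $text = $tree->as_text;

# as_trimmed_text also strips leading and trailing whitespace
my $first = $tree->look_down(_tag => 'p')->as_trimmed_text;

print $pretty;
print "First paragraph: $first\n";
```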

Full Example

Here is an example script that loads HTML, finds all <img> tags, and sets their width to 100:

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse_file("index.html");

my @imgs = $tree->find_by_tag_name('img');

foreach my $img (@imgs) {
  $img->attr('width', 100);
}

print $tree->as_HTML;

Complex Tree Manipulation

More complex traversal and modification of the tree:

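# Recursively find all <td> elements (a code ref acts as a custom criterion)
my @cells = $root->look_down(sub {
  my $node = shift;
  return $node->tag eq 'td';
});

# Prune a subtree: detach removes the node from its parent but keeps it usable
my $pruned = ($root->content_list)[2];
$pruned->detach if ref $pruned;

# Swap two node positions, using a temporary placeholder element
my $placeholder = HTML::Element->new('span');
$n1->replace_with($placeholder);
$n2->replace_with($n1);
$placeholder->replace_with($n2);
$placeholder->delete;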
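A common use of recursive searches is turning a table into rows of cell text. The following is a rough sketch under the assumption of a simple table without nested tables; the markup is invented for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical two-by-two table
my $tree = HTML::TreeBuilder->new_from_content(
    '<table>'
  . '<tr><td>a</td><td>b</td></tr>'
  . '<tr><td>c</td><td>d</td></tr>'
  . '</table>'
);

# One array reference per row, holding the text of each cell
my @rows;
foreach my $tr ($tree->look_down(_tag => 'tr')) {
    my @cells = map { $_->as_trimmed_text } $tr->look_down(_tag => 'td');
    push @rows, \@cells;
}

print join(' | ', @$_), "\n" for @rows;
```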

Custom Parsers and Handlers

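HTML::TreeBuilder is itself a subclass of HTML::Parser, so there is no separate parser object to plug in. Instead, tune how it handles malformed or unusual markup through its parser options before parsing:

# Configure the built-in parser, then parse malformed markup
my $tree = HTML::TreeBuilder->new;
$tree->implicit_tags(1);    # supply missing <html>, <head>, <body> (the default)
$tree->ignore_unknown(0);   # keep tags the parser does not recognize
$tree->store_comments(1);   # keep comments in the tree

$tree->parse_content($html);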

Performance and Memory Optimization

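Avoid keeping the entire tree in memory longer than you need it:

# Extract the info you need, then discard the tree
my $div  = $tree->look_down(_tag => 'div');
my $info = $div ? $div->as_text : undef;

# Frees the whole tree; needed on older HTML-Tree versions, which
# held circular references between parents and children
$tree->delete;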

Real-World Use Cases

Scraping content from HTML:

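# Extract article content
# HTML5 tags such as <article> may be dropped as unknown unless
# $tree->ignore_unknown(0) was set before parsing.
my $article = $root->look_down(_tag => 'article');

# Guard against documents that have no <article> element
my $text = $article ? $article->as_text : '';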

Using HTML::TreeBuilder for templating:

# Template system

my $template = HTML::TreeBuilder->new;
$template->parse_content($html);

# ... logic to fill template ...

# Replace the placeholder element, if the template has one
my $main = $template->look_down(id => 'main');
$main->replace_with($content) if $main;

print $template->as_HTML;
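As a variation on the templating idea, the slot can be filled rather than replaced, which keeps its id available for later passes. This is a minimal sketch; the template markup and the content element are hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

# Hypothetical template with a slot marked by id="main"
my $template = HTML::TreeBuilder->new_from_content(
    '<html><body><h1>Site</h1><div id="main">placeholder</div></body></html>'
);

# Content to inject, built as an element so it is escaped on output
my $content = HTML::Element->new('p');
$content->push_content('Hello from the template');

# Empty the slot and fill it, leaving the <div id="main"> wrapper in place
my $slot = $template->look_down(id => 'main');
if ($slot) {
    $slot->delete_content;
    $slot->push_content($content);
}

my $html_out = $template->as_HTML;
print $html_out;
```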

Tips and Tricks

Check if a node has children:

if ($node->content_list) {
  # has children
}

Remove all children:

$node->delete_content;

Get first/last child (HTML::Element has no first_child or last_child methods, so index into content_list):

my $first = ($node->content_list)[0];
my $last = ($node->content_list)[-1];

Comparison with Mojo::DOM

HTML::TreeBuilder	Mojo::DOM
Navigates with parent, content_list, left, right	Navigates with parent, children, next, previous
Edits the tree in place; serialize with as_HTML	Edits the tree in place; serialize with to_string
Heavier memory usage	Lower memory footprint
Searches with look_down and find_by_* criteria	Searches with CSS selectors (find, at)

Error Handling

# Wrap in eval block
eval {
  $tree->parse($html);
  $tree->eof;   # signal end of input so the tree is finalized
};
if ($@) {
  die "Parse error: $@";
}
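Not every failure raises an exception. parse_file, which is inherited from HTML::Parser, returns undef when the file cannot be opened and leaves the reason in $!, so the return value is worth checking as well. A small sketch follows; the file name is a deliberately nonexistent placeholder.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

# Hypothetical path that is assumed not to exist
my $path = 'no-such-file-example.html';

# parse_file returns undef if the file cannot be opened
my $ok = $tree->parse_file($path);

my $message = defined $ok
    ? "Parsed $path"
    : "Cannot parse $path: $!";

print "$message\n";
```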
