HTML::TreeBuilder is a Perl module that parses HTML documents into a tree of HTML::Element objects, which you can then traverse, query, and modify.
Installation
To install HTML::TreeBuilder:
perl -MCPAN -e 'install HTML::TreeBuilder'
Or add it to your Perl project's cpanfile:
requires 'HTML::TreeBuilder';
Basic Usage
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new;
$tree->parse_file("file.html");
my $root = $tree->root;
This parses the HTML file and stores the document tree in $tree. The tree object is itself the root <html> element, so $tree->root returns that same object.
Walking the Tree
To access child nodes:
my @children = $root->content_list;
To get a specific child by index (content_list returns a list, not an array reference):
my $child2 = ($root->content_list)[1];
Loop through children:
foreach my $child ($root->content_list) {
    # $child is either an HTML::Element object or a plain text string
}
Navigate to parent:
my $parent = $node->parent;
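Child lists can mix element objects and plain text strings, so it helps to test each child with ref before calling methods on it. The following is a minimal sketch of that check; the input markup and the @kinds variable are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input, chosen only for illustration.
my $html = '<html><body><p>Hello <b>world</b>!</p></body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;    # signal end of input

my $p = $tree->look_down(_tag => 'p');

my @kinds;
foreach my $child ($p->content_list) {
    if (ref $child) {
        # Element nodes are HTML::Element objects
        push @kinds, 'element:' . $child->tag;
    }
    else {
        # Text nodes are plain Perl strings
        push @kinds, 'text';
    }
}
print join(', ', @kinds), "\n";
print 'parent of <p>: ', $p->parent->tag, "\n";
```

Calling a method such as tag on a text child would die, which is why the ref test comes first.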
Common Node Methods
tag
Get node's tag name:
my $tag = $node->tag;
as_text
Get node's inner text:
my $text = $node->as_text;
attr
Get attribute value by name:
my $class = $node->attr('class');
push_content
Add child node:
$parent->push_content($child);
unshift_content
Insert child at beginning:
$parent->unshift_content($newchild);
delete
Remove node:
$node->delete;
replace_with
Replace node with new node:
$oldnode->replace_with($newnode);
Searching the Tree
look_down
Find node recursively:
my $img = $root->look_down(_tag => 'img');
find_by_tag_name
Find all nodes by tag name:
my @divs = $root->find_by_tag_name('div');
find_by_attribute
Find nodes whose attribute has exactly the given value (arguments are the attribute name and the value):
my @figs = $root->find_by_attribute('class', 'figure');
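look_down accepts several criteria at once, and an element must satisfy all of them. The sketch below, using made-up markup, combines a tag name with a code reference and with a regular expression, then uses look_up to search the ancestors of a match.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input, chosen only for illustration.
my $html = '<html><body>'
         . '<div class="figure"><img src="a.png" alt="A"></div>'
         . '<img src="b.png">'
         . '</body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Tag name plus a code ref: only <img> elements that carry an alt attribute
my @with_alt = $tree->look_down(
    _tag => 'img',
    sub { defined $_[0]->attr('alt') },
);

# Tag name plus a regex on an attribute value
my @pngs = $tree->look_down(_tag => 'img', src => qr/\.png$/);

# look_up takes the same criteria but walks up through the ancestors
my $figure = $with_alt[0]->look_up(_tag => 'div', class => 'figure');

printf "with alt: %d, png images: %d\n", scalar @with_alt, scalar @pngs;
print "enclosing figure found\n" if $figure;
```

In list context look_down returns every match; in scalar context it returns only the first one.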
Modifying the Tree
tag
Change node's tag by passing the new name to tag:
$node->tag('div');
delete_content and push_content
Set node's text content by clearing its children and adding a string:
$node->delete_content;
$node->push_content("New text");
attr
Set attribute value by passing a second argument to attr:
$node->attr(class => 'blue');
push_content
Add child to end:
$parent->push_content($child);
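The modifications in this section can be combined as in the following sketch. It relies on tag and attr acting as setters when given a value, and on delete_content and push_content to replace text; the markup and variable names are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><body><p id="intro">Old text</p></body></html>');
$tree->eof;

my $node = $tree->look_down(id => 'intro');

$node->tag('div');               # rename the element
$node->attr(class => 'blue');    # set an attribute
$node->delete_content;           # drop the existing children
$node->push_content('New text'); # add a text child

# Build a new element and append it after the text
my $note = HTML::Element->new('em');
$note->push_content('added');
$node->push_content(' ', $note);

print $node->as_HTML, "\n";
```

push_content accepts any mix of strings and elements, so the separating space and the new element can be added in one call.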
Outputting HTML
as_HTML
Serialize tree back to HTML:
print $tree->as_HTML;
as_text
Output text content only:
print $tree->as_text;
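as_HTML also takes optional arguments: the set of characters to encode as entities, an indent string, and a hash reference naming the end tags that may be omitted. The sketch below passes an empty hash reference so that every end tag is written out; the markup is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><body><p>Hi &amp; bye</p></body></html>');
$tree->eof;

# Arguments: characters to encode, indent string, optional-end-tag map.
# The empty hash ref asks for all end tags, including </p>.
my $html = $tree->as_HTML('<>&', '  ', {});

# as_text returns the decoded text with no markup
my $text = $tree->as_text;

print $html, "\n";
print $text, "\n";
```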
Full Example
Here is an example script that loads HTML, finds all <img> elements, sets their width to 100, and prints the resulting HTML:
use strict;
use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new;
$tree->parse_file("index.html");
my @imgs = $tree->find_by_tag_name('img');
foreach my $img (@imgs) {
    $img->attr(width => 100);    # attr with a value sets the attribute
}
print $tree->as_HTML;
Complex Tree Manipulation
More complex traversal and modification of the tree:
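# Recursively find all <td> elements
my @cells = $root->look_down(sub {
    my $node = shift;
    return $node->tag eq 'td';
});
# Prune a subtree: detach unlinks the node from its parent but keeps it usable
my $pruned = ($root->content_list)[2];
$pruned->detach if ref $pruned;
# Swap two node positions by way of a temporary placeholder element
# (both nodes need a parent, and neither may contain the other)
my $tmp = HTML::Element->new('span');
$n1->replace_with($tmp);
$n2->replace_with($n1);
$tmp->replace_with($n2);
$tmp->delete;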
Custom Parsers and Handlers
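HTML::TreeBuilder is itself a subclass of HTML::Parser, so there is no separate parser to plug in. Configure the builder before parsing instead:
# Options that help with malformed or unusual markup
my $tree = HTML::TreeBuilder->new;
$tree->ignore_unknown(0);    # keep elements with unrecognized tag names
$tree->store_comments(1);    # keep comments in the tree
$tree->parse($html);
$tree->eof;                  # signal end of input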
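As an illustration of configuring the builder, the sketch below parses the same made-up markup twice: once with default settings and once with store_comments enabled. Stored comments appear as pseudo-elements whose tag is ~comment and whose text attribute holds the comment body.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input, chosen only for illustration.
my $html = '<html><body><!-- note --><p>Text</p></body></html>';

# Default settings: comments are dropped
my $default = HTML::TreeBuilder->new;
$default->parse($html);
$default->eof;
my @none = $default->look_down(_tag => '~comment');

# With store_comments, comments are kept as ~comment pseudo-elements
my $keeping = HTML::TreeBuilder->new;
$keeping->store_comments(1);
$keeping->parse($html);
$keeping->eof;
my @comments = $keeping->look_down(_tag => '~comment');

printf "default: %d, store_comments: %d\n", scalar @none, scalar @comments;
print 'comment text:', $comments[0]->attr('text'), "\n" if @comments;
```

Options like this one must be set before parse is called, since they affect how the tree is built.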
Performance and Memory Optimization
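Free nodes you no longer need instead of keeping the entire tree in memory:
# Extract the information, then discard the subtree
my $div  = $tree->look_down(_tag => 'div');
my $info = $div ? $div->as_text : undef;
$div->delete if $div;    # removes the subtree from the tree
$tree->delete;           # frees the whole tree when you are done with it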
Real-World Use Cases
Scraping content from HTML:
# Extract article content
# (assumes the tree kept <article>; call $tree->ignore_unknown(0) before
# parsing if your HTML::Tagset version does not recognize the tag)
my $article = $root->look_down(_tag => 'article');
my $text = $article ? $article->as_text : '';
Using HTML::TreeBuilder for templating:
# Template system
my $template = HTML::TreeBuilder->new;
$template->parse($html);
$template->eof;    # signal end of input
# ... logic to fill template ...
$template->find_by_attribute(id => 'main')
->replace_with($content);
print $template->as_HTML;
Tips and Tricks
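Check whether a node has children:
if ($node->content_list) {
    # has children
}
Remove all children of a node:
$node->delete_content;
Get the first and last child (HTML::Element has no first_child or last_child method):
my $first = ($node->content_list)[0];
my $last  = ($node->content_list)[-1];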
Comparison with Mojo::DOM
HTML::TreeBuilder | Mojo::DOM |
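Tree of HTML::Element objects with parent/child links | Tree of Mojo::DOM nodes, also navigable via parent and children |
Modifies the parsed tree in place | Also modifies its parsed tree in place; the input string is unchanged |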
Heavier memory usage | Lower memory footprint |
Straightforward DOM interface | CSS selector-based methods |
Error Handling
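# parse seldom dies on bad markup, but eval guards against unexpected failures
my $ok = eval {
    $tree->parse($html);
    $tree->eof;    # signal end of input
    1;
};
die "Parse error: $@" unless $ok;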
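A more common failure than a parse error is a file that cannot be opened. parse_file is inherited from HTML::Parser, which returns undef in that case and leaves the reason in $!. The sketch below checks for that; the missing path is deliberately made up, inside a fresh temporary directory.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use HTML::TreeBuilder;

# Hypothetical path that is guaranteed not to exist
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/missing.html";

my $tree   = HTML::TreeBuilder->new;
my $result = $tree->parse_file($path);

my $message;
if (defined $result) {
    $message = 'parsed';
}
else {
    # $! holds the operating system's reason for the failed open
    $message = "Cannot open $path: $!";
}
print $message, "\n";
```

Checking the return value this way avoids working with an empty tree when the input file is missing.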