The Ultimate HTML::Parser Perl Cheat Sheet

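HTML::Parser is an event-driven Perl module that tokenizes HTML (and, with xml_mode enabled, XML-style markup) and calls user-supplied handlers for tags, text, comments, and other constructs. It is a common foundation for content extraction and HTML rewriting.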

Getting Started

Installation:

cpan HTML::Parser
# or from package manager

HTML::Parser is distributed on CPAN. It is also packaged by most operating-system package managers, for example as libhtml-parser-perl on Debian and Ubuntu.

Simple parsing:

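use HTML::Parser;

my $text = '';
my $parser = HTML::Parser->new(
  api_version => 3,
  text_h      => [ sub { $text .= shift }, 'dtext' ], # collect decoded text
);
$parser->parse($html);
$parser->eof; # flush buffered text

print $text;

This parses an HTML string, collects the decoded text through a text handler, and prints it. HTML::Parser has no accessor that returns the text after parsing; handlers do the collecting.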

Concepts:

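Parser tokenizes HTML/XML input and reports events; it does not build a tree

Handlers are called on events (start tag, end tag, text, comment, ...) during parsing

For a tree that can be traversed or manipulated after parsing, use HTML::TreeBuilder, which is built on top of HTML::Parser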
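To make the event model concrete, here is a minimal sketch that records the events HTML::Parser reports for a small snippet. The input string and the @events array are illustrative assumptions; the handlers use the version 3 API (api_version => 3) with argspec strings.

```perl
use strict;
use warnings;
use HTML::Parser;

# Record each event as a short string, in the order the parser reports it.
my @events;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { push @events, "start:$_[0]" }, 'tagname' ],
    end_h       => [ sub { push @events, "end:$_[0]" },   'tagname' ],
    text_h      => [ sub { push @events, "text:$_[0]" },  'dtext' ],
);
$parser->unbroken_text(1);    # keep each run of text in one event

$parser->parse('<p>Hello <b>world</b></p>');
$parser->eof;                 # flush anything still buffered

print "$_\n" for @events;
```

Each handler receives only the arguments named in its argspec, so the callbacks stay small and the order of @events mirrors the order of the markup.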

Parsing HTML

From string:

$parser->parse($html_string);
$parser->eof; # signal end of document so buffered text is flushed
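Because parse() can be called repeatedly, input does not have to arrive in one piece. The following sketch, which assumes a made-up document split at awkward positions, feeds three chunks to one parser and collects the text of the h1 element.

```perl
use strict;
use warnings;
use HTML::Parser;

my $heading    = '';
my $in_heading = 0;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { $in_heading = 1 if $_[0] eq 'h1' }, 'tagname' ],
    end_h   => [ sub { $in_heading = 0 if $_[0] eq 'h1' }, 'tagname' ],
    text_h  => [ sub { $heading .= $_[0] if $in_heading }, 'dtext' ],
);

# The chunk boundaries deliberately fall inside a tag and inside the text.
for my $chunk ('<h', '1>Chunked ', 'page</h1><p>rest</p>') {
    $parser->parse($chunk);
}
$parser->eof;    # tell the parser that no more input will follow

print "$heading\n";
```

The parser buffers an incomplete tag until the next chunk arrives, which is why the same handlers work for strings, files, and network streams.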

From file:

open(my $fh, '<', $file) or die "Cannot open $file: $!";
$parser->parse_file($fh); # also accepts a file name

From URL:

use LWP::Simple;
my $html = get($url);
die "Could not fetch $url" unless defined $html; # get() returns undef on failure
$parser->parse($html);
$parser->eof;

Parse options:

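$parser->strict_comment(1); # end comments per the HTML specification, not at the first "-->"

$parser->unbroken_text(1); # never split a run of text across several text events

$parser->xml_mode(1); # XML rules: case-sensitive names, <tag/> empty elements

# There is no option that dies on invalid HTML; fragments are parsed with plain parse()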
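As one example of an option that HTML::Parser does provide, the sketch below enables xml_mode and records the tag events for a small XML-style input. The element names Feed and Item are invented for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

my @events;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { push @events, "start:$_[0]" }, 'tagname' ],
    end_h   => [ sub { push @events, "end:$_[0]" },   'tagname' ],
);

# xml_mode keeps the case of tag names and treats <Item/> as an
# empty element, reported as a start event followed by an end event.
$parser->xml_mode(1);

$parser->parse('<Feed><Item/></Feed>');
$parser->eof;

print "$_\n" for @events;
```

Without xml_mode the tag names would be lowercased and the trailing slash would not produce an end event.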

Accessing Elements

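HTML::Parser itself has no element lookup methods. To query a parsed document, build a tree with HTML::TreeBuilder (from the HTML-Tree distribution, which is built on HTML::Parser). The snippets below assume such a tree:

use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_content($html);

By tag name:

my @divs = $tree->look_down(_tag => 'div');

By attribute:

my @inputs = $tree->look_down(sub { defined $_[0]->attr('name') });

With XPath (build the tree with HTML::TreeBuilder::XPath instead):

my @links = $tree->findnodes('//a');

Extract links:

my @links = @{ $tree->extract_links() }; # each entry: [$url, $element, $attr, $tag]

Element content:

my $content = $element->as_text; # text inside the element; as_HTML returns the markup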
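When only one kind of element is needed, a start handler on plain HTML::Parser is often enough and no tree has to be built. This sketch collects href values from anchor tags; the sample markup is made up.

```perl
use strict;
use warnings;
use HTML::Parser;

my @links;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            # Tag and attribute names are lowercased by default,
            # and attribute values arrive with entities decoded.
            push @links, $attr->{href}
                if $tag eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);

$parser->parse('<p><a href="/one">One</a> <A HREF="/two?x=1&amp;y=2">Two</A></p>');
$parser->eof;

print "$_\n" for @links;
```

The links are relative here; resolving them against a base URL would be a separate step, for example with the URI module.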

Manipulating HTML

Modify elements:

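use HTML::Entities;
$parser->handler(start => sub {
  my ($tag, $attr) = @_;
  $attr->{class} = 'newclass';
  # Handlers only observe events, so print the rebuilt tag to emit the change
  print "<$tag", (map { qq( $_=") . encode_entities($attr->{$_}) . '"' } sort keys %$attr), ">";
}, 'tagname, attr');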

Remove elements:

$parser->ignore_elements('script'); # suppress script start/end events and everything between them

Modify text:

$parser->handler(text => sub {
  my $text = shift;
  $text =~ s/foo/bar/g;
  print $text; # return values are ignored, so emit the result directly
}, 'text');

Insert elements:

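$parser->handler(start => sub {
  my ($tag, $text) = @_;
  print $text; # echo the original start tag
  print "<div>New elem</div>" if $tag eq 'body'; # example: insert right after <body>
}, 'tagname, text');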
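Since handler return values are ignored, rewriting HTML usually means copying events to an output buffer and changing only the ones of interest. The following is a minimal sketch of that pass-through pattern, assuming a made-up input and an $out buffer: a default handler copies markup unchanged, a text handler edits text, and ignore_elements drops script elements.

```perl
use strict;
use warnings;
use HTML::Parser;

my $out = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    # Events without a handler of their own are copied through as-is.
    default_h => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
    # Text is edited before it is copied.
    text_h    => [ sub { my $t = shift; $t =~ s/foo/bar/g; $out .= $t }, 'text' ],
);
$parser->unbroken_text(1);          # avoid a match being split across events
$parser->ignore_elements('script'); # drop <script> and its content

$parser->parse('<p>foo</p><script>var foo = 1;</script><p>more foo</p>');
$parser->eof;

print "$out\n";
```

The text handler uses the raw 'text' argspec rather than 'dtext', so entities in the input are copied through without being decoded and re-encoded.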

Handlers and Events

Start handler:

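# The third argument is an argspec: the list of values passed to the callback
$parser->handler(start => \&start, "tagname, attr, self");
$parser->report_tags("div"); # only report div tags

sub start {
  my ($tag, $attr, $self) = @_;

  print "Start $tag\n";
}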

End handler:

# The third argument is an argspec, not a tag filter
$parser->handler(end => \&end, "tagname");

sub end {
  my ($tag) = @_;
  print "End $tag\n";
}

Available events: text, start, end, declaration, comment, process, start_document, end_document, default
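The argspec decides what a handler receives, and report_tags limits which tags generate events at all. This sketch, using an invented snippet, reports only div start tags and reads their id attribute.

```perl
use strict;
use warnings;
use HTML::Parser;

my @seen;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [
        sub {
            my ($tag, $attr) = @_;    # matches the 'tagname, attr' argspec
            push @seen, $tag . '#' . ($attr->{id} // '');
        },
        'tagname, attr',
    ],
);
$parser->report_tags('div');    # start/end events for other tags are skipped

$parser->parse('<div id="a"><p>x</p><div id="b"></div></div>');
$parser->eof;

print "$_\n" for @seen;
```

Filtering in the parser with report_tags avoids calling into Perl for tags the handler would discard anyway.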

Tree Traversal

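HTML::Parser does not build a tree, so traversal needs HTML::TreeBuilder, whose nodes are HTML::Element objects:

use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_content($html);
$tree->traverse(\&process);

sub process {
  my ($node, $start, $depth) = @_;
  return 1 unless ref $node; # text nodes are plain strings

  # process node
  return 1; # a true value continues the traversal
}

my $parent = $node->parent;
my @children = $node->content_list;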
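If a full tree is not needed, document structure can still be tracked during event parsing by keeping a stack of open tags. The sketch below is one way to do that with plain HTML::Parser; the @stack and @paths names and the sample list are assumptions, and void elements such as br are not handled.

```perl
use strict;
use warnings;
use HTML::Parser;

my (@stack, @paths);
my $parser = HTML::Parser->new(
    api_version => 3,
    # Push on every start tag, pop when the matching end tag arrives.
    start_h => [ sub { push @stack, $_[0] }, 'tagname' ],
    end_h   => [ sub { pop @stack if @stack && $stack[-1] eq $_[0] }, 'tagname' ],
    text_h  => [
        sub {
            my $text = shift;
            $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
            push @paths, join('/', @stack) . ": $text" if length $text;
        },
        'dtext',
    ],
);
$parser->unbroken_text(1);

$parser->parse('<ul><li>one</li><li>two <em>three</em></li></ul>');
$parser->eof;

print "$_\n" for @paths;
```

The stack gives each text node its ancestor path, which covers many cases where parent and child lookups would otherwise require a tree.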

Integration

Web scraping:

use Web::Scraper;

my $scraper = scraper {
  process "div.results a", "links[]" => '@href';
};

my $result = $scraper->scrape($html); # $result->{links} is an array reference

Mojolicious:

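use Mojolicious::Lite;
use HTML::Parser;

get '/' => sub {
  my $c = shift;
  my $text = '';
  my $parser = HTML::Parser->new(api_version => 3, text_h => [ sub { $text .= shift }, 'dtext' ]);
  $parser->parse($html); # $html: markup obtained elsewhere
  $parser->eof;

  $c->render(text => $text);
};

app->start;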

Feeds:

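use HTML::Parser;
use XML::FeedPP;

my $feed = XML::FeedPP->new($feed_xml); # XML::FeedPP parses the feed itself
my $parser = HTML::Parser->new(api_version => 3, text_h => [ sub { print shift }, 'dtext' ]);
for my $item ($feed->get_item) {
  $parser->parse($item->description // ''); # item bodies often contain HTML
  $parser->eof;
}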

Parsing Edge Cases

Malformed HTML:

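HTML::Parser is lenient by default and keeps emitting events for malformed markup, so no option is needed to ignore errors. Enabling unbroken_text keeps each run of text in a single event:

$parser->unbroken_text(1);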

Extract data from invalid markup:

my @dates;
$parser->handler(text => sub {
  my $text = shift;
  push @dates, $1 if $text =~ /(\d{4}-\d{2}-\d{2})/; # collect ISO-style dates
}, 'dtext');

Parse fragments:

$parser->parse($html); # parse() accepts any fragment or chunk
$parser->eof;

Embedded content:

# <style> and <script> bodies arrive as ordinary text events; skip them at the source
$parser->ignore_elements(qw(style script));
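The tolerance for broken markup can be seen in a short sketch. The input below, a made-up snippet with unclosed p and b tags, still produces text events, and the dates are collected from them.

```perl
use strict;
use warnings;
use HTML::Parser;

my @dates;
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h => [
        # Collect every ISO-style date found in a text event.
        sub { push @dates, $_[0] =~ /(\d{4}-\d{2}-\d{2})/g },
        'dtext',
    ],
);
$parser->unbroken_text(1);    # a date is never split across two events

# No closing tags at all: the parser does not complain.
$parser->parse('<p>Released 2024-01-15<p>Updated <b>2024-02-01');
$parser->eof;                 # flushes the trailing text

print "$_\n" for @dates;
```

Calling eof matters here: the last text run has no tag after it, so it is only reported once the parser knows the input has ended.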

Best Practices

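Enable the strict options (strict_comment, strict_names) only for HTML known to be clean; the lenient defaults match browser behavior more closely

Avoid regexes on raw markup where possible: parse first, then extract

Benchmark different parsing options on representative input

Handle character encoding explicitly: decode input to Perl characters before parsing, or enable utf8_mode for raw UTF-8 bytes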
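For the encoding point, a minimal sketch of decoding bytes before parsing follows. It assumes the input is UTF-8 and uses the core Encode module; the sample bytes are invented.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Parser;

# Raw bytes as they might arrive from a file or socket (UTF-8 assumed).
my $bytes = "<p>caf\xC3\xA9 &amp; cr&egrave;me</p>";
my $html  = decode('UTF-8', $bytes);    # now a string of characters

my $text = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text .= $_[0] }, 'dtext' ],    # entities decoded
);
$parser->parse($html);
$parser->eof;

binmode(STDOUT, ':encoding(UTF-8)');    # encode again on the way out
print "$text\n";
```

Decoding first means literal characters and decoded entities end up in the same representation, so the two accented letters compare and print consistently.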

Troubleshooting

Debugging:

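# HTML::Parser has no debug submodule; a default handler can log events that have no handler of their own
$parser->handler(default => sub { print join(' | ', map { $_ // '' } @_), "\n" }, 'event, text');
$parser->parse($html);
$parser->eof;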

Common bugs:

Infinite loops from recursive handlers

Memory leaks if an HTML::TreeBuilder tree isn't freed with $tree->delete (HTML::Parser itself builds no tree)

Encoding issues

Buggy regex losing text

Error handling:

eval {
  $parser->parse($html);
};
if($@) {
  print "Parse failed: $@";
}
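HTML::Parser rarely dies on bad markup, so in practice the eval mostly catches exceptions thrown by handlers. The sketch below assumes a hypothetical rule that iframe tags abort parsing and shows the error reaching the caller.

```perl
use strict;
use warnings;
use HTML::Parser;

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [
        # Hypothetical policy: refuse documents that contain an iframe.
        sub { die "blocked tag <$_[0]>\n" if $_[0] eq 'iframe' },
        'tagname',
    ],
);

my $error;
my $ok = eval {
    $parser->parse('<p>ok</p><iframe src="x"></iframe>');
    $parser->eof;
    1;    # reached only if nothing died
};
$error = $@ unless $ok;

print "Parse failed: $error" if defined $error;
```

Returning 1 from the eval block and testing that value avoids relying on $@ alone, which can be cleared by destructors on older Perl versions.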

Customizing

Custom parsers:

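package MyParser;
use parent 'HTML::Parser';

# Called for each start tag when the parser is created with MyParser->new (no arguments)
sub start {
  my ($self, $tag, $attr, $attrseq, $origtext) = @_;
  # custom logic
}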
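A complete subclass might look like the following sketch. The HeadingCollector package, its headings method, and the _in_heading and _headings keys are invented for illustration; creating the parser without arguments selects the method-callback style, in which start, end, and text are called as methods.

```perl
package HeadingCollector;
use strict;
use warnings;
use parent 'HTML::Parser';

# Method callbacks used when the parser is created without arguments.
sub start {
    my ($self, $tag) = @_;
    $self->{_in_heading} = 1 if $tag =~ /^h[1-6]$/;
}

sub end {
    my ($self, $tag) = @_;
    $self->{_in_heading} = 0 if $tag =~ /^h[1-6]$/;
}

sub text {
    my ($self, $text) = @_;
    push @{ $self->{_headings} }, $text if $self->{_in_heading};
}

# Hypothetical accessor for the collected heading texts.
sub headings { @{ $_[0]{_headings} || [] } }

package main;

my $parser = HeadingCollector->new;
$parser->parse('<h1>Intro</h1><p>body</p><h2>Details</h2>');
$parser->eof;

my @headings = $parser->headings;
print "$_\n" for @headings;
```

Parser objects are plain hashes, so a subclass can keep its own state in them as long as it avoids keys beginning with _hparser, which the module reserves.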

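Extending:

HTML::Parser ships no plugin system; extend a parser by registering further handlers:

$parser->handler(comment => sub { print "Comment: $_[0]\n" }, 'text');

Modifying architecture:

Avoid reassigning @HTML::Parser::ISA: pointing it at a subclass of HTML::Parser makes the inheritance circular. Override behavior in a subclass such as MyParser instead.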
