The Ultimate HTML::Parser Perl Cheat Sheet

Oct 31, 2023 · 4 min read

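HTML::Parser is an event-driven Perl module that tokenizes HTML (and, with xml_mode enabled, XML-style markup) and calls your handlers for each tag, text run, comment, and declaration it recognizes, which makes it a foundation for content extraction and manipulation.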

Getting Started


cpan HTML::Parser
# or from package manager

HTML::Parser is distributed on CPAN. It can also be installed from most Perl package managers.

Simple parsing:

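use HTML::Parser;

my $html = '<p>Hello, <b>world</b>!</p>';
my $text = '';
# api_version 3 handler: append the decoded text of each text event
my $parser = HTML::Parser->new(
  api_version => 3,
  text_h      => [ sub { $text .= shift }, 'dtext' ],
);
$parser->parse($html);
$parser->eof;
print $text;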

This parses an HTML string and prints the extracted text content.


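  • Parser tokenizes HTML/XML input into a stream of events; it does not build a tree itself
  • Handlers called on events during parsing
  • A tree can be built with a companion module such as HTML::TreeBuilder and traversed/manipulated after parsing

Parsing HTML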

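    From string:

    $parser->parse($html);
    $parser->eof; # signal end of document

    From file:

    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    $parser->parse_file($fh); # also accepts a filename

    From URL:

    use LWP::Simple;
    my $html = get($url) or die "Could not fetch $url";
    $parser->parse($html);
    $parser->eof;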
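As a rough end-to-end sketch of the file case, the script below writes a sample document to a temporary file and parses it with parse_file. The sample markup and the variable names are made up for illustration; only the HTML::Parser and File::Temp calls are real API.

```perl
use strict;
use warnings;
use HTML::Parser;
use File::Temp qw(tempfile);

# Write a small sample document to a temporary file (stand-in for a real file).
my ($fh, $filename) = tempfile(UNLINK => 1);
print {$fh} "<html><body><h1>Title</h1> <p>Body text</p></body></html>";
close $fh or die "Cannot close $filename: $!";

# Accumulate the decoded text of every text event.
my $text = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text .= shift }, 'dtext' ],
);

# parse_file reads the file in chunks and signals end-of-document itself.
$parser->parse_file($filename) or die "Cannot parse $filename: $!";

print "$text\n";
```

Because parse_file feeds the parser in chunks, accumulating text in one string is safer than assuming each text run arrives as a single event.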

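    Parse options:

    $parser->strict_comment(1); # parse comments by strict SGML rules instead of ending at the first "-->"
    $parser->unbroken_text(1); # report each run of text in a single text event
    $parser->parse($html); # partial HTML is fine; the parser does not require a complete document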
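To illustrate what an option changes in practice, here is a small sketch using unbroken_text while feeding the parser in two chunks, as might happen when reading from a socket. The chunk boundaries and sample markup are assumptions chosen for the demonstration.

```perl
use strict;
use warnings;
use HTML::Parser;

my @events;
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { push @events, shift }, 'dtext' ],
);

# Hold text back until the next non-text event so it arrives in one piece.
$parser->unbroken_text(1);

# Feed the document in two chunks; the split falls in the middle of a word.
$parser->parse('<p>Hello wo');
$parser->parse('rld</p>');
$parser->eof;

print scalar(@events), " text event(s): @events\n";
```

The trade-off is latency: with unbroken_text enabled, a text event is delayed until the parser has seen whatever follows the text.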

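    Accessing Elements

    HTML::Parser does not keep elements after parsing. For lookups, build a tree with HTML::TreeBuilder (a subclass of HTML::Parser):

    use HTML::TreeBuilder;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    By tag name:

    my @divs = $tree->look_down(_tag => 'div');

    By attribute:

    my @inputs = $tree->look_down(sub { defined $_[0]->attr('name') });

    With XPath:

    my @links = $tree->findnodes('//a'); # needs a tree built with HTML::TreeBuilder::XPath

    Extract links:

    my @links = @{ $tree->extract_links() }; # each entry is [$url, $element, $attr_name, $tag]

    Element content:

    my $content = $element->as_text; # text inside the element; as_HTML returns the markup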
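When a full tree is not needed, the same kind of extraction can be done with HTML::Parser alone by collecting data in a start handler. The following is a minimal sketch that gathers href values from anchor tags; the sample HTML and variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = <<'HTML';
<div><a href="/one">One</a> <a name="anchor">No href</a>
<a href="https://example.com/two">Two</a></div>
HTML

my @links;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tag, $attr) = @_;
            # Keep only anchors that actually carry an href attribute.
            push @links, $attr->{href} if $tag eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);
$parser->report_tags('a');   # deliver start/end events for <a> only
$parser->parse($html);
$parser->eof;

print "$_\n" for @links;
```

report_tags keeps the handler from being called for every other element, which is a cheap way to narrow the event stream on large pages.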

    Manipulating HTML

    Modify elements:

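    $parser->handler(start => sub {
      my ($tag, $attr) = @_;
      $attr->{class} = 'newclass';
      # re-emit the tag with the updated attributes (values are not re-escaped here)
      print "<$tag", (map { qq( $_="$attr->{$_}") } sort keys %$attr), ">";
    }, "tagname, attr");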

    Remove elements:

    $parser->ignore_elements("script"); # suppress <script> tags and everything inside them

    Modify text:

    $parser->handler(text => sub {
      my $text = shift;
      $text =~ s/foo/bar/g;
      print $text; # handler return values are ignored, so emit the result here
    }, "text");

    Insert elements:

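    $parser->handler(start => sub {
      my ($tag, $text) = @_;
      print $text; # pass the original tag through
      print "<div>New elem</div>" if $tag eq 'body'; # insert right after <body>
    }, "tagname, text");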
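Putting these pieces together, a rewriting filter can be sketched as a default handler that copies markup through unchanged, a text handler that edits text, and ignore_elements to drop unwanted elements. The sample input and the $out buffer are assumptions for the example.

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = '<p>foo</p><script>foo()</script><b class="x">foo bar</b>';
my $out  = '';

my $parser = HTML::Parser->new(
    api_version => 3,
    # Copy every event's source text through unless a specific handler exists.
    default_h => [ sub { $out .= shift }, 'text' ],
    # Rewrite text content only; tags are left untouched.
    text_h    => [ sub { my $t = shift; $t =~ s/foo/bar/g; $out .= $t }, 'text' ],
);
$parser->unbroken_text(1);           # keep each text run whole before substituting
$parser->ignore_elements('script');  # drop <script> elements and their content
$parser->parse($html);
$parser->eof;

print "$out\n";
```

Using the raw text argspec (rather than dtext) keeps entities as they were written, so the output stays valid markup.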

    Handlers and Events

    Start handler:

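    $parser->handler(start => \&start, "tagname, attr, self");
    sub start {
      my ($tag, $attr, $self) = @_;
      return unless $tag eq 'div'; # the last argument to handler() is an argspec, not a tag filter
      print "Start $tag\n";
    }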

    End handler:

    $parser->handler(end => \&end, "tagname");
    sub end {
      my ($tag) = @_;
      print "End div\n" if $tag eq 'div';
    }

    Available events: text, start, end, comment, declaration, process, start_document, end_document, default
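One way to see which events a document produces is to install only a default handler and log the event name. This is a small sketch with made-up sample markup; it records names rather than acting on them.

```perl
use strict;
use warnings;
use HTML::Parser;

my @log;
my $parser = HTML::Parser->new(
    api_version => 3,
    # With no specific handlers installed, every event falls through to default.
    default_h => [ sub { push @log, shift }, 'event' ],
);
$parser->parse('<!DOCTYPE html><!-- note --><p>hi</p><?php echo 1 ?>');
$parser->eof;

print join(', ', @log), "\n";
```

Replacing the 'event' argspec with 'event, line, text' gives a quick trace that is often enough to find out why a handler is not firing.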

    Tree Traversal

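    HTML::Parser only emits events; a traversable tree comes from HTML::TreeBuilder, which is built on it:

    use HTML::TreeBuilder;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    sub process {
      my $node = shift;
      # process node, then recurse into child elements (text children are plain strings)
      process($_) for grep { ref } $node->content_list;
    }
    process($tree);
    my $parent = $tree->parent; # undef for the root element
    my @children = $tree->content_list;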
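If only HTML::Parser is available, parent/child context can be approximated by keeping a stack of open elements while events arrive. The sketch below records the element path for each piece of text; the stack handling is a simplification and the sample markup is invented.

```perl
use strict;
use warnings;
use HTML::Parser;

my @stack;   # names of currently open elements, outermost first
my @found;   # "path: text" records

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { push @stack, shift }, 'tagname' ],
    end_h   => [
        sub {
            my ($tag) = @_;
            return unless grep { $_ eq $tag } @stack;   # ignore stray end tags
            1 while pop(@stack) ne $tag;                # also closes unclosed children
        },
        'tagname',
    ],
    text_h  => [
        sub {
            my ($text) = @_;
            return unless $text =~ /\S/;                # skip whitespace-only runs
            push @found, join('/', @stack) . ": $text";
        },
        'dtext',
    ],
);
$parser->unbroken_text(1);
$parser->parse('<div><ul><li>one</li><li>two</li></ul><p>three</p></div>');
$parser->eof;

print "$_\n" for @found;
```

Void elements such as br or img never get an end tag, so in this simple version they stay on the stack until their parent closes; a fuller version would skip them in the start handler.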


    Web scraping:

    use Web::Scraper;
    my $scraper = scraper {
      process "div.results a", "links[]" => '@href'; # collect href from each link in div.results
    };
    my $result = $scraper->scrape($html);


    get '/' => sub {
      my $parser = HTML::Parser->new;
      # process parser
    };


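    use HTML::Parser;
    use XML::FeedPP;
    my $parser = HTML::Parser->new;
    my $feed = XML::FeedPP->new($feed_url); # XML::FeedPP takes a feed source, not a parser
    $parser->parse($_->description // '') for $feed->get_item; # run each item's HTML through the parser
    $parser->eof;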

    Parsing Edge Cases

    Malformed HTML:

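    HTML::Parser is lenient by default: it does not raise errors on malformed markup and reports whatever tags and text it can recognize, so no option is needed.
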

    Extract data from invalid markup:

    my @dates;
    $parser->handler(text => sub {
      my $text = shift;
      if ($text =~ /(\d{4}-\d{2}-\d{2})/) {
        push @dates, $1; # extract date; handler return values are ignored
      }
    }, "dtext");
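A self-contained version of this idea, run against deliberately broken markup, might look like the sketch below. The sample input (unclosed and mismatched tags) is invented to show that the parser keeps reporting text regardless.

```perl
use strict;
use warnings;
use HTML::Parser;

my @dates;
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [
        # In list context, a /g match returns every captured date (or nothing).
        sub { push @dates, $_[0] =~ /(\d{4}-\d{2}-\d{2})/g },
        'dtext',
    ],
);
$parser->unbroken_text(1);   # avoid a date being split across two text events

# Unclosed <p> and <span>, plus a stray </b>: none of this stops the parser.
$parser->parse('<div><p>Posted 2023-10-31<span>updated 2023-11-02</div></b> trailing');
$parser->eof;

print "$_\n" for @dates;
```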

    Parse fragments:

    $parser->parse('<li>partial</li>'); # no surrounding document is required
    $parser->eof;


    Embedded content:

    # Text handlers never see a "<style" tag in their text, so matching on it does nothing.
    # Suppress embedded CSS and JavaScript at the parser level instead:
    $parser->ignore_elements(qw(style script));

    Best Practices

  • Enable strict_comment and strict_names only when the input is known to be clean HTML
  • Avoid regex when possible - parse then extract
  • Benchmark different parsing options
  • Handle character encoding explicitly
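For the encoding point, one common approach is to decode raw bytes into Perl characters before handing them to the parser. This is a minimal sketch that assumes the input is UTF-8; the byte string is a made-up sample.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Parser;

# Raw bytes as they might arrive from a file or socket (UTF-8 encoded).
my $bytes = "<p>caf\xC3\xA9 &amp; cr\xC3\xA8me</p>";

# Decode to Perl characters first so text and decoded entities mix safely.
my $html = decode('UTF-8', $bytes);

my $text = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text .= shift }, 'dtext' ],
);
$parser->parse($html);
$parser->eof;

# Encode again on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

If the bytes cannot be decoded up front, HTML::Parser also offers a utf8_mode option for parsing raw UTF-8 input.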
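    Troubleshooting

    Debugging (log every event with a default handler):

    $parser->handler(default => sub { print STDERR "$_[0]: $_[1]\n" }, "event, text");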

    Common bugs:

  • Infinite loops from recursive handlers
  • Memory leaks if an HTML::TreeBuilder tree isn't freed with $tree->delete
  • Encoding issues
  • Buggy regex losing text
    Error handling:

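    eval {
      $parser->parse($html); # exceptions thrown inside handlers propagate out of parse
      $parser->eof;
    };
    if ($@) {
      print "Parse failed: $@";
    }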
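Since HTML::Parser itself tolerates bad markup, errors in practice usually come from your own handlers. The sketch below has a handler die on a tag it refuses to accept and catches that around parse; the "forbidden tag" rule is purely illustrative.

```perl
use strict;
use warnings;
use HTML::Parser;

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tag, $line) = @_;
            # Hypothetical policy for the example: reject <blink> outright.
            die "forbidden <$tag> at line $line\n" if $tag eq 'blink';
        },
        'tagname, line',
    ],
);

# Return 1 from the eval so success does not depend on $@ being empty.
my $ok = eval {
    $parser->parse("<p>fine</p>\n<blink>nope</blink>");
    $parser->eof;
    1;
};
my $error = $ok ? '' : $@;
print $ok ? "Parsed cleanly\n" : "Parse failed: $error";
```

After a handler has died mid-parse, it is probably safest to discard the parser object and create a fresh one rather than reuse it.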


    Custom parsers:

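    package MyParser;
    use parent 'HTML::Parser';
    sub start {
      # version 2 style callback, used when new() is called without handler arguments
      my ($self, $tag, $attr, $attrseq, $origtext) = @_;
      # custom logic
    }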
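A fuller sketch of a subclass is shown below. It registers its own start method through the version 3 handler interface in new, which is one of several possible designs; the TagCounter class, its tag_count key, and the counts method are hypothetical names for this example.

```perl
use strict;
use warnings;

{
    package TagCounter;   # hypothetical subclass, for illustration only
    use parent 'HTML::Parser';

    sub new {
        my ($class, %args) = @_;
        # Route start events to the start() method below; "self" must come
        # first in the argspec when the handler is a method name.
        my $self = $class->SUPER::new(
            api_version => 3,
            start_h     => [ 'start', 'self, tagname' ],
            %args,
        );
        $self->{tag_count} = {};
        return $self;
    }

    sub start {
        my ($self, $tag) = @_;
        $self->{tag_count}{$tag}++;
    }

    sub counts { return $_[0]{tag_count} }
}

my $counter = TagCounter->new;
$counter->parse('<ul><li>a</li><li>b</li></ul>');
$counter->eof;

my $counts = $counter->counts;
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

Keeping state on the object (instead of in closures) lets several parsers run side by side without sharing counters.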


    use HTML::TokeParser; # HTML::Parser has no plugin module; companion modules like this pull-style interface ship with it

    Modifying architecture:

    my $parser = MyParser->new; # extend behavior through the subclass; reassigning @HTML::Parser::ISA would create an inheritance loop
