The Ultimate HTML::Parser Perl Cheat Sheet

Oct 31, 2023 · 4 min read

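HTML::Parser is an event-driven Perl module: it recognizes tags, text, comments, and declarations in an HTML document and calls your handlers for each one, which makes it a foundation for content extraction and filtering. It does not build a document tree itself; modules such as HTML::TreeBuilder do that on top of it.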

Getting Started

Installation:

cpan HTML::Parser
# or from package manager

HTML::Parser is distributed on CPAN as the HTML-Parser distribution. Most operating system package managers also ship it, for example as libhtml-parser-perl on Debian and Ubuntu.

Simple parsing:

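use HTML::Parser;

# Accumulate decoded text as the parser reports it
my $text = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text .= shift }, 'dtext' ],
);
$parser->parse($html);
$parser->eof;   # signal end of document and flush buffered text

print $text;

This parses an HTML string and prints its text content with entities decoded. HTML::Parser has no text accessor; the text handler collects the content while parsing runs.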

Concepts:

  • Parser tokenizes HTML input and reports events; it does not build a tree
  • Handlers called on events during parsing
  • For a tree that can be traversed/manipulated after parsing, use HTML::TreeBuilder, which is built on HTML::Parser
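
To make the event model concrete, here is a minimal sketch that records the events fired for a small snippet. The sample HTML and the @events array are made up for illustration; the constructor options and argspec names (tagname, dtext) are standard HTML::Parser features.

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = '<p class="intro">Hello <b>world</b></p>';

# Each handler pushes a short description of the event it receives.
my @events;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { push @events, "start:$_[0]" }, 'tagname' ],
    end_h   => [ sub { push @events, "end:$_[0]"   }, 'tagname' ],
    text_h  => [ sub { push @events, "text:$_[0]"  }, 'dtext'   ],
);
$parser->unbroken_text(1);   # keep each run of text in one event
$parser->parse($html);
$parser->eof;

print "$_\n" for @events;
```

The handlers see a flat stream of start, text, and end events in document order; any nesting has to be tracked by your own code.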
Parsing HTML

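From string:

$parser->parse($html_string);
$parser->eof;   # signal end of document

From file:

open(my $fh, '<', $file) or die "Cannot open $file: $!";
$parser->parse_file($fh);   # also accepts a file name; calls eof for you

From URL:

use LWP::Simple;
my $html = get($url);
die "Could not fetch $url" unless defined $html;
$parser->parse($html);
$parser->eof;

Parse options:

$parser->unbroken_text(1);   # report each run of text in one piece
$parser->xml_mode(1);        # XML-style syntax: case-sensitive names, <br/> empty tags
$parser->strict_comment(1);  # parse comments strictly per the HTML spec

# No strict or fragment mode exists: the parser does not die on invalid
# HTML, and parse() accepts partial HTML as is.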
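
Because parse() works on chunks, input can be fed piece by piece, for example while reading from a socket. The sketch below assumes three made-up chunks, one of which splits a tag in the middle; the parser buffers incomplete markup until the rest arrives.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical chunks; the second one ends in the middle of the <em> tag.
my @chunks = ('<p>Hel', 'lo, <e', 'm>world</em></p>');

my $text = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h => [ sub { $text .= shift }, 'dtext' ],
);

$parser->parse($_) for @chunks;   # markup may be split anywhere
$parser->eof;                     # flush whatever is still buffered

print "$text\n";
```

Calling eof at the end matters: without it, trailing text can stay in the parser's buffer and never reach the handler.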
    

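Accessing Elements

HTML::Parser has no element lookup methods, because it keeps no tree. The lookups below use HTML::TreeBuilder from the HTML-Tree distribution and assume a tree built with my $tree = HTML::TreeBuilder->new_from_content($html).

By tag name:

my @divs = $tree->look_down(_tag => 'div');

By attribute:

my @inputs = $tree->look_down(sub { defined $_[0]->attr('name') });

With XPath:

my @links = $tree->findnodes('//a');   # requires an HTML::TreeBuilder::XPath tree

Extract links:

my @links = @{ $tree->extract_links('a') };   # each entry: [$url, $element, $attr_name, $tag]

Element content:

my $content = $element->as_HTML;   # markup of the element; as_text gives text only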
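
When a full tree is more than you need, the same lookups can be done with HTML::Parser alone by filtering inside a start handler. This is a minimal sketch; the sample HTML and the @div_ids, @named, and @hrefs arrays are invented for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = <<'HTML';
<div id="nav"><a href="/home">Home</a> <a href="/docs">Docs</a></div>
<div id="main"><input name="q"><a name="top">no href here</a></div>
HTML

my (@div_ids, @named, @hrefs);
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;
        push @div_ids, $attr->{id} if $tag eq 'div';          # by tag name
        push @named,   $tag        if exists $attr->{name};   # by attribute
        push @hrefs,   $attr->{href}
            if $tag eq 'a' && defined $attr->{href};          # links
    }, 'tagname, attr' ],
);
$parser->parse($html);
$parser->eof;

print "divs: @div_ids\n";
print "elements with a name attribute: @named\n";
print "links: @hrefs\n";
```

The handler only sees one start tag at a time, so this approach fits lookups that depend on the tag and its attributes, not on its descendants.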
    

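Manipulating HTML

HTML::Parser holds no document to modify in place. To rewrite HTML, print what each handler receives, changed as needed. The snippets below assume a pass-through for everything else: $parser->handler(default => sub { print shift }, 'text').

Modify elements:

use HTML::Entities qw(encode_entities);
$parser->handler(start => sub {
  my ($tag, $attr) = @_;
  $attr->{class} = 'newclass';
  my $attrs = join '', map { qq( $_=") . encode_entities($attr->{$_}) . '"' } sort keys %$attr;
  print "<$tag$attrs>";   # re-emit the tag with the changed attributes
}, 'tagname, attr');

Remove elements:

$parser->ignore_elements('script');   # drops the tags and everything between them

Modify text:

$parser->handler(text => sub {
  my $text = shift;
  $text =~ s/foo/bar/g;
  print $text;   # return values are ignored, so emit the result
}, 'text');

Insert elements:

$parser->handler(start => sub {
  my ($tag, $text) = @_;
  print $text;   # the original tag
  print "<div>New elem</div>" if $tag eq 'body';   # new first child of <body>
}, 'tagname, text');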
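
Put together, these pieces form a small filter. The sketch below assumes a made-up input string and collects the result in $out instead of printing each piece; it passes markup through unchanged, drops script elements, and rewrites text.

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = '<p>foo <script>alert("foo")</script>and more foo</p>';

my $out = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    # Anything without its own handler is copied through verbatim.
    default_h => [ sub { $out .= shift }, 'text' ],
    # Text is rewritten before it is copied.
    text_h    => [ sub { my $t = shift; $t =~ s/foo/bar/g; $out .= $t }, 'text' ],
);
$parser->ignore_elements('script');   # no events at all for <script>...</script>
$parser->parse($html);
$parser->eof;

print "$out\n";
```

The text handler uses the text argspec, not dtext, so entities in the source stay encoded and the output remains valid HTML.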
    

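Handlers and Events

Start handler:

# The third argument is an argspec naming what the handler receives
$parser->handler(start => \&start, 'tagname, attr, self');

sub start {
  my ($tag, $attr, $self) = @_;
  return unless $tag eq 'div';   # filter by tag inside the handler

  print "Start $tag\n";
}

End handler:

$parser->handler(end => \&end, 'tagname');

sub end {
  my ($tag) = @_;
  print "End $tag\n" if $tag eq 'div';
}

Available events: start, end, text, comment, declaration, process, start_document, end_document, default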
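
A handler does not have to be a subroutine. If an array reference is given instead, the parser pushes one array ref of arguments onto it per event. The sketch below uses that to log events; the @log array and the sample input are invented for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

my @log;   # each event appends one array ref: [event, line, tagname or text]
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h   => [ \@log, 'event, line, tagname' ],
    end_h     => [ \@log, 'event, line, tagname' ],
    comment_h => [ \@log, 'event, line, text' ],
);
$parser->parse("<ul>\n<li>one</li>\n<!-- note -->\n</ul>");
$parser->eof;

printf "%-8s line %d  %s\n", @$_ for @log;
```

Accumulating into an array keeps the parsing pass free of logic, and the recorded events can be inspected or replayed afterwards.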

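Tree Traversal

Traversal needs a tree, so this uses HTML::TreeBuilder rather than HTML::Parser handlers.

use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_content($html);
$tree->traverse(\&process);

sub process {
  my ($node, $is_start) = @_;   # elements are objects, text nodes plain strings

  # process node
  return 1;   # a true value continues into the children
}

# For an HTML::Element $node:
my $parent = $node->parent;
my @children = $node->content_list;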
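
To show how a tree relates to the event stream, here is a minimal sketch that builds a nested structure from start, end, and text events with a stack. The node shape (tag, attr, parent, children) is hypothetical and is not HTML::TreeBuilder's. The sketch assumes well-formed input: void elements such as br and unclosed tags would need extra handling.

```perl
use strict;
use warnings;
use HTML::Parser;
use Scalar::Util qw(weaken);

# Hypothetical node shape for this sketch: { tag, attr, parent, children }.
my $root  = { tag => '#root', children => [] };
my @stack = ($root);

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;
        my $node = { tag => $tag, attr => $attr, parent => $stack[-1], children => [] };
        weaken($node->{parent});   # avoid a reference cycle between parent and child
        push @{ $stack[-1]{children} }, $node;
        push @stack, $node;
    }, 'tagname, attr' ],
    end_h  => [ sub { pop @stack if @stack > 1 && $stack[-1]{tag} eq $_[0] }, 'tagname' ],
    text_h => [ sub { push @{ $stack[-1]{children} }, $_[0] if $_[0] =~ /\S/ }, 'dtext' ],
);
$parser->parse('<div><p>one</p><p>two <b>bold</b></p></div>');
$parser->eof;

# Depth-first walk; text nodes are plain strings.
sub outline {
    my ($node, $depth) = @_;
    my $indent = '  ' x $depth;
    return $indent . qq("$node") unless ref $node;
    return ($indent . $node->{tag}, map { outline($_, $depth + 1) } @{ $node->{children} });
}

my @outline = map { outline($_, 0) } @{ $root->{children} };
print "$_\n" for @outline;
```

The weakened parent link means dropping $root frees the whole structure, which sidesteps the leak that cyclic trees can cause.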
    

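Integration

Web scraping:

use Web::Scraper;

# Web::Scraper parses with HTML::TreeBuilder, which is built on HTML::Parser
my $scraper = scraper {
  process "div.results a", "links[]" => '@href';
};

my $result = $scraper->scrape($html);   # links are in $result->{links}

Mojolicious:

get '/' => sub {
  my $c = shift;
  my $text = '';
  my $parser = HTML::Parser->new(api_version => 3, text_h => [ sub { $text .= shift }, 'dtext' ]);
  $parser->parse($html);
  $parser->eof;

  $c->render(text => $text);
};

Feeds:

use HTML::Parser;
use XML::FeedPP;

my $feed = XML::FeedPP->new($feed_xml);   # URL, file name, or XML string
for my $item ($feed->get_item) {
  # Strip markup from each item's HTML description
  my $text = '';
  my $parser = HTML::Parser->new(api_version => 3, text_h => [ sub { $text .= shift }, 'dtext' ]);
  $parser->parse($item->description // '');
  $parser->eof;
  print $item->title, ": $text\n";
}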
    

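Parsing Edge Cases

Malformed HTML:

The parser is lenient by default and does not raise errors on broken markup. The strict_* options make it less tolerant, so leave them off for messy input:

$parser->strict_names(0);   # the default: accept almost anything in tag and attribute names

Extract data from invalid markup:

my @dates;
$parser->handler(text => sub {
  my $text = shift;
  while ($text =~ /(\d{4}-\d{2}-\d{2})/g) {
    push @dates, $1;   # collect each date; return values are ignored
  }
}, 'dtext');

Parse fragments:

$parser->parse($html);   # parse() accepts partial HTML; no separate method exists
$parser->eof;

Embedded content:

$parser->ignore_elements(qw(script style));   # skip CSS and JavaScript entirely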
    

Best Practices

  • Enable the strict_* options only for HTML known to be clean
  • Avoid regex when possible - parse then extract
  • Benchmark different parsing options
  • Handle character encoding explicitly: decode input before parsing, or set utf8_mode(1) for raw UTF-8 bytes
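
Explicit encoding handling can look like the sketch below, which assumes the input arrives as raw UTF-8 bytes (the sample string is made up). The bytes are decoded to characters with the core Encode module before parsing, so decoded entities and literal characters end up in the same form.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Parser;

# Raw UTF-8 bytes, as they would arrive from a file or socket.
my $bytes = "<p>caf\xC3\xA9 &amp; cr&egrave;me</p>";

my $text = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h => [ sub { $text .= shift }, 'dtext' ],
);
$parser->parse(decode('UTF-8', $bytes));   # parse characters, not bytes
$parser->eof;

binmode STDOUT, ':encoding(UTF-8)';        # encode again on the way out
print "$text\n";
```

Decoding first and encoding last keeps every string inside the program in one consistent form.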
Troubleshooting

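Debugging:

HTML::Parser ships no debug module. A default handler that prints every event shows what the parser sees:

$parser->handler(default => sub { print "@_\n" }, 'event, line, text');
$parser->parse($html);

Common bugs:

  • Infinite loops from recursive handlers
  • Memory leaks if an HTML::TreeBuilder tree isn't freed with $tree->delete
  • Encoding issues
  • Buggy regex losing text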
  • Buggy regex losing text
Error handling:

# parse() tolerates bad markup; eval catches exceptions thrown by handlers
eval {
  $parser->parse($html);
  $parser->eof;
};
if ($@) {
  print "Parse failed: $@";
}
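
Since the parser itself does not reject bad markup, validation errors come from your own handlers. The sketch below assumes a hypothetical rule that blink elements are not allowed; the handler dies when it sees one, and eval around the parse call catches the exception.

```perl
use strict;
use warnings;
use HTML::Parser;

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $line) = @_;
        # Hypothetical validation rule for this sketch.
        die "unexpected <$tag> at line $line\n" if $tag eq 'blink';
    }, 'tagname, line' ],
);

my $error = '';
my $ok = eval {
    $parser->parse("<p>fine</p>\n<blink>not fine</blink>");
    $parser->eof;
    1;   # reached only if no handler died
};
$error = $@ unless $ok;

print $ok ? "Parsed cleanly\n" : "Parse failed: $error";
```

Returning 1 from the eval block and testing that value is more robust than checking $@ alone, since $@ can be cleared by other code before it is read.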
    

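Customizing

Custom parsers:

package MyParser;
use parent 'HTML::Parser';

# Called for each start tag when new() is given no handler arguments (version 2 API)
sub start {
  my ($self, $tag, $attr, $attrseq, $origtext) = @_;
  # custom logic
}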
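
A subclass can also use the version 3 API by naming its own methods as handlers. The sketch below is a hypothetical TagCounter class; the class name, the count_tag method, and the counts key are invented for illustration. When a handler is a method name, self must come first in the argspec.

```perl
use strict;
use warnings;

package TagCounter;
use parent 'HTML::Parser';

# Hypothetical subclass: counts start tags by name.
sub new {
    my ($class) = @_;
    my $self = $class->SUPER::new(
        api_version => 3,
        start_h     => [ 'count_tag', 'self, tagname' ],   # method name, not a code ref
    );
    $self->{counts} = {};   # keys starting with "_hparser" are reserved by HTML::Parser
    return $self;
}

sub count_tag {
    my ($self, $tag) = @_;
    $self->{counts}{$tag}++;
}

package main;

my $counter = TagCounter->new;
$counter->parse('<ul><li>a</li><li>b</li></ul>');
$counter->eof;

print "$_: $counter->{counts}{$_}\n" for sort keys %{ $counter->{counts} };
```

Keeping state on the object instead of in closures makes the parser reusable and easier to test.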
    

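Extending:

HTML::Parser has no plugin system. Reusable behavior is packaged as subclasses, several of which ship with it:

use HTML::LinkExtor;
my $extor = HTML::LinkExtor->new(sub { my ($tag, %links) = @_; print "$tag: @{[ values %links ]}\n" });

Modifying architecture:

Assigning MyParser to @HTML::Parser::ISA makes the two classes inherit from each other, which Perl rejects as recursive inheritance. Construct the subclass instead:

my $parser = MyParser->new;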
    
