The Ultimate HTML::Parser Perl Cheat Sheet

Oct 31, 2023 · 4 min read

HTML::Parser is a Perl module that parses HTML documents as a stream of events, calling handler subroutines you register for start tags, end tags, text, comments, and other markup. It is the low-level foundation for most HTML extraction and filtering work on CPAN.

Getting Started

Installation:

cpan HTML::Parser
# or from package manager

HTML::Parser is distributed on CPAN. It is also packaged by most system package managers (for example, libhtml-parser-perl on Debian and Ubuntu).
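If App::cpanminus is available (an assumption; cpanm is a separate install), it is often the quicker route, and a one-liner confirms which version was installed:

cpanm HTML::Parser
perl -MHTML::Parser -e 'print "$HTML::Parser::VERSION\n"'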

Simple parsing:

use HTML::Parser;

my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { print shift }, "dtext" ],
);
$parser->parse($html);
$parser->eof;

This registers a text handler and prints the decoded text content of the HTML string as it is parsed. HTML::Parser does nothing on its own; all output comes from the handlers you install.

Concepts:

  • HTML::Parser is event-driven: it streams through the input and does not build a tree
  • Handlers you register are called as start tags, end tags, text, and comments are encountered (handlers can also be passed to new(), as sketched below)
  • An argspec string controls which arguments each handler receives
  • For a traversable document tree, use HTML::TreeBuilder, which subclasses HTML::Parser
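A minimal sketch of the constructor style, with handlers supplied directly to new():

use HTML::Parser;

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { print "open $_[0]\n" },  "tagname" ],
    end_h       => [ sub { print "close $_[0]\n" }, "tagname" ],
    text_h      => [ sub { print "text: $_[0]\n" }, "dtext" ],
);
$p->parse("<p>Hi <b>there</b></p>");
$p->eof;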
    Parsing HTML

    From string:

    $parser->parse($html_string);
    

    From file:

    open(my $fh, '<', $file) or die "Can't open $file: $!";
    $parser->parse_file($fh);   # a plain filename works here too
    

    From URL:

    use LWP::Simple;
    my $html = get($url) or die "Failed to fetch $url";
    $parser->parse($html);
    

    Parse options:

    $parser->strict_comment(1);   # be strict about comment syntax

    $parser->xml_mode(1);         # recognize XML-style empty element tags like <br/>

    $parser->ignore_elements(qw(script style));   # skip everything inside these elements

    $parser->unbroken_text(1);    # deliver text as whole chunks rather than arbitrary fragments

    There is no junk_text or parse_fragment method; the parser tolerates messy
    input by default, and fragments can be fed straight to parse().
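    Putting the options together, here is a sketch that collects the visible text
    of a page while skipping script and style content (dtext delivers
    entity-decoded text):

    my $text = '';
    my $p = HTML::Parser->new(
      api_version => 3,
      text_h      => [ sub { $text .= shift }, "dtext" ],
    );
    $p->ignore_elements(qw(script style));   # their contents are never reported
    $p->unbroken_text(1);                    # text arrives in whole chunks
    $p->parse($html);
    $p->eof;
    print $text;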
    

    Accessing Elements

    HTML::Parser has no DOM-style query methods; you collect what you need from
    handlers as the document streams past (or build a tree with HTML::TreeBuilder).

    By tag name:

    $parser->report_tags('div');   # only report <div> start/end events
    $parser->handler(start => sub { push @divs, $_[1] }, "tagname, attr");   # collect each div's attributes

    By attribute:

    $parser->handler(start =>
      sub { push @named, $_[0] if $_[1]{name} }, "tagname, attr");   # tags carrying a name attribute

    With XPath (needs a tree; HTML::TreeBuilder::XPath is a separate module):

    my @links = HTML::TreeBuilder::XPath->new_from_content($html)->findnodes('//a');

    Extract links (HTML::LinkExtor ships in the same distribution):

    my $extor = HTML::LinkExtor->new;
    my @links = $extor->parse($html)->links;   # each entry is [$tag, attr => url, ...]

    Element content:

    There are no element objects to index into; capture an element's content by
    setting a flag in its start handler and collecting text until the matching
    end handler fires (see the worked sketch below).
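    A worked sketch of that flag technique: capture the text inside every <title>
    and <h1> element (the tag names are just an example):

    my ($grab, %content);
    my $p = HTML::Parser->new(api_version => 3);
    $p->handler(start => sub { $grab = $_[0] if $_[0] =~ /^(?:title|h1)$/ }, "tagname");
    $p->handler(end   => sub { undef $grab if $grab && $_[0] eq $grab },     "tagname");
    $p->handler(text  => sub { $content{$grab} .= $_[0] if $grab },          "dtext");
    $p->parse($html);
    $p->eof;
    print "$_: $content{$_}\n" for sort keys %content;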
    

    Manipulating HTML

    HTML::Parser never modifies a document in place; handlers see each piece of
    markup and you rebuild whatever output you want (the full pattern is sketched
    after these snippets).

    Modify elements:

    $parser->handler(start => sub {
      my ($tagname, $attr, $text) = @_;
      $attr->{class} = 'newclass' if $tagname eq 'div';
      # re-emit the tag from $attr when rebuilding the output
    }, "tagname, attr, text");

    Remove elements:

    $parser->ignore_elements('script');   # <script> tags and everything inside them are skipped

    Modify text:

    $parser->handler(text => sub {
      my $text = shift;
      $text =~ s/foo/bar/g;
      $output .= $text;   # return values are ignored; append to your own output
    }, "text");

    Insert elements:

    $parser->handler(end => sub {
      my ($tagname, $text) = @_;
      $output .= $text;
      $output .= '<div>New elem</div>' if $tagname eq 'body';   # inject markup after </body>
    }, "tagname, text");
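    Those snippets combine into the usual filtering pattern: echo every event's
    raw text and rewrite only the parts you care about. A sketch (the div rule is
    an arbitrary example, and attribute re-quoting is simplified):

    my $out = '';
    my $p = HTML::Parser->new(api_version => 3);
    $p->handler(default => sub { $out .= $_[0] }, "text");   # copy everything else through untouched
    $p->handler(start => sub {
      my ($tagname, $attr, $text) = @_;
      if ($tagname eq 'div') {
        $attr->{class} = 'newclass';
        $text = '<div' . join('', map { qq( $_="$attr->{$_}") } sort keys %$attr) . '>';
      }
      $out .= $text;
    }, "tagname, attr, text");
    $p->parse($html);
    $p->eof;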
    

    Handlers and Events

    Start handler:

    $parser->handler(start => \&start, "tagname, attr");

    sub start {
      my ($tagname, $attr) = @_;

      print "Start $tagname\n";
    }

    The third argument is the argspec naming which values the handler receives;
    it is not a tag filter. Use report_tags() to restrict which tags are reported.
    

    End handler:

    $parser->handler(end => \&end, "tagname");

    sub end {
      my $tagname = shift;
      print "End $tagname\n";
    }
    

    Available events: start, end, text, comment, declaration, process, default, start_document, end_document
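    Comment and declaration handlers follow the same registration pattern, for
    example:

    $parser->handler(comment     => sub { print "comment: $_[0]\n" }, "text");
    $parser->handler(declaration => sub { print "doctype: $_[0]\n" }, "text");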

    Tree Traversal

    HTML::Parser itself never builds a tree. For tree traversal, use
    HTML::TreeBuilder (from the separate HTML-Tree distribution), which subclasses
    HTML::Parser and produces HTML::Element nodes:

    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    $tree->traverse(\&process);   # visit every node in the tree

    sub process {
      my $node = shift;

      # process node
      return 1;   # a true return keeps descending into the node's children
    }

    my $parent   = $node->parent;        # on any HTML::Element node
    my @children = $node->content_list;
    $tree->delete;                       # free the tree when you are done
    

    Integration

    Web scraping:

    use Web::Scraper;

    my $scraper = scraper {
      process "div.results a", "links[]" => '@href';
    };

    my $result = $scraper->scrape($html);   # also accepts a URI object or HTTP::Response

    Web::Scraper does its own parsing (via HTML::TreeBuilder::XPath); you hand it
    markup or a URI, not an HTML::Parser object.
    

    Mojolicious:

    use Mojolicious::Lite;

    get '/' => sub {
      my $c = shift;

      my $parser = HTML::Parser->new(api_version => 3);
      # register handlers and parse here
      $parser->parse($html);
      $parser->eof;

      $c->render(text => 'done');
    };

    app->start;
    

    Feeds:

    use HTML::Parser;
    use XML::FeedPP;

    my $feed  = XML::FeedPP->new($feed_url);   # XML::FeedPP parses the feed itself
    my $plain = '';
    my $parser = HTML::Parser->new(
      api_version => 3,
      text_h      => [ sub { $plain .= shift }, "dtext" ],
    );
    $parser->parse($_->description // '') for $feed->get_item;   # strip markup from item bodies
    $parser->eof;
    

    Parsing Edge Cases

    Malformed HTML:

    HTML::Parser does not die on bad markup; it recovers and keeps firing events.
    Optional strictness switches exist for when you control the input:

    $parser->strict_comment(1);
    $parser->strict_names(1);
    

    Extract data from invalid markup:

    my @dates;
    $parser->handler(text => sub {
      push @dates, $1 while $_[0] =~ /(\d{4}-\d{2}-\d{2})/g;   # collect ISO dates; return values are ignored
    }, "text");
    

    Parse fragments:

    $parser->parse($html_fragment);   # no special method; parse() accepts partial HTML
    $parser->eof;
    

    Embedded content:

    Text handlers never see the tags themselves, so checking for "<style" in a
    text event cannot work; skip embedded CSS and JavaScript at the parser level
    instead:

    $parser->ignore_elements(qw(style script));
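    For quick extraction from messy pages, HTML::TokeParser (a pull-style
    interface bundled in the same distribution) is often less code than wiring up
    handlers; a sketch that lists link text and targets:

    use HTML::TokeParser;

    my $p = HTML::TokeParser->new(\$html) or die "cannot parse";
    while (my $tag = $p->get_tag("a")) {
      my $href = $tag->[1]{href} or next;    # [1] is the attribute hash
      my $text = $p->get_trimmed_text("/a");
      print "$text -> $href\n";
    }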
    

    Best Practices

  • Enable the strict_* options only when you control the input markup
  • Avoid regex-on-HTML where possible: parse first, then extract
  • Benchmark handler strategies on realistic documents before tuning
  • Handle character encoding explicitly (see the sketch below)
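    A sketch of explicit encoding handling: decode bytes to characters before
    parsing, or tell the parser the raw input is UTF-8 (the variable names are
    placeholders):

    use Encode qw(decode);

    my $chars = decode('UTF-8', $raw_bytes);   # decode outside the parser...
    $parser->parse($chars);

    # ...or let the parser handle raw UTF-8 input directly:
    $parser->utf8_mode(1);
    $parser->parse($raw_bytes);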
    Troubleshooting

    Debugging:

    There is no bundled HTML::Parser::Debug module; the usual substitute is a
    catch-all default handler (it fires for every event without a dedicated
    handler) that logs what the parser sees:

    $parser->handler(default => sub { warn "[$_[0]] $_[1]\n" }, "event, text");
    $parser->parse($html);
    

    Common bugs:

  • Re-entrant parsing: do not call parse() from inside a handler
  • Memory leaks when HTML::TreeBuilder trees are never delete()d
  • Encoding issues from feeding undecoded bytes to the parser
  • Text split across several text events (enable unbroken_text)

    Error handling:

    eval {
      $parser->parse($html);
    };
    if($@) {
      print "Parse failed: $@";
    }
    

    Customizing

    Custom parsers:

    package MyParser;
    use parent 'HTML::Parser';

    # with the version 2 API (the default when new() is given no arguments),
    # methods like start(), end(), and text() are called during parsing
    sub start {
      my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
      # custom logic
    }
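    Used like any other parser (a sketch; page.html is a placeholder, and calling
    new() with no arguments selects the version 2 method callbacks):

    my $p = MyParser->new;
    $p->parse_file("page.html");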
    

    Extending:

    There is no plugin system for HTML::Parser; extend it by subclassing as above
    or by composing handler subroutines from separate packages.
    

    Modifying @ISA:

    Inheritance goes one way: your class inherits from HTML::Parser, never the
    reverse.

    @MyParser::ISA = ('HTML::Parser');   # or: use parent 'HTML::Parser'; never rewrite @HTML::Parser::ISA
    
