The Complete Python HTML Parser Cheatsheet

Python's built-in html.parser module lets you parse HTML and XHTML documents and extract data from them. This cheatsheet covers the core API, common patterns, and the pitfalls to watch for.

Getting Started

Import the HTMLParser class from the html.parser module:

from html.parser import HTMLParser

Create a parser class inheriting from HTMLParser:

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(f"Encountered a start tag: {tag}")

    def handle_data(self, data):
        print(f"Encountered some data: {data}")

parser = MyHTMLParser()

Feed some HTML to the parser:

html = """<html><head><title>Test Parser</title></head><body><h1>Hello World!</h1></body></html>"""

parser.feed(html)

The parser will call methods to handle tags and data during parsing.
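To make the order of those callbacks concrete, here is a minimal sketch that records events in a list instead of printing them. The EventRecorder class and its events attribute are illustrative names, not part of html.parser.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record parser callbacks in the order they fire."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed("<h1>Hello World!</h1>")
recorder.close()  # flush any text still buffered by the parser
print(recorder.events)
# [('start', 'h1'), ('data', 'Hello World!'), ('end', 'h1')]
```

Collecting events this way is usually easier to test than printing from inside the handlers.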

Parsing HTML/XML

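HTMLParser is designed for HTML and XHTML. It will also tokenize simple XML such as the document below, but it is not an XML parser: it lowercases tag and attribute names and does not check that the document is well-formed. Here is an example XML document: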

<note>
 <to>George</to>
 <from>John</from>
 <heading>Reminder</heading>
 <body>Don't forget the meeting!</body>
</note>

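Feeding this document to the parser works the same way as feeding HTML: the same handler methods are called for each tag and text run. For anything beyond simple documents, prefer a dedicated XML parser.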
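One practical difference is worth seeing before relying on HTMLParser for XML. The sketch below feeds the same mixed-case document to HTMLParser and to the standard library's xml.etree.ElementTree; the TagCollector class and the XML_DOC sample are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XML_DOC = "<Note><To>George</To><Body>Don't forget the meeting!</Body></Note>"


class TagCollector(HTMLParser):
    """Collect the names of all start tags."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(XML_DOC)
collector.close()
html_parser_tags = collector.tags  # names are lowercased by HTMLParser

root = ET.fromstring(XML_DOC)
etree_tags = [element.tag for element in root.iter()]  # case is preserved

print(html_parser_tags)  # ['note', 'to', 'body']
print(etree_tags)        # ['Note', 'To', 'Body']
```

Because XML is case-sensitive, the lowercasing alone can change the meaning of a document, which is one reason to reach for an XML parser when the input is really XML.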

Parsing Strategies

There are several approaches and packages available for parsing HTML and XML in Python:

Built-in HTML Parser

Python's built-in parser from html.parser

Best for:

BeautifulSoup

Extremely popular third-party package

More features for complex parsing

Additional dependencies

Best for:

Regular Expressions

Can parse simple markup with regex patterns

Best for:

XML Parsers

Dedicated XML parsing packages like lxml

Best for:

In many cases Python's built-in HTML parser is your best choice for basic to intermediate parsing needs.

Parsing Document Fragments

You don't need to feed the parser a full HTML document; it works on any document fragment:

fragment = """<div><p>This is a <b>fragment</b></p><p>Of <i>HTML</i> without metadata</p></div>"""

parser.feed(fragment)

Useful for parsing HTML snippets from larger documents or templates.

Asynchronous Parsing

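HTMLParser.feed() is synchronous, but you can run it in a worker thread from asyncio code so the event loop is not blocked:

import asyncio

async def parse_async(html):
    parser = MyHTMLParser()
    loop = asyncio.get_running_loop()
    # Run the blocking feed() call in the default thread pool
    await loop.run_in_executor(None, parser.feed, html)
    parser.close()
    return parser

# Run the coroutine from synchronous code
parser = asyncio.run(parse_async(some_html))

Helpful for keeping an event loop responsive while pages are parsed. The parsing work itself is still CPU-bound.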

Parsing Methods

These are the main parsing methods you can override in a subclass:

handle_starttag(tag, attrs)

Called for each starting tag:

def handle_starttag(self, tag, attrs):
    print(f"Encountered start tag: {tag}")
    attrs_str = "".join([f' {name}="{value}"' for name, value in attrs])
    print(f"<{tag}{attrs_str}>")

Attributes are passed in as a list of (name, value) tuples.
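A short sketch of what the attrs list actually contains may help here. It shows three behaviors of html.parser: tag and attribute names are lowercased, character references in values are decoded, and attributes without a value get None. The AttrCollector class is an illustrative name.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Store each start tag together with its attribute list."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


collector = AttrCollector()
collector.feed('<INPUT TYPE="checkbox" checked value="a &amp; b">')
collector.close()
print(collector.seen)
# [('input', [('type', 'checkbox'), ('checked', None), ('value', 'a & b')])]
```

Code that formats attributes back into markup should therefore handle a None value explicitly.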

handle_endtag(tag)

Called for each ending tag:

def handle_endtag(self, tag):
    print(f"Encountered end tag: {tag}")

handle_data(data)

Called for text blocks between tags:

def handle_data(self, data):
    print(f"Encountered data: {data}")

Useful for extracting text content.
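When extracting text, note that the contents of script and style elements are also delivered to handle_data(). The following sketch skips them with a simple depth counter; the VisibleText class and its attribute names are assumptions made for this example.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text while ignoring <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


extractor = VisibleText()
extractor.feed(
    "<p>Hello</p><script>var x = 1;</script>"
    "<style>p { color: red; }</style><p>World</p>"
)
extractor.close()
print(" ".join(extractor.parts))  # Hello World
```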

handle_comment(data)

Called for HTML comments:

def handle_comment(self, data):
    print(f"Comment: {data}")

handle_entityref(name)

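Called for named character references like &amp; and &copy;, but only when the parser is created with convert_charrefs=False. With the default convert_charrefs=True, references are decoded and delivered through handle_data() instead.

The name does not include the '&' or ';' delimiters.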

def handle_entityref(self, name):
    print(f"Found entity ref: {name}")

handle_charref(name)

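Called for numeric character references like &#1234; (Ӓ), again only when convert_charrefs=False.

The name is the reference text without the '&#' and ';' delimiters, for example '1234' or 'x4D2'. It is not the decoded character.

def handle_charref(self, name):
    print(f"Found character reference to: {name}")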
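Whether handle_entityref() and handle_charref() fire at all depends on the convert_charrefs constructor argument, which defaults to True. The sketch below runs the same input through both settings; the RefLogger class and the run() helper are illustrative.

```python
from html.parser import HTMLParser


class RefLogger(HTMLParser):
    """Log data and character-reference callbacks."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.log = []

    def handle_data(self, data):
        self.log.append(("data", data))

    def handle_entityref(self, name):
        self.log.append(("entityref", name))

    def handle_charref(self, name):
        self.log.append(("charref", name))


def run(convert_charrefs):
    parser = RefLogger(convert_charrefs)
    parser.feed("<p>&copy; &#62;</p>")
    parser.close()
    return parser.log


converted = run(True)   # references arrive decoded inside handle_data()
raw = run(False)        # references arrive through the dedicated handlers

print(converted)  # [('data', '© >')]
print(raw)        # [('entityref', 'copy'), ('data', ' '), ('charref', '62')]
```

If you need the decoded character in the raw mode, html.unescape() from the standard library can convert a reference string such as "&copy;".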

handle_decl(data)

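Called for DOCTYPE declarations such as <!DOCTYPE html>. The XML declaration (<?xml ...?>) is a processing instruction and is passed to handle_pi() instead.

For example: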

def handle_decl(self, data):
    print(f"Declaration: {data}")
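The split between declarations and processing instructions can be checked with a small sketch. DeclLogger is an illustrative name; handle_decl() and handle_pi() are the html.parser callbacks.

```python
from html.parser import HTMLParser


class DeclLogger(HTMLParser):
    """Log DOCTYPE declarations and processing instructions."""

    def __init__(self):
        super().__init__()
        self.log = []

    def handle_decl(self, decl):
        self.log.append(("decl", decl))

    def handle_pi(self, data):
        self.log.append(("pi", data))


logger = DeclLogger()
logger.feed('<?xml version="1.0"?><!DOCTYPE html><p>hi</p>')
logger.close()
print(logger.log)
# [('pi', 'xml version="1.0"?'), ('decl', 'DOCTYPE html')]
```

Note that the trailing '?' of an XML-style processing instruction is included in the data passed to handle_pi().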

These cover all the major parsing events.

Extracting Data

Store extracted data in your parser subclass instance:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = MyParser()
parser.feed(some_html)

for link in parser.links:
    print(link)

You have full access to the extracted data after parsing completes.

Some ideas:

Extract all text with handle_data()

Get meta info like titles from tags

Build lists of links, images, etc.

Parse tables/lists into Python data structures
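As one example of the last idea, here is a minimal sketch that turns a simple table into a list of rows. It assumes a flat table with no nested tables, and the TableParser class and its attribute names are made up for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect <td>/<th> text into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


table_parser = TableParser()
table_parser.feed(
    "<table>"
    "<tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Apples</td><td>3</td></tr>"
    "</table>"
)
table_parser.close()
print(table_parser.rows)  # [['Name', 'Qty'], ['Apples', '3']]
```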

Parsing Attributes

Tag attributes are passed to handle_starttag() and handle_startendtag() as a list of (name, value) tuples:

def handle_starttag(self, tag, attrs):
    print(f"<{tag}>")

    # Each attribute is a (name, value) tuple; value is None for bare attributes
    for name, value in attrs:
        print(f" {name}={value}")

Iterating over the list preserves attribute order and duplicates.

To look up values by name, convert the list to a dictionary:

def handle_starttag(self, tag, attrs):
    attrs = dict(attrs)
    print(attrs.get('class'))  # None if the tag has no class attribute

Parsing Trees

HTMLParser is event-based: it does not build a tree itself. Instead it reports the document structure through these handlers:

handle_startendtag(tag, attrs) # Called for XHTML-style empty tags such as <br/>

handle_starttag(tag, attrs) # On opening tag

handle_endtag(tag) # On closing tag

The order of start and end events reflects the nesting of elements.

By tracking open elements on a stack, you can build your own tree while parsing or pull data from a specific position in the document.
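A tree can be assembled from start and end events with a stack of open elements. The following is a minimal sketch under simplifying assumptions: nodes are plain dictionaries, void elements such as br never stay open, and an end tag closes the nearest open element with the same name. TreeBuilder and VOID_TAGS are names chosen for this example.

```python
from html.parser import HTMLParser

# Elements that have no end tag in HTML
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Build a nested dict tree from parser events."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "attrs": {}, "children": []}
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self._stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags
        for index in range(len(self._stack) - 1, 0, -1):
            if self._stack[index]["tag"] == tag:
                del self._stack[index:]
                break

    def handle_data(self, data):
        if data.strip():
            self._stack[-1]["children"].append(data)


builder = TreeBuilder()
builder.feed('<div id="main"><p>Hello<br>world</p></div>')
builder.close()

div = builder.root["children"][0]
paragraph = div["children"][0]
print(div["tag"], div["attrs"])  # div {'id': 'main'}
print([c if isinstance(c, str) else c["tag"] for c in paragraph["children"]])
# ['Hello', 'br', 'world']
```

Real-world HTML has many implicit closing rules that this sketch ignores, which is where a library such as BeautifulSoup or lxml becomes worthwhile.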

Error Handling

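HTMLParser is lenient: since Python 3.5 it does not raise on malformed markup, and the HTMLParseError exception no longer exists. Feeding broken input simply produces whatever events the parser can recover:

parser.feed("<html><asdd<<</html")
parser.close()  # Process any remaining buffered input

The strict constructor argument was also removed in Python 3.5, so HTMLParser(strict=True) now raises TypeError:

parser = HTMLParser()  # Parsing is always tolerant

To detect bad markup, track state in your own handlers (for example unmatched end tags) or run the document through a validator.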
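In current Python versions the parser itself will not report malformed markup, so any checking has to live in the handlers. Here is a minimal sketch that flags mismatched end tags and lists elements left open. BalanceChecker, VOID_TAGS, and the message format are assumptions for this example, and the check is deliberately stricter than HTML's real rules about optional end tags.

```python
from html.parser import HTMLParser

# Elements that have no end tag in HTML
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BalanceChecker(HTMLParser):
    """Report end tags that do not match the most recently opened element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags such as <br/> never stay open

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            line, column = self.getpos()
            self.problems.append(f"{line}:{column} unexpected </{tag}>")


checker = BalanceChecker()
checker.feed("<div><p>text</div><br>")
checker.close()
print(checker.problems)   # one entry for the mismatched </div>
print(checker.open_tags)  # elements that were never closed
```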

Advanced Techniques

There are several more advanced techniques available as well:

Parser Subclasses

Create parser subclasses targeted for specific parsing goals:

class LinkParser(HTMLParser):
    # Custom logic to find <a> links
    ...

class ImageParser(HTMLParser):
    # Custom logic to find <img> tags
    ...

Reuse parsers without altering their logic.

Web Scraping

Import your parser into a web scraper to parse pages:

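import requests
from example import MyParser  # The link-collecting parser from above, saved as example.py

def scrape(url):
    page = requests.get(url, timeout=10)
    page.raise_for_status()  # Fail early on HTTP errors
    parser = MyParser()
    parser.feed(page.text)
    parser.close()
    return parser.links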

Brings together fetching and parsing logic.

Asynchronous Parsing

Import asyncio to fetch multiple pages concurrently and parse each one as it arrives:

import asyncio

async def parse(url):
    page = await fetch_page(url)  # fetch_page() is a placeholder for your async HTTP client
    parser = MyParser()
    parser.feed(page)  # Parsing itself is synchronous
    parser.close()
    return parser.links

async def main(urls):
    return await asyncio.gather(*(parse(url) for url in urls))

urls = ['url1', 'url2', ...]
data = asyncio.run(main(urls))

The speedup comes from overlapping network waits. Parsing still runs synchronously on the event loop.

XML Integration

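Serialize an XML document and feed the markup to the parser:

from xml.dom import minidom

xml_text = """<note>...</note>"""
dom = minidom.parseString(xml_text)

markup = dom.toprettyxml() # Serialize back to an XML string (this does not convert it to HTML)

parser.feed(markup) # Send to parser

This lets the HTML parser tokenize XML, with the limitation that tag and attribute names are lowercased.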

You can also use dedicated XML parsers like lxml for more complete XML support.

Parsing Tips

Here are some handy tips for using the html parser effectively:

Sanitize Input

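Parsing with html.parser does not execute scripts, but if you store or re-render untrusted markup, sanitize it first with a library such as bleach:

import bleach

dirty_html = get_tainted_input()  # Placeholder for wherever your untrusted input comes from
clean_html = bleach.clean(dirty_html)
parser.feed(clean_html)

Avoids security issues such as script injection when the markup is displayed later.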

Improve Performance

The parser is written in pure Python, so very large inputs can be slow. Some ways to reduce the cost:

Override only the handler methods you need

Feed large documents in chunks with parser.feed(data) instead of loading them whole

Run the parser in a separate process for CPU-bound workloads (threads are limited by the GIL)
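Feeding in chunks works because the parser buffers incomplete input between feed() calls, so a chunk boundary may fall in the middle of a tag. A minimal sketch, using deliberately tiny chunks to show this; TextCollector and parse_in_chunks() are illustrative names, and in practice the chunks would come from a file or network stream.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect all text passed to handle_data()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_in_chunks(markup, chunk_size=16):
    """Feed markup to the parser a slice at a time and return the text."""
    parser = TextCollector()
    for start in range(0, len(markup), chunk_size):
        parser.feed(markup[start:start + chunk_size])
    parser.close()  # flush anything still buffered
    return "".join(parser.chunks)


text = parse_in_chunks("<p>Hello <b>chunked</b> world</p>", chunk_size=5)
print(text)  # Hello chunked world
```

A single text run may reach handle_data() in several pieces, so join the pieces rather than assuming one call per text node.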

Choose Encoding

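HTMLParser has no encoding argument: feed() expects str, so decode bytes before parsing:

parser = HTMLParser()
parser.feed(raw_bytes.decode('utf-8'))

Use the encoding declared by the HTTP Content-Type header or the document's <meta charset> tag when it is available.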
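Since the parser only accepts str, one way to deal with uncertain encodings is a small decoding helper that runs before feed(). The sketch below tries a declared encoding first, then UTF-8, and finally falls back to UTF-8 with replacement characters. The decode_markup() function and its fallback order are assumptions for illustration, not part of html.parser.

```python
def decode_markup(raw, declared_encoding=None):
    """Decode bytes to str, preferring the declared encoding when it works."""
    for encoding in (declared_encoding, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: try the next candidate
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace")


utf8_text = decode_markup("<p>café</p>".encode("utf-8"))
latin1_text = decode_markup("<p>café</p>".encode("latin-1"), "latin-1")
fallback_text = decode_markup(b"\xff", "no-such-codec")
```

The returned string can then be passed to parser.feed() as usual.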

Debug Errors

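HTMLParser does not raise on malformed markup, so debug by logging events together with their position from getpos():

class DebugParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        line, column = self.getpos()
        print(f"{line}:{column} <{tag}>")

Unexpected or missing events usually indicate malformed input documents.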

Validate Documents

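html.parser does not validate markup. To check a document before parsing, run it through an external validator such as the Nu Html Checker, which the html5validator package wraps:

# is_valid_html() is a placeholder for a wrapper around the validator you choose
if is_valid_html(document):
    parser.feed(document)

Can help narrow down errors.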

Use Cases

Some examples of common use cases:

Web Scraping

Harvesting data from websites:

import requests

class ScraperParser(HTMLParser):
    def __init__(self):
        super().__init__()  # Required: sets up the parser's internal state
        self.items = []

    def handle_data(self, data):
        self.items.append(data)

parser = ScraperParser()
parser.feed(requests.get("https://example.com").text)

print(parser.items)

RSS/Atom Feeds

Parse syndicated feed content:

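from urllib import request

with request.urlopen("https://example.com/feed") as feed:
    feed_text = feed.read().decode("utf-8")  # feed() needs str, not bytes

parser = FeedParser()  # FeedParser: your own HTMLParser subclass that collects entries in .items
parser.feed(feed_text)

print(f"Most recent item: {parser.items[0]}")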

Email Parsing

Extract data from HTML email content:

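import email
from email import policy

# raw_bytes: the RFC822 message bytes, e.g. from imaplib.IMAP4.fetch(message_id, "(RFC822)")
msg = email.message_from_bytes(raw_bytes, policy=policy.default)
html_part = msg.get_body(preferencelist=("html",))

if html_part is not None:
    parser = EmailParser()  # EmailParser: your own HTMLParser subclass with a get_links() method
    parser.feed(html_part.get_content())
    print(parser.get_links())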

Static Site Generators

Use parsed HTML to produce static sites:

class SiteParser(HTMLParser):
    def __init__(self):
        super().__init__()  # Required before feed() can be called
        self.pages = []

    def handle_data(self, data):
        self.pages.append(data)

parser = SiteParser()
parser.feed(template_html)

for page in parser.pages:
    with open(f"{page}.html", "w") as f:
        f.write(render(page))  # render() is a placeholder for your template function

Automates site generation without a dynamic backend.

HTML Processing

Manipulate and process HTML documents:

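from html import escape

class Process(HTMLParser):
    def __init__(self):
        super().__init__()
        self.output = []  # Re-serialized pieces of the document

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            attrs.append(("disabled", "disabled"))
        attr_text = "".join(f' {k}="{escape(v or "")}"' for k, v in attrs)
        self.output.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        self.output.append(f"</{tag}>")

    def handle_data(self, data):
        self.output.append(escape(data, quote=False))

    def get_html(self):
        return "".join(self.output)

parser = Process()
parser.feed(html)
processed_html = parser.get_html()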

Modify, sanitize, or transform HTML programmatically.

Test HTML Output

Verify HTML generation:

expected_html = """
<html>
<body>
  Hello world!
</body>
</html>
"""

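# MyHtmlGenerator and TestParser are placeholders: the generator under test and
# an HTMLParser subclass that stores the text inside <body> in .body
generator = MyHtmlGenerator()
parser = TestParser()
parser.feed(generator.output())

assert parser.body.strip() == "Hello world!"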

Confirm generated HTML matches expectations.
