The Complete Python HTML Parser Cheatsheet

Jan 9, 2024 · 8 min read

Python's built-in html.parser module lets you parse HTML and XHTML documents and extract data from them. This cheatsheet covers the parts of the module you are most likely to need.

Getting Started

Import the HTMLParser class:

from html.parser import HTMLParser

Create a parser class inheriting from HTMLParser:

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(f"Encountered a start tag: {tag}")

    def handle_data(self, data):
        print(f"Encountered some data: {data}")

parser = MyHTMLParser()

Feed some HTML to the parser:

html = """<html><head><title>Test Parser</title></head><body><h1>Hello World!</h1></body></html>"""

parser.feed(html)

The parser will call methods to handle tags and data during parsing.
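
To make the call order concrete, here is a minimal sketch that records each callback in a list instead of printing it. The EventRecorder class and its events attribute are illustrative names, not part of html.parser.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Collects (event, value) pairs in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed("<h1>Hello World!</h1>")
recorder.close()  # flush any input still buffered

for event in recorder.events:
    print(event)
```

Storing events rather than printing them makes the parser easier to test and reuse.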

Parsing HTML/XML

The parser is designed for HTML and XHTML. It will also tokenize simple, well-formed XML such as the document below, although it is not a conforming XML parser:

<note>
 <to>George</to>
 <from>John</from>
 <heading>Reminder</heading>
 <body>Don't forget the meeting!</body>
</note>

Feeding this document to the parser works the same way as feeding HTML: the same handler methods are called. For anything beyond simple XML, use a dedicated XML parser.
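
One caveat when feeding XML: the parser reports tag names in lowercase, which is harmless for HTML but lossy for case-sensitive XML vocabularies. A small sketch, where TagCollector and the mixed-case sample document are made up for illustration:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records start-tag names exactly as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


xml_doc = "<Note><To>George</To><From>John</From></Note>"

collector = TagCollector()
collector.feed(xml_doc)
collector.close()

# The original capitalization (Note, To, From) is not preserved.
print(collector.tags)
```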

Parsing Strategies

There are several approaches and packages available for parsing HTML and XML in Python:

Built-in HTML Parser

  • Python's built-in parser from html.parser

BeautifulSoup

  • Extremely popular 3rd party package
  • More features for complex parsing
  • Additional dependencies

Regular Expressions

  • Can parse simple markup with regex patterns

XML Parsers

  • Dedicated XML parsing packages like lxml

In many cases Python's built-in HTML parser is your best choice for basic to intermediate parsing needs.

    Parsing Document Fragments

    You don't need to feed the parser a full HTML document. It works on any document fragment:

    fragment = """<div><p>This is a <b>fragment</b></p><p>Of <i>HTML</i> without metadata</p></div>"""
    
    parser.feed(fragment)
    

    Useful for parsing HTML snippets from larger documents or templates.
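
Fragments can also arrive in pieces. The sketch below feeds one fragment in two chunks, split in the middle of a tag on purpose, to show that the parser buffers incomplete markup between feed() calls. TextAndTags is an illustrative name.

```python
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Collects start-tag names and text chunks."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


parser = TextAndTags()
# The first chunk ends inside the <b> tag; the parser waits for the rest.
for chunk in ["<div><p>This is a <b", ">fragment</b></p></div>"]:
    parser.feed(chunk)
parser.close()  # tell the parser no more input is coming

print(parser.tags)
print("".join(parser.text))
```

Text may be delivered to handle_data() in more than one piece, so join the chunks rather than relying on a single call.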

    Asynchronous Parsing

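    Parsing is synchronous, but you can run it in a worker thread from asyncio code so that it does not block the event loop:

    import asyncio

    async def parse_async(html):
        parser = MyHTMLParser()
        loop = asyncio.get_running_loop()
        # Run the blocking feed() call in the default thread pool
        await loop.run_in_executor(None, parser.feed, html)
        parser.close()
        return parser

    # From synchronous code; inside a coroutine, use "await parse_async(...)"
    parser = asyncio.run(parse_async(some_html))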
    

    Helpful for parsing multiple pages concurrently without blocking.

    Parsing Methods

    These are the main parsing methods you can override in a subclass:

    handle_starttag(tag, attrs)

    Called for each starting tag:

    def handle_starttag(self, tag, attrs):
        print(f"Encountered start tag: {tag}")
        attrs_str = "".join([f' {name}="{value}"' for name, value in attrs])
        print(f"<{tag}{attrs_str}>")
    

    Attributes are passed in as a list of (name, value) tuples.

    handle_endtag(tag)

    Called for each ending tag:

    def handle_endtag(self, tag):
        print(f"Encountered end tag: {tag}")
    

    handle_data(data)

    Called for text blocks between tags:

    def handle_data(self, data):
        print(f"Encountered data: {data}")
    

    Useful for extracting text content.
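
A minimal sketch of text extraction, assuming you want visible text only: it collects handle_data() chunks and skips anything inside script and style elements. TextExtractor and its attributes are illustrative names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text while ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


extractor = TextExtractor()
extractor.feed("<p>Hello</p><script>var x = 1;</script><p>World</p>")
extractor.close()

print(" ".join(extractor.parts))
```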

    handle_comment(data)

    Called for HTML comments:

    def handle_comment(self, data):
        print(f"Comment: {data}")
    

    handle_entityref(name)

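    Called for entity references like &amp; and &copy;, but only when the parser is created with convert_charrefs=False. With the default convert_charrefs=True, references are decoded and delivered through handle_data() instead.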

    The name does not include the '&' or ';' delimiters.

    def handle_entityref(self, name):
        print(f"Found entity ref: {name}")
    

    handle_charref(name)

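    Called for numeric character references like &#1234; or &#x4D2;, again only when convert_charrefs=False.

    The name is the text between '&#' and ';' (for example '1234' or 'x4D2'), not the decoded character.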

    def handle_charref(self, name):
       print(f"Found character reference to: {name}")
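
Both reference handlers are only invoked when the parser is created with convert_charrefs=False; since Python 3.5 the default is True, and references show up already decoded in handle_data(). The sketch below turns conversion off and decodes each reference with html.unescape(). RefParser and its refs list are illustrative names.

```python
from html import unescape
from html.parser import HTMLParser


class RefParser(HTMLParser):
    """Records entity and character references together with their decoded form."""

    def __init__(self):
        # Without convert_charrefs=False the two handlers below are never called.
        super().__init__(convert_charrefs=False)
        self.refs = []

    def handle_entityref(self, name):
        self.refs.append(("entity", name, unescape(f"&{name};")))

    def handle_charref(self, name):
        # name is "62" for &#62; and "x3E" for &#x3E;
        self.refs.append(("char", name, unescape(f"&#{name};")))


parser = RefParser()
parser.feed("<p>&copy; 2024 &#62; &#x3E;</p>")
parser.close()

for kind, name, char in parser.refs:
    print(kind, name, char)
```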
    

    handle_decl(data)

    Called for DOCTYPE declarations. The XML declaration (<?xml ...?>) is a processing instruction and is passed to handle_pi() instead.

    For example:

    def handle_decl(self, data):
        print(f"Declaration: {data}")
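
To see which handler receives what, this sketch feeds a document that starts with an XML declaration followed by a DOCTYPE. DeclParser and its two lists are illustrative names.

```python
from html.parser import HTMLParser


class DeclParser(HTMLParser):
    """Separates DOCTYPE declarations from processing instructions."""

    def __init__(self):
        super().__init__()
        self.decls = []
        self.instructions = []

    def handle_decl(self, decl):
        self.decls.append(decl)

    def handle_pi(self, data):
        self.instructions.append(data)


parser = DeclParser()
parser.feed('<?xml version="1.0"?><!DOCTYPE html><p>Hi</p>')
parser.close()

print(parser.decls)         # text between "<!" and ">"
print(parser.instructions)  # text after "<?", up to the closing ">"
```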
    

    These cover the main parsing events. The remaining handlers are handle_startendtag(), handle_pi(), and unknown_decl().

    Extracting Data

    Store extracted data in your parser subclass instance:

    from html.parser import HTMLParser
    
    class MyParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
    
        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href':
                        self.links.append(value)
    
    parser = MyParser()
    parser.feed(some_html)
    
    for link in parser.links:
        print(link)
    

    You have full access to the extracted data after parsing completes.

    Some ideas:

  • Extract all text with handle_data()
  • Get meta info like titles from tags
  • Build lists of links, images, etc.
  • Parse tables/lists into Python data structures
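
As a sketch of the last idea, the parser below turns table markup into a list of rows. It assumes simple, non-nested tables; TableParser and its attributes are illustrative names.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Turns <tr>/<td>/<th> markup into a list of rows (lists of strings)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


table_html = """
<table>
  <tr><th>Name</th><th>Role</th></tr>
  <tr><td>George</td><td>Recipient</td></tr>
</table>
"""

parser = TableParser()
parser.feed(table_html)
parser.close()

print(parser.rows)
```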

    Parsing Attributes

    Tag attributes are passed as a list of (name, value) tuples to start tag methods:

    def handle_starttag(self, tag, attrs):
        print(f"<{tag}>")

        # Each attribute is a (name, value) tuple; value is None for bare attributes
        for name, value in attrs:
            print(f" {name}={value}")


    Attribute names are lowercased, and quotes and entity references in values are already resolved.

    You can also convert the list to a dictionary for lookup by name:

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        print(attrs.get('class'))  # Print class attr, or None if it is absent
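
Two details are worth handling when reading attributes: an attribute may be missing entirely, and a bare attribute such as disabled is present with the value None. A minimal sketch, with AttrParser and the sample markup made up for illustration:

```python
from html.parser import HTMLParser


class AttrParser(HTMLParser):
    """Summarizes selected attributes of each <input> tag."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attr_map = dict(attrs)
            self.inputs.append({
                "name": attr_map.get("name"),
                "css_class": attr_map.get("class", ""),
                # Test for presence, because the value of a bare attribute is None.
                "disabled": "disabled" in attr_map,
            })


parser = AttrParser()
parser.feed('<input name="q" disabled><input NAME="page" class="small">')
parser.close()

print(parser.inputs)
```

Using dict.get() avoids a KeyError on tags that lack the attribute.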
    

    Parsing Trees

    The parser does not build a tree itself. It is event-driven: it reports tags in document order, and your subclass can assemble a tree from those events.

    The handlers involved are:

    handle_startendtag(tag, attrs) # Called for empty tags
    
    handle_starttag(tag, attrs) # On opening tag
    
    handle_endtag(tag) # On closing tag
    

    The order of start and end calls reflects the nesting of the document.

    By tracking open elements on a stack, you can build a tree while parsing or pull data out based on where it sits in the document.
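
A minimal sketch of that idea: keep a stack of open elements and attach each new node to the element on top. The node layout (plain dicts), the TreeBuilder name, and the partial VOID_TAGS set are assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements that never have an end tag in HTML (partial list for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuilder(HTMLParser):
    """Builds nested dicts of the form {"tag", "attrs", "children"}."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "attrs": {}, "children": []}
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self._stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this name, if any.
        for index in range(len(self._stack) - 1, 0, -1):
            if self._stack[index]["tag"] == tag:
                del self._stack[index:]
                break

    def handle_data(self, data):
        if data.strip():
            self._stack[-1]["children"].append(data.strip())


builder = TreeBuilder()
builder.feed("<div><p>Hello <b>world</b><br></p></div>")
builder.close()

div = builder.root["children"][0]
paragraph = div["children"][0]
print(div["tag"], paragraph["tag"], paragraph["children"])
```

Unmatched end tags are ignored here, which keeps the sketch tolerant of sloppy markup.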

    Error Handling

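    The parser is lenient: it recovers from malformed markup instead of raising. The HTMLParseError exception was removed in Python 3.5, so there is nothing to catch:

    parser.feed("<html><asdd<<</html")
    parser.close()  # Process any remaining buffered input


    The strict argument was removed in Python 3.5 as well, so HTMLParser(strict=True) now raises TypeError. The only constructor option is convert_charrefs:

    parser = HTMLParser(convert_charrefs=True)


    To detect malformed input, track the tags you see in your own subclass or use a validator.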
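
In current Python versions html.parser stays silent about malformed markup, so any checking has to live in your subclass. The sketch below reports stray end tags and elements left unclosed; NestingChecker, its message format, and the partial VOID_TAGS set are assumptions made for this example.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Reports stray end tags and elements that were never closed."""

    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags such as <br/> open nothing

    def handle_endtag(self, tag):
        line, column = self.getpos()
        if tag not in self.open_tags:
            self.problems.append(f"stray </{tag}> at line {line}, column {column}")
            return
        # Everything opened after `tag` was left unclosed.
        while self.open_tags[-1] != tag:
            unclosed = self.open_tags.pop()
            self.problems.append(f"<{unclosed}> not closed before </{tag}>")
        self.open_tags.pop()


checker = NestingChecker()
checker.feed("<div><p>text</div></span>")
checker.close()

print(checker.problems)
print(checker.open_tags)  # anything left here was never closed
```

HTML allows some end tags to be omitted (for example on p and li), so treat these reports as hints rather than hard errors.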

    Advanced Techniques

    There are several more advanced techniques available as well:

    Parser Subclasses

    Create parser subclasses targeted for specific parsing goals:

    class LinkParser(HTMLParser):
        """Finds <a> links."""
        # Custom logic to find <a> links goes here

    class ImageParser(HTMLParser):
        """Finds <img> tags."""
        # Custom logic to find <img> tags goes here
    

    Reuse parsers without altering their logic.

    Web Scraping

    Import your parser into a web scraper to parse pages:

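    import requests
    from example import MyParser  # assumes MyParser (above) is saved in example.py

    def scrape(url):
        page = requests.get(url, timeout=10)
        page.raise_for_status()  # Fail early on HTTP errors
        parser = MyParser()
        parser.feed(page.text)
        parser.close()
        return parser.links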
    

    Brings together fetching and parsing logic.
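
Scraped href values are often relative, so a scraper usually needs to resolve them against the page URL. A minimal sketch using urllib.parse.urljoin; AbsoluteLinkParser, the sample markup, and the example.com URLs are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AbsoluteLinkParser(HTMLParser):
    """Collects href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


page = '<a href="/docs">Docs</a> <a href="faq.html">FAQ</a> <a>No link</a>'

parser = AbsoluteLinkParser("https://example.com/help/index.html")
parser.feed(page)
parser.close()

print(parser.links)
```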

    Asynchronous Parsing

    Use asyncio to fetch multiple pages concurrently and parse each one as it arrives:

    import asyncio

    async def parse(url):
        page = await fetch_page(url)  # fetch_page is a placeholder for your async HTTP client
        parser = MyParser()
        parser.feed(page)
        parser.close()
        return parser.links

    async def main(urls):
        return await asyncio.gather(*(parse(url) for url in urls))

    urls = ['url1', 'url2', ...]
    data = asyncio.run(main(urls))


    The fetches overlap while waiting on the network; the parsing itself still runs one page at a time.

    XML Integration

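    Parse the XML with an XML parser, serialize it back to text, and feed that text to the HTML parser:

    from xml.dom import minidom

    xml_text = """<note>...</note>"""
    dom = minidom.parseString(xml_text)

    markup = dom.toprettyxml()  # Serialize to an XML string (not a conversion to HTML)

    parser.feed(markup)  # Send to parser


    This lets you reuse your HTMLParser handlers on XML input that has first been checked for well-formedness.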

    You can also use dedicated XML parsers like lxml for more complete XML support.
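
If you would rather avoid a third-party dependency, the standard library's xml.etree.ElementTree is another dedicated XML parser (named here as an alternative to lxml, not a replacement for it). A short sketch that reads the note document from earlier into a dictionary:

```python
import xml.etree.ElementTree as ET

note_xml = """<note>
 <to>George</to>
 <from>John</from>
 <heading>Reminder</heading>
 <body>Don't forget the meeting!</body>
</note>"""

root = ET.fromstring(note_xml)

# Map each child element's tag to its text content.
note = {child.tag: child.text for child in root}

print(root.tag)
print(note["to"], "<-", note["from"])
print(note["body"])
```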

    Parsing Tips

    Here are some handy tips for using the html parser effectively:

    Sanitize Input

    Use a library to sanitize input before parsing:

    import bleach
    
    dirty_html = get_tainted_input()
    clean_html = bleach.clean(dirty_html)
    parser.feed(clean_html)
    

    Reduces the risk of passing malicious markup on to whatever consumes the parsed output.

    Improve Performance

    The parser is quite fast, but you can optimize further:

  • Remove unnecessary handler methods
  • Feed large inputs in chunks with repeated parser.feed(data) calls
  • Run the parser in a separate thread or process

    Choose Encoding

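    HTMLParser works on str, not bytes, and has no encoding argument. Decode the input yourself before feeding it:

    parser = HTMLParser()
    parser.feed(raw_bytes.decode('utf-8'))  # raw_bytes: the page content as bytes


    Decoding with the page's declared encoding avoids mismatches.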
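
Since the parser consumes str, the encoding question is settled before feed() is called. The sketch below tries a declared encoding first, then UTF-8, and finally falls back to replacing undecodable bytes. decode_html and TextCollector are hypothetical helpers written for this example.

```python
from html.parser import HTMLParser


def decode_html(raw, declared_encoding=None):
    """Decode raw bytes, trying the declared encoding first, then UTF-8."""
    for encoding in (declared_encoding, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            continue
    # Last resort: keep going and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


class TextCollector(HTMLParser):
    """Collects all text chunks."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        self.text.append(data)


raw_page = "<p>Caf\u00e9</p>".encode("latin-1")

parser = TextCollector()
parser.feed(decode_html(raw_page, declared_encoding="latin-1"))
parser.close()

print("".join(parser.text))
```

The declared encoding typically comes from the HTTP Content-Type header or a meta charset tag.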

    Debug Errors

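    The parser does not raise on malformed input (HTMLParseError was removed in Python 3.5). To find where things go wrong, log positions with getpos():

    class DebugParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            line, column = self.getpos()
            print(f"<{tag}> at line {line}, column {column}")

    parser = DebugParser()
    parser.feed(bad_html)


    Unexpected tag order or positions usually indicate malformed input documents.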

    Validate Documents

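    html.parser does not validate markup. To check a document first, run it through a dedicated validator, for example the html5validator package, which wraps the Nu Html Checker:

    # is_valid_html is a hypothetical helper that wraps your validator of choice
    is_valid = is_valid_html(document)
    if is_valid:
        parser.feed(document)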
    

    Can help narrow down errors.

    Use Cases

    Some examples of common use cases:

    Web Scraping

    Harvesting data from websites:

    import requests

    class ScraperParser(HTMLParser):
        def __init__(self):
            super().__init__()  # Required: sets up the parser's internal state
            self.items = []

        def handle_data(self, data):
            self.items.append(data)

    parser = ScraperParser()
    parser.feed(requests.get("https://example.com").text)

    print(parser.items)
    

    RSS/Atom Feeds

    Parse syndicated feed content:

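    from urllib import request

    with request.urlopen("https://example.com/feed") as response:
        xml_text = response.read().decode("utf-8")  # feed() expects str; UTF-8 is assumed here

    parser = FeedParser()  # FeedParser: your own HTMLParser subclass that fills .items
    parser.feed(xml_text)

    print(f"Most recent item: {parser.items[0]}")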
    

    Email Parsing

    Extract data from HTML email content:

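    import email
    from email import policy

    # raw_bytes: the RFC822 message bytes, for example from imaplib's IMAP4.fetch()
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    html_part = msg.get_body(preferencelist=("html",))

    if html_part is not None:
        parser = EmailParser()  # EmailParser: your own HTMLParser subclass
        parser.feed(html_part.get_content())
        print(parser.get_links())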
    

    Static Site Generators

    Use parsed HTML to produce static sites:

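    class SiteParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.pages = []

        def handle_data(self, data):
            if data.strip():  # Skip whitespace-only text
                self.pages.append(data.strip())

    parser = SiteParser()
    parser.feed(template_html)

    for page in parser.pages:
        with open(f"{page}.html", "w") as f:
            f.write(render(page))  # render() is a placeholder for your template function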
    

    Automates site generation without a dynamic backend.

    HTML Processing

    Manipulate and process HTML documents:

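    class Process(HTMLParser):
        def __init__(self):
            super().__init__()
            self.output = []  # Rebuilt markup (simplified: comments and escaping are not preserved)

        def handle_starttag(self, tag, attrs):
            if tag == "button":
                attrs.append(("disabled", "True"))
            attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
            self.output.append(f"<{tag}{attr_text}>")

        def handle_endtag(self, tag):
            self.output.append(f"</{tag}>")

        def handle_data(self, data):
            self.output.append(data)

        def get_html(self):
            return "".join(self.output)

    parser = Process()
    parser.feed(html)
    processed_html = parser.get_html()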
    

    Modify, sanitize, or transform HTML programmatically.

    Test HTML Output

    Verify HTML generation:

    expected_html = """
    <html>
    <body>
      Hello world!
    </body>
    </html>
    """
    
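    generator = MyHtmlGenerator()  # placeholder for the class under test
    parser = TestParser()  # placeholder HTMLParser subclass that stores the body text in .body
    parser.feed(generator.output())

    assert parser.body == "Hello world!"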
    

    Confirm generated HTML matches expectations.
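
One way the test parser above could look, as a sketch: it captures the text inside the body element and exposes it with whitespace trimmed. BodyTextParser is an illustrative name, and the generated markup is inlined here in place of a real generator.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Captures the text inside <body> with surrounding whitespace removed."""

    def __init__(self):
        super().__init__()
        self._in_body = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_body:
            self._chunks.append(data)

    @property
    def body(self):
        return "".join(self._chunks).strip()


generated_html = """
<html>
<body>
  Hello world!
</body>
</html>
"""

parser = BodyTextParser()
parser.feed(generated_html)
parser.close()

assert parser.body == "Hello world!"
```

Comparing extracted text rather than raw strings keeps the test insensitive to indentation and line breaks in the generated markup.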
