Is BeautifulSoup lxml or HTML?

Feb 5, 2024 · 2 min read

BeautifulSoup is one of the most popular Python libraries for parsing HTML and XML documents. But there is often confusion around whether BeautifulSoup itself parses the documents, or whether it uses other parsers like lxml and html.parser under the hood.

BeautifulSoup Doesn't Actually Parse Documents

The key thing to understand is that BeautifulSoup provides a nice API for navigating and searching an HTML/XML document, but it doesn't contain a parser itself. It relies on a separate parser to turn the raw document text into a tree of tags, and then wraps that tree in its own objects.

The most common parsers BeautifulSoup can use are:

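  • lxml - An extremely fast C-based parser. Recommended for production environments. Must be installed separately.
  • html.parser - Python's built-in HTML parser. Needs no extra install and offers decent performance.
  • html5lib - Much slower, but it parses pages the way a web browser does, so it handles badly formatted markup predictably. Must be installed separately.

If you don't name a parser, BeautifulSoup picks the best one installed on your system (lxml first, then html5lib, then html.parser) and warns you that it had to guess.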

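    from bs4 import BeautifulSoup

    my_html_doc = "<p>Hello, <b>world</b></p>"  # sample markup for illustration
    soup = BeautifulSoup(my_html_doc)  # no parser named: picks the best one installed (and warns)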

You can also explicitly state which parser you want it to use:

    soup = BeautifulSoup(my_html_doc, 'lxml') 
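
Naming a parser that is not installed raises an error, so code that has to run on machines you don't control may want a fallback. The following is a minimal sketch of one way to do that; the helper name `make_soup` and the preference order are illustrative assumptions, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse `markup` with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) pair so the caller knows which parser ran.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, parser_used = make_soup("<p>Hello, <b>world</b></p>")
print(parser_used, soup.b.get_text())
```

Because html.parser ships with Python, keeping it last in the list means the helper should always find something to use.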

Practical Implications

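The main thing this means in practice is that the parser you choose decides how bad HTML gets repaired. lxml and html.parser are both fairly tolerant, but each can build a different tree from the same invalid markup. If you need the result to match what a web browser would produce, you may need to explicitly use html5lib, which follows the HTML5 error-recovery rules.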
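
To see how much the choice of parser matters for invalid markup, you can feed the same broken snippet to each parser and compare the trees. This sketch uses `<a></p>`, a small example of a stray closing tag; parsers that are not installed are skipped rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # a stray closing tag: invalid HTML

results = {}
for name in ("html.parser", "lxml", "html5lib"):
    try:
        # str(soup) serializes the tree the parser actually built.
        results[name] = str(BeautifulSoup(broken, name))
    except FeatureNotFound:
        results[name] = None  # this parser is not installed

for name, tree in results.items():
    print(f"{name:12} {tree if tree is not None else '(not installed)'}")
```

If the outputs differ on your machine, that difference is the reason to name the parser explicitly in any script you share.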

It also means that if you're parsing large or numerous documents, installing lxml and passing 'lxml' explicitly can provide a nice performance boost over the built-in html.parser.
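
Rather than take the speed difference on faith, you can measure it on your own machine. The sketch below times each installed parser on a synthetic document; the markup, the repeat count, and the helper name `time_parser` are arbitrary choices for illustration, and real pages may behave differently.

```python
import timeit

from bs4 import BeautifulSoup, FeatureNotFound

# A synthetic document: 500 small repeated blocks.
markup = "<html><body>" + "<div><p>row <b>x</b></p></div>" * 500 + "</body></html>"


def time_parser(name, repeat=3):
    """Return the best-of-`repeat` parse time in seconds, or None if not installed."""
    try:
        BeautifulSoup("<p></p>", name)  # cheap probe for availability
    except FeatureNotFound:
        return None
    return min(timeit.repeat(lambda: BeautifulSoup(markup, name), number=1, repeat=repeat))


timings = {name: time_parser(name) for name in ("html.parser", "lxml", "html5lib")}
for name, seconds in timings.items():
    print(f"{name:12}", "not installed" if seconds is None else f"{seconds:.4f}s")
```

Taking the minimum of several runs reduces noise from whatever else the machine is doing at the time.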

BeautifulSoup is really just an interface to other parsers. It provides great methods and Pythonic idioms for navigating, searching, and modifying parsed document trees. But it doesn't contain an HTML parser itself; it offloads that work to other specialized parsers.
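
As a short illustration of the part BeautifulSoup does handle, here is a sketch of navigating, searching, and modifying a tree once a parser has built it. The sample HTML and variable names are made up for the example, and html.parser is used only so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

html = """
<ul id="parsers">
  <li class="fast"><a href="/lxml">lxml</a></li>
  <li><a href="/html5lib">html5lib</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Navigating: attribute access walks down to the first matching tag.
first_link = soup.ul.li.a

# Searching: find_all returns every matching tag in document order.
hrefs = [a["href"] for a in soup.find_all("a")]
fast_item = soup.find("li", class_="fast").get_text(strip=True)

# Modifying: change an attribute and append a new element.
first_link["href"] = "/docs/lxml"
new_item = soup.new_tag("li")
new_item.string = "html.parser"
soup.ul.append(new_item)

print(hrefs, fast_item, len(soup.find_all("li")))
```

None of these calls depend on which parser produced the tree, which is what lets you swap parsers without rewriting the rest of your code.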
