Is BeautifulSoup lxml or HTML?

BeautifulSoup is one of the most popular Python libraries for parsing HTML and XML documents. But there is often confusion around whether BeautifulSoup itself parses the documents, or whether it uses other parsers like lxml and html.parser under the hood.

BeautifulSoup Doesn't Actually Parse Documents

The key thing to understand is that BeautifulSoup provides a nice API for navigating and searching an HTML/XML document, but it doesn't contain a parser itself. It hands the raw markup to a separate parser, which builds the tree of tags and text that BeautifulSoup then lets you work with.

The most common parsers BeautifulSoup can use are:

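lxml - An extremely fast parser built on the C library libxml2. It is a separate install and is the usual recommendation when speed matters.

html.parser - Python's built-in HTML parser. It needs no extra install and has decent performance.

html5lib - A pure-Python parser that is slow but extremely lenient. It builds the tree the way a web browser would, so it handles badly formatted markup well.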

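If you don't name a parser, BeautifulSoup picks the best one installed on your system: lxml if it is available, then html5lib, then html.parser. It also emits a warning when it has to guess, because the same code can then produce a different tree on a machine with different parsers installed.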

from bs4 import BeautifulSoup

my_html_doc = "<p>Hello, <b>world</b></p>"  # sample markup for illustration

soup = BeautifulSoup(my_html_doc)  # no parser named: bs4 picks one and warns
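
To see which parser was actually picked on a given machine, one option is to look at the tree builder attached to the soup object. The sketch below is a minimal illustration; the helper name detect_default_parser is hypothetical, and the result depends on which parser libraries happen to be installed.

```python
import warnings

from bs4 import BeautifulSoup


def detect_default_parser(markup: str) -> str:
    """Return the name of the tree builder bs4 picks when no parser is named."""
    with warnings.catch_warnings():
        # bs4 warns when it has to guess the parser; silence that here.
        warnings.simplefilter("ignore")
        soup = BeautifulSoup(markup)
    # Every soup keeps a reference to the tree builder that parsed it.
    return soup.builder.NAME


chosen = detect_default_parser("<p>Hello</p>")
print(chosen)  # depends on what is installed on this machine
```

Running this on two machines with different parsers installed is a quick way to confirm that the default is not fixed.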

You can also explicitly state which parser you want it to use:

soup = BeautifulSoup(my_html_doc, 'lxml')
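
Naming a parser that is not installed raises an error, so code that has to run in several environments may want an explicit fallback order. The following is a sketch under that assumption; make_soup and its preferred parameter are hypothetical names, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>hi</p>")
print(soup.builder.NAME, soup.p.get_text())
```

Because the order is stated in the code, the choice of parser is visible to anyone reading it and no guessing warning is triggered.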

Practical Implications

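The main thing this means in practice is that the parser you choose determines how bad HTML gets repaired. All three parsers tolerate invalid markup to some degree, but they fix it differently, so the same broken document can produce different trees. If you need the tree a web browser would build, explicitly use html5lib, since lxml and html.parser may drop or rearrange tags in other ways.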
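
A quick way to check how each parser treats a specific piece of broken markup is to run it through all of them and compare the output. The sketch below assumes a deliberately invalid snippet (an unclosed tag followed by a stray closing tag); parse_with_each is a hypothetical helper, and parsers that are not installed are reported rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BROKEN = "<a></p>"  # unclosed <a> followed by a stray </p>


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: repaired markup, or None if not installed}."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            results[name] = None
    return results


results = parse_with_each(BROKEN)
for name, html in results.items():
    print(f"{name:12} {html if html is not None else '(not installed)'}")
```

Comparing the printed lines shows which tags each parser kept, added, or dropped for that input.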

It also means that if you're doing heavy parsing, installing and using lxml can provide a nice performance boost over the built-in html.parser.
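
How large the difference is depends on the documents and the machine, so it is worth measuring on your own data. The following is a rough sketch using the standard timeit module; SAMPLE is synthetic markup made up for illustration and time_parsers is a hypothetical helper.

```python
import timeit

from bs4 import BeautifulSoup, FeatureNotFound

# Synthetic document, repeated to give the parsers some work to do.
SAMPLE = "<html><body>" + "<div><p>row <b>cell</b></p></div>" * 500 + "</body></html>"


def time_parsers(markup, parsers=("html.parser", "lxml", "html5lib"), repeats=3):
    """Return {parser name: best time in seconds for one parse}."""
    timings = {}
    for name in parsers:
        try:
            BeautifulSoup(markup, name)  # skip parsers that are not installed
        except FeatureNotFound:
            continue
        runs = timeit.repeat(lambda: BeautifulSoup(markup, name), number=1, repeat=repeats)
        timings[name] = min(runs)
    return timings


timings = time_parsers(SAMPLE)
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:12} {seconds:.4f} s")
```

Swapping SAMPLE for a real page you scrape gives a more representative comparison than the synthetic markup.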

BeautifulSoup is really just an interface to other parsers. It provides great methods and Pythonic idioms for navigating, searching, and modifying parsed document trees. But it doesn't contain an HTML parser itself - it offloads that work to other specialized parsers.
