Formatting HTML with BeautifulSoup's prettify()

Oct 6, 2023 · 2 min read

When parsing HTML using BeautifulSoup in Python, the prettify() method is handy for formatting and printing the HTML in a more readable way.

What prettify() Does

The prettify() method is available on a BeautifulSoup object (and on any Tag inside it). It returns a string of the parsed HTML with each tag and each piece of text on its own line, indented according to its nesting depth.

For example:

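from bs4 import BeautifulSoup

# Sample markup to parse; any HTML string works here
html_doc = "<html><head><title>Page Title</title></head><body><h1>Main Heading</h1><p>Lorem ipsum dolor sit amet.</p></body></html>"

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())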

Instead of printing a long single line of HTML text, it will print something like:

<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h1>
   Main Heading
  </h1>
  <p>
   Lorem ipsum dolor sit amet.
  </p>
 </body>
</html>

This makes the HTML much easier to read.

Specifying Encoding

By default prettify() returns a regular Python string (str). If you pass the encoding argument, it returns the document as bytes in that encoding instead:

print(soup.prettify(encoding="latin-1"))
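The difference between the two return types can be checked directly. This is a minimal sketch, assuming a small made-up document containing a non-ASCII character; the keyword used is `encoding`, as in Beautiful Soup's documented prettify() signature.

```python
from bs4 import BeautifulSoup

# Made-up sample markup with a non-ASCII character (assumption for illustration)
soup = BeautifulSoup("<p>café</p>", "html.parser")

as_text = soup.prettify()                      # no encoding: returns str
as_bytes = soup.prettify(encoding="latin-1")   # with encoding: returns bytes

print(type(as_text).__name__)
print(type(as_bytes).__name__)
```

Because the second call returns bytes, printing it directly shows a bytes literal (b'...') rather than the nicely indented text, so the encoded form is mostly useful for writing to files or sockets.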

Output to a File

To store the formatted HTML in a file, open a file for writing bytes and pass prettify() contents to it:

with open("formatted.html", "wb") as file:
    file.write(soup.prettify(encoding="utf-8"))

This persists the reformatted HTML to disk.
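An alternative that avoids handling bytes is to open the file in text mode and let Python do the encoding. The sketch below assumes a small made-up document and a hypothetical output filename, formatted_text.html.

```python
from bs4 import BeautifulSoup

# Made-up sample markup (assumption for illustration)
soup = BeautifulSoup("<html><body><p>Hello</p></body></html>", "html.parser")

# Hypothetical output path; text mode with an explicit encoding
with open("formatted_text.html", "w", encoding="utf-8") as file:
    file.write(soup.prettify())  # prettify() without encoding returns str

# Read the file back to confirm what was written
with open("formatted_text.html", encoding="utf-8") as file:
    roundtrip = file.read()
```

Either approach should produce the same file contents for UTF-8; the text-mode version simply keeps the encoding choice in one place, on the open() call.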

Limitations

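One caveat is that prettify() only adds whitespace: it does not repair or restructure markup (any repair of malformed HTML is done by the parser when the document is loaded). Because the added newlines can change how inline content renders, the Beautiful Soup documentation recommends using prettify() to inspect documents rather than to reformat them for publishing.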
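The whitespace effect is easiest to see on inline markup. This is a small sketch with a made-up snippet, comparing the compact serialization from str() with the output of prettify().

```python
from bs4 import BeautifulSoup

# Made-up inline markup: note there is no space between </b> and "!"
html = "<p>Hello <b>world</b>!</p>"
soup = BeautifulSoup(html, "html.parser")

compact = str(soup)       # serializes without adding whitespace
pretty = soup.prettify()  # puts tags and strings on separate lines

print(compact)
print(pretty)
```

In the prettified output the "!" ends up on its own line, so a browser would typically render a space before it that the original markup did not have. When the output needs to stay faithful to the source, str(soup) is the safer choice.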

Overall, prettify() is invaluable for debugging and visually inspecting HTML during web scraping with BeautifulSoup.
