Finding Headers in BeautifulSoup

Oct 6, 2023 · 2 min read

When parsing HTML and XML documents, accessing and working with headers is a common task. In BeautifulSoup, header tags (<h1> to <h6>) have some particular behaviors and access patterns that are useful to understand.

Finding Headers

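To find header tags, use find() for the first match or find_all() for every match. For example, soup.find('h1') returns the first h1, while soup.find_all('h2') and soup.find_all('h3') return all h2 tags and all h3 tags respectively.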
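As a minimal sketch of these lookups, the following uses a small, made-up HTML sample (the markup and variable names are assumptions for illustration) and the built-in html.parser backend. It also shows that find_all() accepts a list of tag names, which is one way to collect every header level in document order.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = """
<h1>Main Title</h1>
<h2>First Section</h2>
<h3>Detail A</h3>
<h2>Second Section</h2>
<h3>Detail B</h3>
"""

soup = BeautifulSoup(html, "html.parser")

first_h1 = soup.find("h1")      # first <h1>, or None if there is none
all_h2 = soup.find_all("h2")    # list-like result with every <h2>
all_h3 = soup.find_all("h3")    # list-like result with every <h3>

# Passing a list of names matches any of them, in document order.
all_headers = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])

print(first_h1.get_text())
print([h.get_text() for h in all_h2])
print([h.name for h in all_headers])
```

Because find() returns None when nothing matches, it is worth checking the result before reading attributes from it on real pages.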

Contents Access

The main contents of a header tag can be accessed through the .string attribute:

h1 = soup.find('h1')
title_text = h1.string

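The .text attribute also works but handles nested tags differently: .string returns None when the header has more than one child node (for example, text plus an <em> tag), while .text concatenates the text of all descendants.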
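To make the difference concrete, here is a short sketch comparing .string and get_text() on a plain header and on one with a nested tag. The sample markup is invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one plain header, one with a nested <em> tag.
html = "<h1>Plain title</h1><h2>Title with <em>emphasis</em></h2>"
soup = BeautifulSoup(html, "html.parser")

plain = soup.find("h1")
nested = soup.find("h2")

plain_string = plain.string      # the single text child of <h1>
nested_string = nested.string    # None: <h2> has two children (text + <em>)
nested_text = nested.get_text()  # text of all descendants, concatenated

print(repr(plain_string))
print(repr(nested_string))
print(repr(nested_text))
```

If headers on a page may contain inline markup such as links or emphasis, get_text() is usually the safer default.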

Stripping Whitespace

Header text often has extra whitespace around it. You can strip it with:

title = h1.get_text(strip=True)

Or, to trim only the leading and trailing whitespace of a multiline header:

title = h1.text.strip()
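The two approaches behave differently once a header spans several lines or contains nested tags. The sketch below uses an invented multiline header to show this. get_text(strip=True) strips each text fragment and joins them with no separator, so passing a separator or normalizing whitespace afterwards may be closer to what you want.

```python
from bs4 import BeautifulSoup

# Hypothetical multiline header with a nested tag.
html = "<h1>\n   Annual <b>Report</b>\n   2023\n</h1>"
soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

# Each text fragment is stripped, then joined with no separator.
stripped = h1.get_text(strip=True)

# Same, but the fragments are joined with a single space.
spaced = h1.get_text(separator=" ", strip=True)

# Only leading and trailing whitespace is removed; inner newlines remain.
outer_only = h1.text.strip()

# Collapse every run of whitespace into one space.
normalized = " ".join(h1.get_text().split())

print(repr(stripped))
print(repr(spaced))
print(repr(outer_only))
print(repr(normalized))
```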

Heading Levels

To get the heading level (e.g. 1 for <h1>), use:

level = int(h1.name[1])

This extracts the number from the tag name: h1.name is the string 'h1', so the character at index 1 is the level digit.
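Extending this idea, one possible way to build a simple outline is to match header names with a regular expression and convert the digit to an integer. The helper name heading_level, the pattern and the sample markup are assumptions made for this sketch.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample document.
html = "<h1>Guide</h1><h2>Setup</h2><h3>Install</h3><h2>Usage</h2>"
soup = BeautifulSoup(html, "html.parser")

# Matches exactly h1 through h6 and captures the level digit.
HEADER_NAME = re.compile(r"^h([1-6])$")


def heading_level(tag):
    """Return the heading level as an int, or None for non-header tags."""
    match = HEADER_NAME.match(tag.name)
    return int(match.group(1)) if match else None


# find_all() accepts a compiled pattern and tests it against tag names.
outline = [(heading_level(h), h.get_text()) for h in soup.find_all(HEADER_NAME)]

for level, text in outline:
    print("  " * (level - 1) + text)
```

Using a pattern here avoids misreading tags such as hr or html, whose names also start with "h" but carry no level.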

Next Sibling

A common pattern is finding a header and then extracting the next sibling element:

h1 = soup.find('h1')
content = h1.next_sibling

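This gets the node immediately following the header. In formatted HTML that node is often a whitespace-only string rather than a tag, so use find_next_sibling() when you want the next element.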
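The sketch below illustrates the difference on an invented two-line document: next_sibling returns the newline between the tags, while find_next_sibling() skips text nodes and returns the following tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a newline between the header and the paragraph.
html = """<h1>Title</h1>
<p>Intro paragraph.</p>"""
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1")

raw_sibling = h1.next_sibling         # the "\n" text node after </h1>
next_tag = h1.find_next_sibling()     # the next sibling that is a tag

print(repr(raw_sibling))
print(next_tag.name, "->", next_tag.get_text())
```

find_next_sibling() also accepts a tag name, for example find_next_sibling("p"), if only a specific kind of element should count.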


In summary, headers can be accessed like any other tag, but a few attributes and patterns are especially useful:

  • Using .string for contents
  • Stripping whitespace
  • Extracting the heading level
  • Grabbing next siblings

Mastering these header nuances will help you better parse and process documents in BeautifulSoup.

