Finding Headers in BeautifulSoup

Oct 6, 2023 · 2 min read

When parsing HTML and XML documents, accessing and working with headers is a common task. In BeautifulSoup, header tags such as <h1> to <h6> have some particular behaviors and access patterns that are useful to understand.

Finding Headers

To find header tags, you can use:

soup.find('h1')
soup.find_all('h2')
soup.select('h3')

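These calls return the first h1 tag, all h2 tags, and all h3 tags, respectively. Note that find returns a single tag (or None if nothing matches), while find_all and select return lists.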
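As a minimal sketch of the three lookups side by side, the snippet below parses a small made-up document (the HTML and variable names are assumptions for illustration) with the built-in html.parser backend:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = """
<html><body>
<h1>Main Title</h1>
<h2>First Section</h2>
<h3>Detail A</h3>
<h2>Second Section</h2>
<h3>Detail B</h3>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_h1 = soup.find("h1")    # a single Tag, or None when nothing matches
all_h2 = soup.find_all("h2")  # a list-like ResultSet of Tags
all_h3 = soup.select("h3")    # CSS selector; also returns a list of Tags

h2_titles = [tag.get_text() for tag in all_h2]
h3_titles = [tag.get_text() for tag in all_h3]
print(first_h1.get_text(), h2_titles, h3_titles)
```

Because find can return None, it is usually worth checking the result before accessing attributes on it.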

Contents Access

The main contents of a header tag can be accessed through the .string attribute:

h1 = soup.find('h1')
title_text = h1.string

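The .text attribute also works, but it handles nested tags differently: .string returns None when the header has more than one child (for example, text plus an <em> tag), whereas .text concatenates the text of all descendants.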
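The difference shows up as soon as a header contains inline markup. The following sketch uses two made-up headers (an assumption for illustration) to compare the two attributes:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one plain header, one with a nested <em> tag.
html = "<h1>Plain title</h1><h2>Title with <em>emphasis</em></h2>"
soup = BeautifulSoup(html, "html.parser")

plain = soup.find("h1")
nested = soup.find("h2")

plain_string = plain.string    # the single text child of the h1
nested_string = nested.string  # None: the h2 has two children (text and <em>)
nested_text = nested.text      # text of all descendants, concatenated

print(repr(plain_string), repr(nested_string), repr(nested_text))
```

If headers in your documents may contain links or emphasis, .text or get_text() is generally the safer choice.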

Stripping Whitespace

Header text often carries extra leading and trailing whitespace, for example from indented markup. You can strip it with:

title = h1.get_text(strip=True)

Or, to trim only the outer whitespace and leave any line breaks inside a multiline header intact:

title = h1.text.strip()
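The two approaches do not always give the same result. As a rough illustration on a made-up multiline header (an assumption, not taken from a real page), get_text(strip=True) strips each text fragment separately and joins them with no separator, while .text.strip() only trims the ends:

```python
from bs4 import BeautifulSoup

# Hypothetical multiline header with a nested <b> tag.
html = "<h1>\n   Getting <b>Started</b>\n   Guide\n</h1>"
soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

# Each fragment is stripped, then joined with "" (words can run together).
stripped_joined = h1.get_text(strip=True)

# Same stripping, but fragments are joined with a single space.
spaced = h1.get_text(separator=" ", strip=True)

# Only the outer whitespace is removed; inner line breaks remain.
outer = h1.text.strip()

# Collapse every run of whitespace into one space.
normalized = " ".join(h1.text.split())

print(repr(stripped_joined), repr(spaced), repr(outer), repr(normalized))
```

Which variant is appropriate depends on whether the inner spacing matters for your use case.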

Heading Levels

To get the heading level (e.g. 1 for <h1>), use:

level = int(h1.name[1])

This takes the second character of the tag name ("1" in "h1") and converts it to an integer.
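One place the heading level is handy is building a document outline. The sketch below is one possible approach, assuming a made-up document and a regular expression that matches the tag names h1 through h6:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document mixing headings and a paragraph.
html = "<h1>Title</h1><p>Intro</p><h2>Section</h2><h3>Subsection</h3>"
soup = BeautifulSoup(html, "html.parser")

# find_all accepts a compiled pattern and matches it against tag names.
heading_pattern = re.compile(r"^h[1-6]$")

# Pair each heading's numeric level with its text, in document order.
outline = [
    (int(tag.name[1]), tag.get_text(strip=True))
    for tag in soup.find_all(heading_pattern)
]

for level, title in outline:
    print("  " * (level - 1) + title)
```

The anchored pattern keeps other tags that start with "h", such as html or header, out of the results.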

Next Sibling

A common pattern is finding a header and then extracting the next sibling element:

h1 = soup.find('h1')
content = h1.next_sibling

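This gets the node immediately following the header. Be aware that this node is often a whitespace string (such as a newline between tags) rather than a tag; h1.find_next_sibling() skips text nodes and returns the next tag instead.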
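To make the sibling behavior concrete, here is a small sketch on a made-up snippet (an assumption for illustration) where the header and the paragraph are separated by a newline:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a newline between the header and the paragraph.
html = """
<h1>Report</h1>
<p>First paragraph.</p>
"""
soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

raw_sibling = h1.next_sibling     # the newline text node right after </h1>
next_tag = h1.find_next_sibling() # the next Tag, skipping text nodes
next_p = h1.find_next_sibling("p")  # the next sibling that is a <p> tag

print(repr(raw_sibling), next_tag.name, next_p.get_text())
```

Passing a tag name to find_next_sibling is useful when other elements may sit between the header and the content you want.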

Conclusion

In summary, remember headers can be accessed like any other tag but have some useful attributes and patterns like:

  • Using .string for contents
  • Stripping whitespace
  • Extracting the heading level
  • Grabbing next siblings

Mastering these header nuances will help you better parse and process documents in BeautifulSoup.
