Stripping HTML Tags from Text with BeautifulSoup

When scraping web pages, you'll often want to extract just the text content without all the surrounding HTML tags. Here's how to use BeautifulSoup to cleanly strip out tags and isolate the text.

The get_text() Method

The simplest way is using the get_text() method on either a BeautifulSoup object or an individual tag element:

from bs4 import BeautifulSoup

html = "<p>Example text <b>with</b> <i>some</i> tags</p>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.get_text())
# "Example text with some tags"

get_text() concatenates the text of every descendant element and returns it as a single string, with all tags removed.
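As a rough sketch of the difference between calling get_text() on the whole document and on a single element, the snippet below uses a small made-up HTML fragment (the markup and variable names are assumptions for illustration only). It also shows the separator argument, which is useful because text from adjacent elements is otherwise joined with nothing in between.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from a real page.
html = "<div><h1>Title</h1><p>Body text.</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Whole document: text of every descendant, concatenated in document order.
whole_text = soup.get_text()

# Single element: only the text inside the first <p> tag.
paragraph_text = soup.p.get_text()

# separator is placed between adjacent text fragments.
separated_text = soup.get_text(separator="\n")

print(repr(whole_text))      # 'TitleBody text.'
print(repr(paragraph_text))  # 'Body text.'
print(repr(separated_text))  # 'Title\nBody text.'
```

Without a separator, the heading and paragraph text run together, so passing one is usually worthwhile when the markup contains block-level elements.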

Stripping Tags from Strings

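You can also call get_text() on a NavigableString, the object BeautifulSoup uses for a piece of text inside a tag. Recent BeautifulSoup releases support this; older ones may raise AttributeError. Note that a tag's .string attribute is only set when the tag has a single child. The <p> above has several children, so soup.p.string is None, and this example uses the <b> tag instead:

text = soup.b.string
print(text.get_text())
# "with"

Use this when dealing with a single text element, or when looping over a mix of tags and strings.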
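A minimal sketch of how the .string attribute behaves, reusing the sample markup from the first example. It illustrates why .string is only reliable for tags with a single child, while get_text() works on any tag. The variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

html = "<p>Example text <b>with</b> <i>some</i> tags</p>"
soup = BeautifulSoup(html, "html.parser")

# <b> contains exactly one child (a string), so .string returns it.
bold_string = soup.b.string

# <p> contains several children, so .string is None.
paragraph_string = soup.p.string

# get_text() does not have that restriction.
paragraph_text = soup.p.get_text()

print(bold_string)       # with
print(paragraph_string)  # None
print(paragraph_text)    # Example text with some tags
```

When in doubt about how many children a tag has, calling get_text() on the tag itself avoids the None case entirely.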

Removing Whitespace

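To also strip excess whitespace and newline characters, pass strip=True. The stripped fragments are joined with no separator by default, so pass separator=' ' as well to keep words from running together:

print(soup.get_text(separator=' ', strip=True))
# "Example text with some tags"

The strip parameter trims leading and trailing whitespace from each text fragment and drops fragments that contain only whitespace. With strip=True alone, this example would print "Example textwithsometags".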
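The effect of strip is easier to see on markup that contains indentation and newlines. The following sketch uses a made-up fragment (an assumption for illustration) and compares strip=True on its own, strip=True with a separator, and the related stripped_strings generator.

```python
from bs4 import BeautifulSoup

# Hypothetical, indented sample markup.
html = """
<div>
  <h1>  Title  </h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# strip=True alone: fragments are trimmed, then joined with nothing between them.
stripped = soup.get_text(strip=True)

# Adding a separator keeps the trimmed fragments apart.
spaced = soup.get_text(separator=" ", strip=True)

# stripped_strings yields the same trimmed fragments one at a time.
fragments = list(soup.stripped_strings)

print(stripped)   # TitleFirst paragraph.Second paragraph.
print(spaced)     # Title First paragraph. Second paragraph.
print(fragments)  # ['Title', 'First paragraph.', 'Second paragraph.']
```

Using stripped_strings can be convenient when each fragment needs separate handling, for example one list entry per paragraph.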

Extracting HTML Attributes

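To extract specific HTML attributes from tags, look the attribute up on each tag. The sample HTML above contains no links, so this loop assumes soup was built from a page that has <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))  # Prints the href value, or None if the tag has no href

This loops through every <a> tag and prints its href attribute.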
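Here is a self-contained sketch of attribute extraction, assuming a small made-up list of links (the markup, URLs, and variable names are hypothetical). It pairs each link's text with its href and shows how .get() behaves when an attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the third <a> deliberately has no href.
html = """
<ul>
  <li><a href="/docs" title="Documentation">Docs</a></li>
  <li><a href="https://example.com/blog">Blog</a></li>
  <li><a name="anchor-only">No href here</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute,
# so indexing with a["href"] cannot raise KeyError here.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]

# .get() returns None for a missing attribute instead of raising.
titles = [a.get("title") for a in soup.find_all("a")]

print(links)   # [('Docs', '/docs'), ('Blog', 'https://example.com/blog')]
print(titles)  # ['Documentation', None, None]
```

Filtering with href=True up front tends to be simpler than checking for None inside the loop.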

In summary, get_text() is the primary tool for cleanly extracting just the text content from HTML with BeautifulSoup. Pair it with attribute extraction to pull both text and attribute values from tags.
