Stripping HTML Tags from Text with BeautifulSoup

Oct 6, 2023 · 2 min read

When scraping web pages, you'll often want to extract just the text content without all the surrounding HTML tags. Here's how to use BeautifulSoup to cleanly strip out tags and isolate the text.

The get_text() Method

The simplest way is to call the get_text() method on either a BeautifulSoup object or an individual tag element:

from bs4 import BeautifulSoup

html = "<p>Example text <b>with</b> <i>some</i> tags</p>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.get_text())
# "Example text with some tags"

This strips out all tags and returns the text of every nested element, joined into a single string.
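As a small sketch of calling get_text() on individual tags rather than on the whole document, the snippet below uses a made-up HTML fragment. The markup and the variable names are illustrative assumptions, not taken from a real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = "<div><h1>Title</h1><p>First <b>bold</b> paragraph.</p><p>Second paragraph.</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Text of the whole document: every string, concatenated with no separator.
whole_document = soup.get_text()

# Text of a single tag, including the text of its nested <b> tag.
first_paragraph = soup.find("p").get_text()

# Text of each <p> tag, collected into a list.
all_paragraphs = [p.get_text() for p in soup.find_all("p")]

print(whole_document)   # TitleFirst bold paragraph.Second paragraph.
print(first_paragraph)  # First bold paragraph.
print(all_paragraphs)   # ['First bold paragraph.', 'Second paragraph.']
```

Calling get_text() per tag keeps the boundaries between elements, which the whole-document call loses because adjacent strings are joined directly.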

Stripping Tags from Strings

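Recent versions of BeautifulSoup also let you call get_text() on a NavigableString, the object that represents a piece of text inside a tag:

text = soup.b.string  # NavigableString "with"
print(text.get_text())
# "with"

Note that .string returns None when a tag has more than one child, so soup.p.string is None for the example HTML above and calling get_text() on it would raise an AttributeError. Use .string only when the tag wraps a single text element.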
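One case where get_text() on a NavigableString is handy is iterating over a tag's children, which mixes strings and tags. The sketch below assumes a recent BeautifulSoup 4 release in which NavigableString supports get_text(); the variable names are illustrative:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = "<p>Example text <b>with</b> <i>some</i> tags</p>"
soup = BeautifulSoup(html, "html.parser")

# .string is None when a tag has several children, as <p> does here.
multi_child = soup.p.string
# <b> wraps exactly one string, so .string is a NavigableString.
single_child = soup.b.string

# Children of <p> are a mix of NavigableString and Tag objects;
# get_text() can be called on both (assumes a recent bs4 version).
parts = []
for child in soup.p.children:
    kind = "string" if isinstance(child, NavigableString) else "tag"
    parts.append((kind, child.get_text()))

print(multi_child)   # None
print(single_child)  # with
print(parts)
```

Checking .string for None before using it avoids an AttributeError on tags that contain nested markup.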

Removing Whitespace

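To also strip leading and trailing whitespace, including newlines, from each piece of text, pass strip=True. The stripped pieces are joined with the separator, which is an empty string by default, so pass a space as the separator to keep words apart:

print(soup.get_text(" ", strip=True))
# "Example text with some tags"

Without the separator, get_text(strip=True) would return "Example textwithsometags" for this HTML. Whitespace inside a piece of text is left untouched.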
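To see how strip and the separator argument behave on messier markup, here is a minimal sketch. The HTML fragment and variable names are hypothetical. The last step uses plain Python str.split() and str.join() to collapse runs of whitespace inside a string, which get_text() leaves in place:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with indentation, padding, and a line break inside <p>.
messy_html = """
<div>
    <h2>  Heading  </h2>
    <p>First   line
    of text.</p>
</div>
"""
soup = BeautifulSoup(messy_html, "html.parser")

# Raw text keeps every newline and indent from the source.
raw = soup.get_text()

# strip=True trims each string and drops whitespace-only ones;
# the separator is placed between the strings that remain.
stripped = soup.get_text(separator=" ", strip=True)

# Collapse all remaining internal whitespace to single spaces.
normalized = " ".join(soup.get_text().split())

print(repr(raw))
print(repr(stripped))
print(normalized)  # Heading First line of text.
```

The split-and-join step is a general Python idiom rather than a BeautifulSoup feature; it is useful when the source HTML wraps lines inside paragraphs.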

Extracting HTML Attributes

To extract specific HTML attributes from tags:

for link in soup.find_all('a'):
  print(link.get('href')) # Prints attribute value

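This loops through every <a> tag and prints the value of its href attribute. get() returns None when a tag lacks the attribute. The example HTML above contains no links, so run this against a document that does.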
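As a self-contained sketch of attribute extraction, the snippet below pairs each link's text with its href. The HTML, URLs, and variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links with an href and one anchor without.
html = (
    '<p><a href="https://example.com/docs">Docs</a> '
    '<a id="top">Anchor</a> '
    '<a href="/about" class="nav main">About</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that actually have an href attribute.
links = []
for link in soup.find_all("a", href=True):
    links.append((link.get_text(strip=True), link["href"]))

# get() returns None for a missing attribute instead of raising KeyError.
missing = soup.find("a", id="top").get("href")

# Multi-valued attributes such as class come back as a list.
classes = soup.find_all("a")[2].get("class")

print(links)    # [('Docs', 'https://example.com/docs'), ('About', '/about')]
print(missing)  # None
print(classes)  # ['nav', 'main']
```

Filtering with href=True avoids None values in the result, while get() is the safer choice when the attribute may be absent.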

In summary, get_text() is the primary tool for cleanly extracting just the text content from HTML with BeautifulSoup. Pair it with attribute extraction to pull both text and attribute values from tags.
