The Complete BeautifulSoup Cheatsheet with Examples

Oct 4, 2023 · 7 min read

This cheatsheet covers the core BeautifulSoup 4 API with practical examples.

Installation

Install BeautifulSoup 4 from PyPI:

pip install beautifulsoup4

Import the BeautifulSoup class:

from bs4 import BeautifulSoup

Creating a BeautifulSoup Object

Parse HTML string:

html = "<p>Example paragraph</p>"

soup = BeautifulSoup(html, 'html.parser')

Parse from file:

with open("index.html") as file:
  soup = BeautifulSoup(file, 'html.parser')

BeautifulSoup Object Types

When parsing documents and navigating the parse trees, you will encounter the following main object types:

Tag

A Tag corresponds to an HTML or XML tag in the original document:

soup = BeautifulSoup('<p>Hello World</p>', 'html.parser')
p_tag = soup.p

p_tag.name # 'p'
p_tag.string # 'Hello World'

A Tag can contain nested Tags and NavigableStrings.

NavigableString

A NavigableString represents text content without tags:

soup = BeautifulSoup('Hello World', 'html.parser')
text = soup.string

text # 'Hello World'
type(text) # bs4.element.NavigableString

BeautifulSoup

The BeautifulSoup object represents the parsed document as a whole. It is the root of the tree:

soup = BeautifulSoup('<html><head></head><body></body></html>', 'html.parser')

soup.name # '[document]'
soup.head # <head></head>

Comment

Comments in HTML are also available as Comment objects:

import re

soup = BeautifulSoup('<!-- This is a comment -->', 'html.parser')
comment = soup.find(string=re.compile('This is'))
type(comment) # bs4.element.Comment

Knowing these core object types helps when analyzing, searching, and navigating parsed documents.
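As a minimal sketch of how these types appear together in one tree, the snippet below walks the descendants of a small document and labels each node. The sample markup and the helper name kind are illustrative assumptions, not part of the library.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Illustrative markup containing tags, text, and a comment.
html = "<div><p>Hello <b>World</b></p><!-- a note --></div>"
soup = BeautifulSoup(html, "html.parser")


def kind(node):
    """Return a short label for a parse-tree node (hypothetical helper)."""
    # Comment is a subclass of NavigableString, so it must be tested first.
    if isinstance(node, Comment):
        return "Comment"
    if isinstance(node, NavigableString):
        return "NavigableString"
    if isinstance(node, Tag):
        return "Tag"
    return type(node).__name__


# .descendants yields every node below <div> in document order.
labels = [kind(node) for node in soup.div.descendants]
print(labels)
```

Checking for Comment before NavigableString matters because every Comment is also a NavigableString.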

Searching the Parse Tree

By Name

HTML:

<div>
  <p>Paragraph 1</p>
  <p>Paragraph 2</p>
</div>

Python:

paragraphs = soup.find_all('p')
# [<p>Paragraph 1</p>, <p>Paragraph 2</p>]

By Attributes

HTML:

<div id="content">
  <p>Paragraph 1</p>
</div>

Python:

div = soup.find(id="content")
# <div id="content">...</div>

By Text

HTML:

<p>This is some text</p>

Python:

text = soup.find(string="This is some text")
# 'This is some text' (a NavigableString, not the tag)
p = text.parent
# <p>This is some text</p>
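The three search styles can be tried side by side on a single document. This is a small sketch with made-up sample markup; it also shows that find() returns None when nothing matches, which is worth checking before using the result.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the ids and classes are assumptions for the example.
html = """
<div id="content">
  <p class="intro">Paragraph 1</p>
  <p>Paragraph 2</p>
  <a href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("p")                      # every <p> tag, as a list
by_id = soup.find(id="content")                   # first tag with id="content"
by_class = soup.find("p", class_="intro")         # class_ avoids the Python keyword
by_string = soup.find("p", string="Paragraph 2")  # the <p> whose text matches exactly
missing = soup.find("table")                      # no match: find() returns None

print(len(by_name), by_id.name, by_class.get_text(), by_string.get_text(), missing)
```

Passing a tag name together with string= returns the tag itself rather than the text node inside it.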

Searching with CSS Selectors

CSS selectors provide a very powerful way to search for elements within a parsed document.

Some examples of CSS selector syntax:

By Tag Name

Select all <p> tags:

soup.select("p")

By ID

Select element with ID "main":

soup.select("#main")

By Class Name

Select elements with class "article":

soup.select(".article")

By Attribute

Select tags with a "data-category" attribute:

soup.select("[data-category]")

Descendant Combinator

Select paragraphs inside divs:

soup.select("div p")

Child Combinator

Select paragraphs that are direct children of a div:

soup.select("div > p")

Adjacent Sibling

Select an h2 that immediately follows an h1:

soup.select("h1 + h2")

General Sibling

Select every h2 that follows an h1 with the same parent:

soup.select("h1 ~ h2")

By Text

Select elements containing text:

soup.select(":-soup-contains('Some text')")

By Attribute Value

Select input with type submit:

soup.select("input[type='submit']")

Pseudo-classes

Select first paragraph:

soup.select("p:first-of-type") # each <p> that is the first <p> among its siblings

Chaining

Select first article paragraph:

soup.select("article > p:nth-of-type(1)")
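A short runnable sketch of several of these selectors against one document follows. The markup is invented for illustration. Note that select() always returns a list, while select_one() returns the first match or None.

```python
from bs4 import BeautifulSoup

# Illustrative markup; ids, classes, and data attributes are assumptions.
html = """
<div id="main">
  <h1>Title</h1>
  <h2>Subtitle</h2>
  <article class="article" data-category="news">
    <p>First</p>
    <p>Second</p>
  </article>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("article > p:nth-of-type(1)")  # a single Tag or None
all_paras = soup.select("div p")                       # descendant combinator
tagged = soup.select("[data-category='news']")         # attribute value match
after_h1 = soup.select("h1 + h2")                      # adjacent sibling
nothing = soup.select_one("table")                     # no match: None

print(first.get_text(), [p.get_text() for p in all_paras])
```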

Accessing Data

HTML:

<p class="content">Some text</p>

Python:

p = soup.find('p')
p.name # "p"
p.attrs # {"class": ["content"]} (class is multi-valued, so it is a list)
p.string # "Some text"
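Attribute access has a few edge cases worth seeing in code. The sketch below, on an invented link element, contrasts indexing with get() and shows when .string is None.

```python
from bs4 import BeautifulSoup

# Illustrative markup for the example.
html = '<a class="external link" href="https://example.com">Visit <b>us</b></a>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

href = a["href"]        # indexing raises KeyError if the attribute is missing
title = a.get("title")  # get() returns None instead of raising
classes = a["class"]    # multi-valued attribute: a list of class names
text = a.get_text()     # all nested text joined together
single = a.string       # None, because <a> has more than one child

print(href, title, classes, text, single)
```

When a tag may contain nested markup, get_text() is usually the safer choice than .string.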

The Power of find_all()

The find_all() method is one of the most useful and versatile searching methods in BeautifulSoup.

Returns All Matches

find_all() will find and return a list of all matching elements:

all_paras = soup.find_all('p')

This gives you all paragraphs on a page.

Flexible Queries

You can pass a wide range of queries to find_all():

  • Name - find_all('p')
  • Attributes - find_all('a', class_='external')
  • Text - find_all(string=re.compile('summary'))
  • Limit - find_all('p', limit=2)
  • And more - lists of names, regular expressions, and custom functions

Useful Features

Some useful things you can do with find_all():

  • Get a count - len(soup.find_all('p'))
  • Iterate through results - for p in soup.find_all('p'):
  • Convert to text - [p.get_text() for p in soup.find_all('p')]
  • Extract attributes - [a['href'] for a in soup.find_all('a', href=True)]

Why It's Useful

In summary, find_all() is useful because:

  • It returns all matching elements
  • It supports diverse and powerful queries
  • It makes it easy to extract and process result data

Whenever you need to get a collection of elements from a parsed document, find_all() will likely be your go-to tool.
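The query styles listed above can be sketched on one small document. The markup is invented for illustration, and the lambda is just one possible custom filter.

```python
import re

from bs4 import BeautifulSoup

# Illustrative markup for the example.
html = """
<p>Short summary of the post</p>
<p>Second paragraph</p>
<p>Third paragraph</p>
<a class="external" href="https://example.com">Example</a>
<a href="/local">Local</a>
"""
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("p", limit=2)                  # stop after two matches
external = soup.find_all("a", class_="external")         # filter by class
summaries = soup.find_all(string=re.compile("summary"))  # matching text nodes
with_href = soup.find_all("a", href=True)                # attribute must exist
p_or_a = soup.find_all(["p", "a"])                       # any of several names
local = soup.find_all(                                   # custom function filter
    lambda tag: tag.name == "a" and tag.get("href", "").startswith("/")
)

print(len(first_two), [a["href"] for a in external], summaries)
```

Using href=True before indexing a['href'] avoids a KeyError on anchors that have no href attribute.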

Navigating Trees

Traverse up, down, and sideways through related elements with attributes such as .parent, .children, and .next_sibling.
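A minimal sketch of moving through a tree, assuming a small list with no whitespace between its tags:

```python
from bs4 import BeautifulSoup

# Illustrative markup, written without whitespace between tags so that
# siblings are tags rather than whitespace strings.
html = "<ul><li>One</li><li>Two</li><li>Three</li></ul>"
soup = BeautifulSoup(html, "html.parser")
second = soup.find_all("li")[1]

parent = second.parent                        # up: the enclosing <ul>
prev_item = second.previous_sibling           # sideways: <li>One</li>
next_item = second.next_sibling               # sideways: <li>Three</li>
children = list(parent.children)              # down: direct children only
ancestors = [t.name for t in second.parents]  # every ancestor up to the root
later = second.find_next_sibling("li")        # next sibling that is an <li> tag

print(parent.name, prev_item.get_text(), next_item.get_text(), ancestors)
```

In pretty-printed HTML, .next_sibling is often a whitespace string rather than a tag. find_next_sibling() and find_previous_sibling() skip text nodes and return only tags.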

Modifying the Parse Tree

BeautifulSoup provides several methods for editing the parsed document tree.

HTML:

<p>Original text</p>

Python:

p = soup.find('p')
p.string = "New text"

Edit Tag Names

Change an existing tag name:

tag = soup.find('span')
tag.name = 'div'

Edit Attributes

Add, modify or delete attributes of a tag:

tag['class'] = 'header' # set attribute
tag['id'] = 'main'

del tag['class'] # delete attribute

Edit Text

Change the text of a tag (this replaces any existing children):

tag.string = "New text"

Append text to a tag:

tag.append("Additional text")

Insert Tags

Insert a new tag before an existing one:

new_tag = soup.new_tag("h1")
tag.insert_before(new_tag)

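Delete Tags

Remove a tag from the tree:

tag.extract() # removes the tag and returns it

Use tag.decompose() instead when you do not need the removed tag afterwards.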

Wrap/Unwrap Tags

Wrap a new tag around an existing one:

tag.wrap(soup.new_tag('div'))

Replace a tag with its contents:

tag.unwrap()

Modifying the parse tree is very useful for cleaning up scraped data or extracting the parts you need.
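Several of these edits can be combined into one cleanup pass. The following is a sketch on invented markup; decompose() is used here as one way to drop an unwanted tag.

```python
from bs4 import BeautifulSoup

# Illustrative markup for the example.
html = '<div><span class="old">Original text</span><script>track()</script></div>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span")
span.name = "p"              # rename <span> to <p>
span["class"] = "header"     # overwrite the class attribute
span.string = "New text"     # replace the existing text
span.append(" and more")     # add a second text node

heading = soup.new_tag("h1")
heading.string = "Title"
span.insert_before(heading)  # <h1> becomes the previous sibling

soup.find("script").decompose()  # remove the tag and discard it

result = str(soup)
print(result)
# <div><h1>Title</h1><p class="header">New text and more</p></div>
```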

Outputting HTML

Input HTML:

<p>Hello World</p>

Python:

print(soup.prettify())

# <p>
#  Hello World
# </p>
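prettify() is one of several output options. The sketch below compares it with compact markup, plain text, and the inner markup of a single tag, using an invented document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello <b>World</b></p></div>", "html.parser")

compact = str(soup)               # markup without added whitespace
pretty = soup.prettify()          # one node per line, indented by depth
text_only = soup.get_text()       # text with all tags removed
inner = soup.p.decode_contents()  # markup inside <p>, without <p> itself

print(compact)
print(text_only)
print(inner)
```

prettify() adds whitespace that can change how inline text renders, so str(soup) is usually the better choice when the output will be saved or re-parsed.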
    

Integrating with Requests

Fetch a page:

import requests

res = requests.get("https://example.com")
soup = BeautifulSoup(res.text, 'html.parser')
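In practice it helps to separate fetching from parsing so the parsing step can be tested offline. The following is a sketch under that assumption. The function names extract_links and fetch_links are hypothetical, and the timeout value is an arbitrary choice.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> that has an href (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def fetch_links(url, timeout=10):
    """Download a page and return its links. Performs a network request."""
    import requests  # imported here so extract_links works without requests

    res = requests.get(url, timeout=timeout)
    res.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return extract_links(res.text, url)


# Offline demonstration on sample markup.
sample = '<a href="/docs">Docs</a><a href="https://example.org/">Other</a><a>No link</a>'
links = extract_links(sample, "https://example.com")
print(links)
```

Calling fetch_links("https://example.com") would run the same extraction on a live page.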
    

Parsing Only Parts of a Document

When dealing with large documents, you may want to parse only a fragment rather than the whole thing. BeautifulSoup allows for this with the SoupStrainer class, passed as the parse_only argument. This feature does not work with the html5lib parser.

There are a few ways to parse only parts of a document:

By Tag Name

SoupStrainer takes the same arguments as find_all(), not CSS selector strings. Pass a tag name to parse only those tags:

from bs4 import SoupStrainer

only_tables = SoupStrainer("table")
soup = BeautifulSoup(doc, 'html.parser', parse_only=only_tables)

This will parse only the <table> tags from the document.

The same works for any other tag name:

only_divs = SoupStrainer("div")
soup = BeautifulSoup(doc, 'html.parser', parse_only=only_divs)

By Function

Pass a function that decides which strings to keep:

def is_short_string(string):
  return string is not None and len(string) < 20

only_short_strings = SoupStrainer(string=is_short_string)
soup = BeautifulSoup(doc, 'html.parser', parse_only=only_short_strings)

This keeps only the text strings shorter than 20 characters; the tags around them are discarded.

By Attributes

Parse tags that contain specific attributes:

has_data_attr = SoupStrainer(attrs={"data-category": True})
soup = BeautifulSoup(doc, 'html.parser', parse_only=has_data_attr)

Multiple Conditions

You can combine several conditions in one strainer:

strainer = SoupStrainer("div", id="main")
soup = BeautifulSoup(doc, 'html.parser', parse_only=strainer)

This will parse only <div id="main"> and its contents.

Parsing only the parts you need can help reduce memory usage and improve performance when scraping large documents.
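To see what a strainer leaves behind, the sketch below parses only the links of an invented document and confirms that the other tags never enter the tree.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Illustrative document for the example.
doc = """
<html><body>
  <div id="main"><a href="/one">One</a></div>
  <p>Ignored paragraph</p>
  <a href="/two">Two</a>
</body></html>
"""

only_links = SoupStrainer("a")
soup = BeautifulSoup(doc, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)             # ['/one', '/two']
print(soup.find("p"))    # None: <p> was never added to the tree
```

The matched tags sit at the top level of the resulting tree, so any code that relies on their original parents needs the full document instead.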

Dealing with Encoding

When parsing documents, you may encounter encoding issues. Here are some ways to handle encoding:

Specify at Parse Time

Pass the from_encoding parameter when creating the BeautifulSoup object from bytes:

soup = BeautifulSoup(doc, 'html.parser', from_encoding='utf-8')

This tells BeautifulSoup how to decode the bytes when it first parses the document. The parameter is ignored if doc is already a str.

Encode Tag Contents

You can encode the text of a tag to bytes:

tag.string.encode("utf-8")

Use this when you need a tag's text as bytes, for example when writing to a binary file.

Encode Entire Document

To encode the entire BeautifulSoup document:

soup.encode("utf-8")

This returns a byte string with the encoded document.

Pretty Print with Encoding

Specify the encoding when pretty printing output:

pretty_bytes = soup.prettify(encoding="utf-8") # returns bytes, not str

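Unicode Dammit

BeautifulSoup's UnicodeDammit class can detect the encoding of an incoming document and convert it to Unicode:

from bs4 import UnicodeDammit

dammit = UnicodeDammit(doc)
text = dammit.unicode_markup # the document as a str
dammit.original_encoding # the encoding that was detected

This often works even for documents with a missing or wrong encoding declaration, although detection is a best guess.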

Properly handling encoding ensures your scraped data is decoded and output correctly when using BeautifulSoup.
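A small round trip makes the decode and encode steps concrete. This sketch assumes the input bytes are Latin-1, which is stated explicitly rather than detected.

```python
from bs4 import BeautifulSoup, UnicodeDammit

# Latin-1 bytes; the byte 0xE9 is not valid UTF-8 on its own.
raw = "<p>Café</p>".encode("latin-1")

# Decode on the way in by naming the encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
text = soup.p.string            # a str: 'Café'

# Encode on the way out.
as_utf8 = soup.encode("utf-8")  # bytes: b'<p>Caf\xc3\xa9</p>'

# UnicodeDammit alone, with a list of encodings to try first.
dammit = UnicodeDammit(raw, ["latin-1"])
decoded = dammit.unicode_markup

print(text, as_utf8, decoded)
```

Naming the encoding is more reliable than detection whenever it is known, for example from an HTTP Content-Type header.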
