Parsing HTML Tables with BeautifulSoup

Oct 6, 2023 · 2 min read

BeautifulSoup is a useful library for extracting data from HTML tables in Python. With a few simple lines of code, you can parse an HTML table and convert it into a pandas DataFrame for further analysis.

Parsing the Table

To parse an HTML table with BeautifulSoup, first load the HTML document and find the <table> tag. You can then loop through each <tr> row and <td> cell, appending the data to lists:

from bs4 import BeautifulSoup
import requests

url = 'https://example.com/table'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

table = soup.find('table')

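rows = []
for row in table.find_all('tr'):
    # Header rows hold only <th> cells, so they yield no <td> matches; skip them
    cells = [val.get_text(strip=True) for val in row.find_all('td')]
    if cells:
        rows.append(cells)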

This gives you a list of lists containing each cell's text.
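As a self-contained illustration, the sketch below runs the same kind of loop on a small inline HTML string instead of a fetched page. The sample markup and the `parse_table` helper are assumptions made for this example, not something taken from a real page. It also collects the <th> header cells separately from the <td> data cells.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here only for illustration
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Age</th><th>Job</th></tr>
  <tr><td>Alice</td><td>30</td><td>Engineer</td></tr>
  <tr><td>Bob</td><td>25</td><td>Designer</td></tr>
</table>
"""


def parse_table(html):
    """Return (headers, rows) for the first <table> found in html."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        # soup.find returns None when no matching tag exists
        return [], []

    headers, rows = [], []
    for tr in table.find_all('tr'):
        th_cells = [th.get_text(strip=True) for th in tr.find_all('th')]
        td_cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if th_cells and not headers:
            # Treat the first row that has <th> cells as the header row
            headers = th_cells
        if td_cells:
            rows.append(td_cells)
    return headers, rows


headers, rows = parse_table(SAMPLE_HTML)
print(headers)
print(rows)
```

Keeping the headers apart from the data rows means they can later be reused as column names rather than typed by hand.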

Converting to DataFrame

To convert the result to a pandas DataFrame, pass the list of rows along with column names. The number of names must match the number of cells in each row:

import pandas as pd

df = pd.DataFrame(rows, columns=['Name', 'Age', 'Job'])
print(df)

The resulting DataFrame holds one row for each parsed table row, under the column names you supplied.
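Values scraped from HTML arrive as strings, so a column such as Age usually needs converting before any numeric analysis. The sketch below shows one way to do that, assuming hypothetical sample rows and the same three column names; `pd.to_numeric` is a standard pandas function.

```python
import pandas as pd

# Hypothetical rows, in the shape produced by the parsing loop (all strings)
rows = [
    ['Alice', '30', 'Engineer'],
    ['Bob', '25', 'Designer'],
]

df = pd.DataFrame(rows, columns=['Name', 'Age', 'Job'])

# Convert the Age column from text to numbers; errors='coerce' turns
# values that cannot be parsed into NaN instead of raising an exception
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

print(df.dtypes)
print(df['Age'].mean())
```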

Extracting Attributes

You can also extract attributes, such as the href of a link, from table cells:

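rows = []
for row in table.find_all('tr'):
    # Not every cell contains a link; record None where no <a> tag is present
    links = [cell.find('a') for cell in row.find_all('td')]
    rows.append([a.get('href') if a is not None else None for a in links])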
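For a concrete case, the sketch below collects the href from each cell of a small sample table. The markup and URL are made up for illustration. Since `find('a')` returns None for a cell with no link, such cells are recorded as None rather than raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the first cell contains a link
SAMPLE_HTML = """
<table>
  <tr><td><a href="https://example.com/alice">Alice</a></td><td>Engineer</td></tr>
  <tr><td>Bob</td><td>Designer</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
table = soup.find('table')

link_rows = []
for tr in table.find_all('tr'):
    hrefs = []
    for td in tr.find_all('td'):
        anchor = td.find('a')
        # find() gives None when the cell has no <a> tag;
        # get() gives None when the tag has no href attribute
        hrefs.append(anchor.get('href') if anchor is not None else None)
    link_rows.append(hrefs)

print(link_rows)
```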

Converting Strings

To extract a table from an HTML string rather than a fetched page, parse the string first:

html = "<table>...</table>"
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

Then continue parsing as normal.

In summary, BeautifulSoup makes extracting data from HTML tables very straightforward. Pairing it with pandas gives you powerful data analysis capabilities over scraped tabular data.
