Parsing HTML Tables with BeautifulSoup

BeautifulSoup is a useful library for extracting data from HTML tables in Python. With a few simple lines of code, you can parse an HTML table and convert it into a pandas DataFrame for further analysis.

Parsing the Table

To parse an HTML table with BeautifulSoup, first load the HTML document and find the <table> tag. You can then loop through each <tr> row and <td> cell, appending the data to lists:

from bs4 import BeautifulSoup
import requests

url = 'https://example.com/table'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

table = soup.find('table')

rows = []
for row in table.find_all('tr'):
    cells = row.find_all('td')
    if cells:  # skip header rows, which use <th> cells instead of <td>
        rows.append([cell.get_text(strip=True) for cell in cells])

This gives you a list of lists containing each cell's text.
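Header cells usually live in <th> tags rather than <td> tags, so it can help to collect them separately. The following is a minimal sketch of that idea, not a definitive implementation: the sample markup and the `split_table` helper are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Age</th><th>Job</th></tr>
  <tr><td>Alice</td><td>30</td><td>Engineer</td></tr>
  <tr><td>Bob</td><td>25</td><td>Designer</td></tr>
</table>
"""

def split_table(html):
    """Return (headers, rows) for the first table found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers, rows = [], []
    for tr in table.find_all("tr"):
        th_cells = tr.find_all("th")
        td_cells = tr.find_all("td")
        if th_cells and not td_cells:
            # A row made only of <th> cells is treated as the header row.
            headers = [th.get_text(strip=True) for th in th_cells]
        elif td_cells:
            # Any row with <td> cells is treated as a data row.
            rows.append([td.get_text(strip=True) for td in td_cells])
    return headers, rows

headers, rows = split_table(SAMPLE_HTML)
print(headers)
print(rows)
```

Keeping the header row apart means the header text can later serve as column names instead of being hard-coded.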

Converting to DataFrame

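To convert to a pandas DataFrame, pass the list of rows along with column names. The names below assume each row has exactly three cells; pandas raises a ValueError if the number of names does not match the row width: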

import pandas as pd

df = pd.DataFrame(rows, columns=['Name', 'Age', 'Job'])
print(df)

The resulting DataFrame has one row per table row and one column per name. Every value is still a string at this point, because it came from cell text.
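Because scraped cell text arrives as strings, numeric columns generally need an explicit conversion before analysis. Here is a small sketch under the assumption that the table has the three columns used above; the row values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows, shaped like the output of the parsing loop above.
rows = [["Alice", "30", "Engineer"], ["Bob", "25", "Designer"]]

df = pd.DataFrame(rows, columns=["Name", "Age", "Job"])

# Convert the Age column from text to numbers; values that cannot be
# parsed become NaN instead of raising an error.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

print(df.dtypes)
print(df["Age"].mean())
```

Using `errors="coerce"` is one possible choice; dropping it makes pandas raise on unparseable cells, which may be preferable when bad data should stop the pipeline.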

Extracting Attributes

You can also extract other attributes like href links from table cells:

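rows = []
for row in table.find_all('tr'):
    cells = []
    for cell in row.find_all('td'):
        link = cell.find('a')
        # Cells without a link yield None instead of raising AttributeError
        cells.append(link.get('href') if link is not None else None)
    rows.append(cells)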
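The href values in a table are often relative paths, so they may need to be resolved against the page URL before they can be requested. The sketch below shows one way to do that with the standard library's `urljoin`; the sample markup, the `BASE_URL` value, and the `extract_links` helper are assumptions made for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed page URL; relative links are resolved against it.
BASE_URL = "https://example.com/table"

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/users/alice">Alice</a></td><td>No link here</td></tr>
  <tr><td><a href="/users/bob">Bob</a></td><td><a>Anchor without href</a></td></tr>
</table>
"""

def extract_links(html, base_url):
    """Return one list per row holding an absolute URL or None per cell."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    link_rows = []
    for tr in table.find_all("tr"):
        hrefs = []
        for td in tr.find_all("td"):
            # href=True only matches anchors that carry an href attribute.
            anchor = td.find("a", href=True)
            hrefs.append(urljoin(base_url, anchor["href"]) if anchor else None)
        link_rows.append(hrefs)
    return link_rows

links = extract_links(SAMPLE_HTML, BASE_URL)
print(links)
```

Cells with no usable link are recorded as None so that every row keeps the same number of entries.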

Parsing an HTML String

If you already have the HTML as a string, pass it to BeautifulSoup directly instead of fetching it:

html = "<table>...</table>"
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

Then continue parsing as normal.
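One thing to watch for when parsing arbitrary strings: `soup.find('table')` returns None when the markup contains no table, and the row loop would then fail with an AttributeError. A minimal sketch of a guard follows; the `first_table` helper and its error message are hypothetical.

```python
from bs4 import BeautifulSoup

def first_table(html):
    """Return the first <table> tag in the HTML, or raise if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        # Fail early with a clear message rather than on a later attribute access.
        raise ValueError("no <table> element found in the given HTML")
    return table

table = first_table("<table><tr><td>1</td></tr></table>")
print(len(table.find_all("tr")))
```

Raising early keeps the failure close to its cause; returning an empty list of rows instead would be a reasonable alternative when a missing table is expected.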

In summary, BeautifulSoup makes extracting data from HTML tables very straightforward. Pairing it with pandas gives you powerful data analysis capabilities over scraped tabular data.
