Scraping Multiple Pages with Python and BeautifulSoup

Web scraping is a useful technique for programmatically extracting data from websites. Often you need to scrape multiple pages from a site to gather complete information. In this article, we will see how to scrape multiple pages using Python and the BeautifulSoup library.

Prerequisites

To follow along, you'll need:

Basic knowledge of Python

Python 3 installed on your system

requests and beautifulsoup4 libraries installed:

pip install requests beautifulsoup4

Import Modules

We'll need the requests module to make HTTP requests to the pages and BeautifulSoup to parse the HTML:

import requests
from bs4 import BeautifulSoup

Define Base URL

We'll be scraping a blog - https://copyblogger.com/blog/. The page URLs follow a common pattern:

https://copyblogger.com/blog/
https://copyblogger.com/blog/page/2/
https://copyblogger.com/blog/page/3/

Let's define a base URL pattern:

base_url = 'https://copyblogger.com/blog/page/{}/'

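The {} placeholder is where str.format() inserts the page number. For page 1 this pattern produces https://copyblogger.com/blog/page/1/ rather than the bare blog URL listed above, so it is worth confirming that the site serves or redirects that address.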
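
As a quick check of the pattern, the sketch below builds the URLs for the first three pages. The helper name build_page_url is hypothetical, and the special case for page 1 is an assumption based on the URL list above, where the first page has no /page/N/ suffix.

```python
# Hypothetical helper; the URL layout is taken from the list above.
FIRST_PAGE_URL = 'https://copyblogger.com/blog/'
BASE_URL = 'https://copyblogger.com/blog/page/{}/'


def build_page_url(page_num):
    """Return the listing URL for a 1-based page number."""
    if page_num < 1:
        raise ValueError('page_num must be 1 or greater')
    # Assumption: page 1 lives at the bare blog URL, later pages use /page/N/.
    if page_num == 1:
        return FIRST_PAGE_URL
    return BASE_URL.format(page_num)


urls = [build_page_url(n) for n in range(1, 4)]
for url in urls:
    print(url)
```

If the site accepts /page/1/ directly, the special case is unnecessary and base_url.format(page_num) alone is enough.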

Specify Number of Pages

Next, we'll specify how many pages we want to scrape. Let's scrape the first 5 pages:

num_pages_to_scrape = 5

Loop Through Pages

We can now loop from 1 to num_pages_to_scrape and construct the URL for each page:

for page_num in range(1, num_pages_to_scrape + 1):

  # Construct page URL
  url = base_url.format(page_num)

  # Code to scrape each page here

Send Request and Check Response

Inside the loop, we'll use requests.get() to send a GET request to the page URL.

We'll check that the response status code is 200 to ensure the request succeeded:

response = requests.get(url)

if response.status_code == 200:
  # Scrape page (the parsing code from the next step goes here)
  pass
else:
  print(f"Failed to retrieve page {page_num}")
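
A slightly more defensive version of this step might wrap the request in a small function. This is only a sketch: the name fetch_html, the 10-second timeout, and the injectable get parameter (included so the function can be tried without network access) are assumptions, not part of the original code.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page HTML, or None if the request fails.

    `get` defaults to requests.get; it is a parameter only so the
    function can be tried out with a stand-in (assumption for this sketch).
    """
    try:
        response = get(url, timeout=timeout)
    except requests.RequestException as exc:
        # Network-level problems: DNS failure, refused connection, timeout.
        print(f"Request to {url} failed: {exc}")
        return None

    if response.status_code != 200:
        print(f"Failed to retrieve {url} (status {response.status_code})")
        return None

    return response.text
```

Inside the loop, html = fetch_html(url) followed by an "if html is None: continue" check would replace the if/else shown above.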

Parse HTML Using BeautifulSoup

If the request succeeds, we can parse the HTML using BeautifulSoup:

soup = BeautifulSoup(response.text, 'html.parser')

This creates a BeautifulSoup object that we can use to extract data.
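
To see what the parsed object offers without making a request, the sketch below parses a small hand-written HTML string. The markup is invented for illustration and does not come from the target site.

```python
from bs4 import BeautifulSoup

# Invented sample markup, standing in for response.text.
sample_html = """
<html>
  <head><title>Sample Blog</title></head>
  <body>
    <article><h2>First post</h2></article>
    <article><h2>Second post</h2></article>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# The parsed tree can be navigated by tag name or searched.
page_title = soup.title.string
headings = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]

print(page_title)
print(headings)
```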

Extract Data

Now within the loop we can use soup to find and extract the desired data from each page.

For example, to get all the article elements:

articles = soup.find_all('article')

We can loop through the articles and extract information like title, URL, author etc.
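
As a sketch of that inner loop, the example below pulls a title, URL, and author out of each article element. The sample markup is invented, and the class names (entry-title, entry-title-link, post-author) are the ones used in the updated code later in this article; the real page may differ, which is why each lookup is checked for None. The helper name extract_article is hypothetical.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the structure the selectors expect.
sample_html = """
<article>
  <h2 class="entry-title">
    <a class="entry-title-link" href="https://example.com/post-1/">Post One</a>
  </h2>
  <div class="post-author"><a href="/author/ann/">Ann</a></div>
</article>
<article>
  <h2 class="entry-title">Post Two (no link, no author)</h2>
</article>
"""


def extract_article(article):
    """Return a dict of fields; missing elements become None."""
    title_tag = article.find('h2', class_='entry-title')
    link_tag = article.find('a', class_='entry-title-link')
    author_div = article.find('div', class_='post-author')
    author_tag = author_div.find('a') if author_div else None

    return {
        'title': title_tag.get_text(strip=True) if title_tag else None,
        'url': link_tag.get('href') if link_tag else None,
        'author': author_tag.get_text(strip=True) if author_tag else None,
    }


soup = BeautifulSoup(sample_html, 'html.parser')
records = [extract_article(a) for a in soup.find_all('article')]

for record in records:
    print(record)
```

Returning None for a missing field keeps one malformed article from stopping the whole run with an AttributeError.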

Full Code

Our full code to scrape 5 pages looks like:

import requests
from bs4 import BeautifulSoup

base_url = 'https://copyblogger.com/blog/page/{}/'
num_pages_to_scrape = 5

for page_num in range(1, num_pages_to_scrape + 1):

  url = base_url.format(page_num)

  response = requests.get(url)

  if response.status_code == 200:

    soup = BeautifulSoup(response.text, 'html.parser')

    articles = soup.find_all('article')

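    for article in articles:

      # Extract data from article (selectors depend on the page's markup)
      title = article.find('h2', class_='entry-title').text.strip()
      author = article.find('div', class_='post-author').find('a').text.strip()

      print(title)
      print(author)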

  else:
    print(f"Failed to retrieve page {page_num}")

This allows us to scrape and extract data from multiple pages sequentially. The full code can be extended to scrape any number of pages.
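
When extending the loop to many pages, it is common to pause between requests and to stop once a page is missing. The sketch below shows one way to structure that. The name scrape_pages, the one-second default delay, and the injectable fetch and sleep callables are assumptions made for illustration, and a stand-in fetch function is used so the example runs without network access.

```python
import time

BASE_URL = 'https://copyblogger.com/blog/page/{}/'


def scrape_pages(num_pages, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch pages 1..num_pages in order and return {page_num: html}.

    `fetch(url)` should return the HTML as a string, or None on failure.
    Stops at the first failed page, assuming later pages do not exist.
    """
    pages = {}
    for page_num in range(1, num_pages + 1):
        html = fetch(BASE_URL.format(page_num))
        if html is None:
            print(f"Stopping: page {page_num} could not be retrieved")
            break
        pages[page_num] = html
        if page_num < num_pages:
            # Pause between requests to avoid hammering the server.
            sleep(delay_seconds)
    return pages


# Stand-in fetcher for demonstration: pretends only pages 1-3 exist.
def fake_fetch(url):
    page_num = int(url.rstrip('/').rsplit('/', 1)[-1])
    return f"<html>page {page_num}</html>" if page_num <= 3 else None


result = scrape_pages(5, fake_fetch, delay_seconds=0)
print(sorted(result))
```

In real use, fetch would be a function that calls requests.get and returns response.text on a 200 status.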

Summary

Use a base URL pattern with {} placeholder

Loop through the pages with range()

Construct each page URL

Send GET request with requests and check response

Parse HTML with BeautifulSoup

Find and extract data inside the loop

Print or store scraped data

Web scraping enables collecting large datasets that can be analyzed programmatically. With the techniques covered here, you can scrape and extract information from multiple pages of a website in Python.

Updated Full Code

import requests
from bs4 import BeautifulSoup

base_url = 'https://copyblogger.com/blog/page/{}/'
num_pages_to_scrape = 5

for page_num in range(1, num_pages_to_scrape + 1):

  url = base_url.format(page_num)
  
  response = requests.get(url)
  
  if response.status_code == 200:
  
    soup = BeautifulSoup(response.text, 'html.parser')
    
    articles = soup.find_all('article')
    
    for article in articles:
    
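      # Extract the article title (None if the element is missing)
      title_tag = article.find('h2', class_='entry-title')
      title = title_tag.text.strip() if title_tag else None

      # Extract the article URL
      link_tag = article.find('a', class_='entry-title-link')
      article_url = link_tag['href'] if link_tag else None

      # Extract the author's name
      author_div = article.find('div', class_='post-author')
      author_link = author_div.find('a') if author_div else None
      author_name = author_link.text.strip() if author_link else None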

      # Find the categories container div
      categories_container = article.find('div', class_='entry-categories')

      # Extract the categories 
      if categories_container:
        categories = [cat.text.strip() for cat in categories_container.find_all('a')]
      else:
        categories = []
        
      # Print extracted information  
      print("Title:", title)
      print("URL:", article_url)
      print("Author:", author_name)
      print("Categories:", categories)
      print("\n")
      
  else:
    print(f"Failed to retrieve page {page_num}")
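
Instead of printing, the extracted fields could be collected and written to a file. The sketch below uses the standard csv module. The function name, column names, and sample rows are assumptions for illustration, and categories are joined with a semicolon because a CSV cell holds a single string.

```python
import csv
import io

FIELDNAMES = ['title', 'url', 'author', 'categories']


def write_articles_csv(rows, file_obj):
    """Write a list of article dicts to an open text file as CSV."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        flat = dict(row)
        # Flatten the list of categories into one cell.
        flat['categories'] = '; '.join(row.get('categories', []))
        writer.writerow(flat)


# Invented sample rows, shaped like the values printed in the loop above.
rows = [
    {'title': 'Post One', 'url': 'https://example.com/post-1/',
     'author': 'Ann', 'categories': ['Writing', 'SEO']},
    {'title': 'Post Two', 'url': 'https://example.com/post-2/',
     'author': 'Bob', 'categories': []},
]

buffer = io.StringIO()  # stands in for open('articles.csv', 'w', newline='')
write_articles_csv(rows, buffer)
print(buffer.getvalue())
```

In the scraper, each article's fields would be appended to a list inside the loop and the list passed to this function once all pages are done.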

While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

With the power of Proxies API combined with Python libraries like Beautiful Soup, you can scrape data at scale without getting blocked.
