Scraping Multiple Pages with Python and BeautifulSoup

Oct 15, 2023 · 5 min read

Web scraping is a useful technique for programmatically extracting data from websites. Often you need to scrape multiple pages from a site to gather complete information. In this article, we will see how to scrape multiple pages using Python and the BeautifulSoup library.

Prerequisites

To follow along, you'll need:

  • Basic knowledge of Python
  • Python 3 installed on your system
  • requests and beautifulsoup4 libraries installed:

    pip install requests beautifulsoup4
    

    Import Modules

    We'll need the requests module to make HTTP requests to the pages and BeautifulSoup to parse the HTML:

    import requests
    from bs4 import BeautifulSoup
    

    Define Base URL

    We'll be scraping a blog - https://copyblogger.com/blog/. The page URLs follow a common pattern:

    https://copyblogger.com/blog/
    https://copyblogger.com/blog/page/2/
    https://copyblogger.com/blog/page/3/
    

    Let's define a base URL pattern:

    base_url = 'https://copyblogger.com/blog/page/{}/'
    

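    The {} placeholder is where str.format() inserts the page number. Note that page 1 is requested as /blog/page/1/ rather than /blog/. Many blogs redirect that address to the first page, and requests follows redirects by default, but it is worth confirming for the site you scrape.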
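
As a quick illustration of how the placeholder expands, here is a minimal sketch that builds the first three page URLs. The helper name build_page_urls is hypothetical and introduced only for this example.

```python
BASE_URL = "https://copyblogger.com/blog/page/{}/"


def build_page_urls(base_url, num_pages):
    """Return the URLs for pages 1 through num_pages (inclusive)."""
    # str.format() replaces the {} placeholder with the page number.
    return [base_url.format(page_num) for page_num in range(1, num_pages + 1)]


page_urls = build_page_urls(BASE_URL, 3)
for page_url in page_urls:
    print(page_url)
```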

    Specify Number of Pages

    Next, we'll specify how many pages we want to scrape. Let's scrape the first 5 pages:

    num_pages_to_scrape = 5
    

    Loop Through Pages

    We can now loop from 1 to num_pages_to_scrape and construct the URL for each page:

    for page_num in range(1, num_pages_to_scrape + 1):
    
      # Construct page URL
      url = base_url.format(page_num)
    
      # Code to scrape each page here
    
    

    Send Request and Check Response

    Inside the loop, we'll use requests.get() to send a GET request to the page URL.

    We'll check that the response status code is 200 to ensure the request succeeded:

    response = requests.get(url)

    if response.status_code == 200:
      # Scrape page (filled in over the next sections)
      pass
    else:
      print(f"Failed to retrieve page {page_num}")
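
The status check only covers responses that actually arrive. A slightly more defensive sketch, assuming you also want a timeout and want the scraper to survive connection errors, might look like the following. The function name fetch_html, the session parameter, and the 10-second timeout are illustrative choices, not part of the walkthrough.

```python
import requests


def fetch_html(session, url, timeout=10):
    """Return the page HTML as text, or None if the page could not be fetched.

    `session` is anything with a requests-style get() method, for example
    a requests.Session().
    """
    try:
        response = session.get(url, timeout=timeout)
    except requests.RequestException as exc:
        # Base class for requests errors: connection failures, timeouts, etc.
        print(f"Request to {url} failed: {exc}")
        return None

    if response.status_code != 200:
        print(f"Failed to retrieve {url} (status {response.status_code})")
        return None

    return response.text
```

Inside the page loop this could be called as html = fetch_html(requests.Session(), url), parsing only when html is not None.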
    

    Parse HTML Using BeautifulSoup

    If the request succeeds, we can parse the HTML using BeautifulSoup:

    soup = BeautifulSoup(response.text, 'html.parser')
    

    This creates a BeautifulSoup object that we can use to extract data.
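
To see what the soup object offers without touching the network, the sketch below parses a small hand-written HTML string. The markup is made up for illustration and is not taken from the blog.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for response.text.
sample_html = """
<html>
  <head><title>Example Blog</title></head>
  <body>
    <article><h2 class="entry-title">First post</h2></article>
    <article><h2 class="entry-title">Second post</h2></article>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.string              # text inside <title>
first_heading = soup.find("h2").get_text()  # find() returns the first match only
article_count = len(soup.find_all("article"))

print(page_title, first_heading, article_count)
```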

    Extract Data

    Now within the loop we can use soup to find and extract the desired data from each page.

    For example, to get all the article elements:

    articles = soup.find_all('article')
    

    We can loop through the articles and extract information like title, URL, author etc.
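
As a sketch of that inner loop, the example below pulls the title, URL, and author out of each article element and tolerates missing tags. The class names (entry-title, entry-title-link, post-author) mirror the ones used in the updated code later in this article and may not match the live site; the sample HTML and the helper name extract_article are invented for illustration. Inspect the real page in your browser's developer tools and adjust the selectors to match.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration; the second article has no author block.
sample_html = """
<article>
  <h2 class="entry-title">
    <a class="entry-title-link" href="https://example.com/post-1/">Post One</a>
  </h2>
  <div class="post-author"><a href="/author/jane/">Jane Doe</a></div>
</article>
<article>
  <h2 class="entry-title">
    <a class="entry-title-link" href="https://example.com/post-2/">Post Two</a>
  </h2>
</article>
"""


def extract_article(article):
    """Return a dict of fields for one <article> tag; missing fields are None."""
    title_tag = article.find("h2", class_="entry-title")
    link_tag = article.find("a", class_="entry-title-link")
    author_div = article.find("div", class_="post-author")
    author_tag = author_div.find("a") if author_div else None
    return {
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "url": link_tag.get("href") if link_tag else None,
        "author": author_tag.get_text(strip=True) if author_tag else None,
    }


soup = BeautifulSoup(sample_html, "html.parser")
records = [extract_article(article) for article in soup.find_all("article")]
for record in records:
    print(record)
```

Returning None for a missing field, rather than letting an AttributeError propagate, keeps one oddly formatted article from stopping the whole run.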

    Full Code

    Our full code to scrape 5 pages looks like this:

    import requests
    from bs4 import BeautifulSoup
    
    base_url = 'https://copyblogger.com/blog/page/{}/'
    num_pages_to_scrape = 5
    
    for page_num in range(1, num_pages_to_scrape + 1):
    
      url = base_url.format(page_num)
    
      response = requests.get(url)
    
      if response.status_code == 200:
    
        soup = BeautifulSoup(response.text, 'html.parser')
    
        articles = soup.find_all('article')
    
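        for article in articles:

          # Extract data from article (class names depend on the site's markup)
          title = article.find('h2', class_='entry-title').text.strip()
          author = article.find('div', class_='post-author').find('a').text.strip()

          print(title)
          print(author)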
    
      else:
        print(f"Failed to retrieve page {page_num}")
    

    This allows us to scrape and extract data from multiple pages sequentially. The full code can be extended to scrape any number of pages.
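
If you do not know the page count in advance, one option is to keep requesting pages until one fails or comes back with no articles. The sketch below shows only that control flow. fetch_page and parse_articles are hypothetical callables you would supply (for example, wrappers around requests.get and soup.find_all), and max_pages is a safety cap chosen for illustration.

```python
def scrape_until_exhausted(base_url, fetch_page, parse_articles, max_pages=100):
    """Collect articles page by page until a page is missing or empty.

    fetch_page(url) should return the page HTML, or None on failure.
    parse_articles(html) should return a list of extracted items.
    """
    results = []
    for page_num in range(1, max_pages + 1):
        html = fetch_page(base_url.format(page_num))
        if html is None:
            break  # request failed, e.g. a 404 past the last page
        articles = parse_articles(html)
        if not articles:
            break  # page loaded but contained nothing to extract
        results.extend(articles)
    return results


# Offline demonstration with stand-in data instead of real HTTP requests.
fake_site = {
    "https://example.com/blog/page/1/": "a,b",
    "https://example.com/blog/page/2/": "c",
}
items = scrape_until_exhausted(
    "https://example.com/blog/page/{}/",
    fetch_page=fake_site.get,  # dict.get returns None for unknown URLs
    parse_articles=lambda html: html.split(","),
)
print(items)
```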

    Summary

  • Use a base URL pattern with {} placeholder
  • Loop through the pages with range()
  • Construct each page URL
  • Send GET request with requests and check response
  • Parse HTML with BeautifulSoup
  • Find and extract data inside the loop
  • Print or store scraped data

    Web scraping enables collecting large datasets that can be analyzed programmatically. With the techniques covered here, you can scrape and extract information from multiple pages of a website in Python.
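
For the "print or store scraped data" step, a minimal sketch of storing results with the standard library's csv module follows. The field names, the sample rows, and the helper name write_articles_csv are placeholders; in the scraper you would append one dict per article inside the loop.

```python
import csv
import io

FIELDNAMES = ["title", "url", "author"]


def write_articles_csv(rows, file_obj):
    """Write a list of article dicts to an open text file as CSV."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder rows standing in for scraped results.
rows = [
    {"title": "Post One", "url": "https://example.com/post-1/", "author": "Jane Doe"},
    {"title": "Post Two", "url": "https://example.com/post-2/", "author": "John Roe"},
]

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_articles_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with open("articles.csv", "w", newline="", encoding="utf-8") in place of the buffer.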

    Updated Full Code

    The version below fills in the extraction step for each article:
    
    import requests
    from bs4 import BeautifulSoup
    
    base_url = 'https://copyblogger.com/blog/page/{}/'
    num_pages_to_scrape = 5
    
    for page_num in range(1, num_pages_to_scrape + 1):
    
      url = base_url.format(page_num)
      
      response = requests.get(url)
      
      if response.status_code == 200:
      
        soup = BeautifulSoup(response.text, 'html.parser')
        
        articles = soup.find_all('article')
        
        for article in articles:
        
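          # Extract the article title (skip articles that have none)
          title_tag = article.find('h2', class_='entry-title')
          if title_tag is None:
            continue
          title = title_tag.text.strip()

          # Extract the article URL (None if the link is missing)
          link_tag = article.find('a', class_='entry-title-link')
          article_url = link_tag['href'] if link_tag else None

          # Extract the author's name (None if the author block is missing)
          author_div = article.find('div', class_='post-author')
          author_link = author_div.find('a') if author_div else None
          author_name = author_link.text.strip() if author_link else None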
    
          # Find the categories container div
          categories_container = article.find('div', class_='entry-categories')
    
          # Extract the categories 
          if categories_container:
            categories = [cat.text.strip() for cat in categories_container.find_all('a')]
          else:
            categories = []
            
          # Print extracted information  
          print("Title:", title)
          print("URL:", article_url)
          print("Author:", author_name)
          print("Categories:", categories)
          print("\n")
          
      else:
        print(f"Failed to retrieve page {page_num}")

    While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

    Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

    This allows scraping at scale without the headache of IP blocks. Proxies API has a free tier to get started: check out the API and sign up for an API key to use it alongside Python libraries like Beautiful Soup.
