Scraping Multiple Pages in Java with JSoup

Oct 15, 2023 · 4 min read

Web scraping lets you extract data from websites programmatically. Often the data you want is spread across several paginated pages, so you need to scrape each one to gather the complete set. In this article, we'll see how to scrape multiple pages in Java using the JSoup library.

Prerequisites

To follow along, you'll need:

  • Basic Java knowledge
  • Java and Maven installed
  • JSoup added as a dependency in your pom.xml:

    <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.13.1</version>
    </dependency>
    

    Import JSoup

    We'll need the following imports:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    

    Define Base URL

    We'll scrape a blog - https://copyblogger.com/blog/. The page URLs follow a pattern:

    https://copyblogger.com/blog/
    https://copyblogger.com/blog/page/2/
    https://copyblogger.com/blog/page/3/
    

    Let's define the base URL pattern:

    String baseUrl = "https://copyblogger.com/blog/page/%d/";
    

    The %d is a format placeholder: String.format replaces it with the page number when we build each URL.
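The URL list above shows the first page at the blog root and later pages under /page/N/. A minimal sketch of a helper that follows that pattern is shown here. The class name PageUrls and its method names are hypothetical, and whether /blog/page/1/ also resolves on the site is not something this article confirms, so the sketch simply treats page 1 as the root URL.

```java
import java.util.ArrayList;
import java.util.List;

class PageUrls {

    // First page lives at the blog root; later pages use the /page/N/ pattern.
    static final String FIRST_PAGE = "https://copyblogger.com/blog/";
    static final String PAGE_PATTERN = "https://copyblogger.com/blog/page/%d/";

    /** Returns the URL for a 1-based page number. */
    static String forPage(int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1, got " + page);
        }
        return page == 1 ? FIRST_PAGE : String.format(PAGE_PATTERN, page);
    }

    /** Returns the URLs for pages 1..numPages in order. */
    static List<String> firstPages(int numPages) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= numPages; page++) {
            urls.add(forPage(page));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Prints one URL per line for the first three pages.
        for (String url : firstPages(3)) {
            System.out.println(url);
        }
    }
}
```

Keeping URL construction in one place makes it easy to adjust if the site's pagination scheme turns out to differ from the pattern assumed here.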

    Specify Number of Pages

    Next, we'll specify how many pages to scrape. Let's scrape the first 5 pages:

    int numPages = 5;
    

    Loop Through Pages

    We can now loop from 1 to numPages and construct the URL for each page:

    for (int page = 1; page <= numPages; page++) {
    
      // Construct page URL
      String url = String.format(baseUrl, page);
    
      // Code to scrape each page
    
    }
    

    Send Request and Parse HTML

    Inside the loop, we'll send a GET request and parse the HTML using JSoup:

    Document doc = Jsoup.connect(url).get();
    
    Elements articles = doc.select("article");
    

    This gives us an Elements collection holding every article node on the page, ready for data extraction.

    Extract Data

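    Now, within the page loop, we can iterate over the articles and extract information such as the title, link, and author from each one. The link is stored in a variable named link because url already holds the page URL:

    for (Element article : articles) {
      String title = article.select("h2.entry-title").text();
      String link = article.select("a.entry-title-link").attr("href");
      String author = article.select("div.post-author a").text();
    }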
    

    Full Code

    Our full code to scrape 5 pages is:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    import java.io.IOException;
    import java.util.ArrayList;
    
    public class WebScraper {
    
      public static void main(String[] args) throws IOException {
    
        String baseUrl = "https://copyblogger.com/blog/page/%d/";
        int numPages = 5;
    
        for (int page = 1; page <= numPages; page++) {
    
          String url = String.format(baseUrl, page);
    
          Document doc = Jsoup.connect(url).get();
    
          Elements articles = doc.select("article");
    
          for (Element article : articles) {
    
            String title = article.select("h2.entry-title").text();
            // Named link so it does not clash with the page url declared above
            String link = article.select("a.entry-title-link").attr("href");
            String author = article.select("div.post-author a").text();
            
            Elements categories = article.select("div.entry-categories a");
            ArrayList<String> catTexts = new ArrayList<>();
            for (Element cat : categories) {
              catTexts.add(cat.text());
            }
    
            System.out.println("Title: " + title);
            System.out.println("URL: " + link);
            System.out.println("Author: " + author);
            System.out.println("Categories: " + catTexts);
    
          }
    
        }
    
      }
      
    }

    This allows us to scrape and extract data from multiple pages sequentially in Java using JSoup. The code can be extended to scrape any number of pages.
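When the page count grows, a single failed request would abort the whole run, because main simply declares throws IOException. The following is one possible sketch of a retry-with-pause wrapper, which is an addition and not part of the code above. The PoliteFetcher class and the PageFetcher interface are hypothetical names. In a real scraper the fetcher would be a lambda wrapping the Jsoup.connect(url).get() call; here a simulated fetcher stands in so the sketch runs without network access.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

class PoliteFetcher {

    /** Hypothetical abstraction over "download this URL and return its HTML". */
    interface PageFetcher {
        String fetch(String url) throws IOException;
    }

    /**
     * Tries the fetch up to maxAttempts times, pausing between failed attempts.
     * Throws UncheckedIOException wrapping the last error if every attempt fails.
     */
    static String fetchWithRetry(PageFetcher fetcher, String url, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException lastError = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetcher.fetch(url);
            } catch (IOException e) {
                lastError = e;
                if (attempt < maxAttempts) {
                    sleep(pauseMillis * attempt); // wait a little longer after each failure
                }
            }
        }
        throw new UncheckedIOException("Giving up on " + url, lastError);
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            throw new IllegalStateException("Interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        // Simulated fetcher: fails twice, then succeeds (stands in for a real HTTP call).
        int[] calls = {0};
        PageFetcher flaky = url -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated failure");
            }
            return "<html>ok</html>";
        };
        String html = fetchWithRetry(flaky, "https://copyblogger.com/blog/page/2/", 3, 10);
        System.out.println(html + " after " + calls[0] + " attempts");
    }
}
```

The attempt count and pause length are illustrative values, not recommendations from the site being scraped; choose them to suit the target's tolerance for repeated requests.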

    Summary

  • Use a base URL pattern with a %d placeholder
  • Loop through the pages with a for loop
  • Construct each page URL
  • Send request and parse HTML with JSoup
  • Extract data using selectors
  • Print or store scraped data
    Web scraping enables collecting large datasets programmatically. With the techniques here, you can scrape and extract information from multiple pages of a website in Java.
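The full code only prints each article, so the "store" option from the summary is left open. One possible approach, sketched here under assumptions, is to turn each article into a CSV row. The ArticleCsv class, its method names, the column order, and the sample values in main are all made up for illustration. The quoting follows the common CSV convention of wrapping fields that contain commas, quotes, or line breaks in double quotes and doubling any embedded quotes.

```java
import java.util.List;

class ArticleCsv {

    /** Quotes a field if it contains a comma, a double quote, or a line break. */
    static String escape(String field) {
        String safe = field == null ? "" : field;
        boolean needsQuotes = safe.contains(",") || safe.contains("\"")
                || safe.contains("\n") || safe.contains("\r");
        if (needsQuotes) {
            return "\"" + safe.replace("\"", "\"\"") + "\"";
        }
        return safe;
    }

    /** Builds one CSV row; categories are joined with "; " into a single column. */
    static String toRow(String title, String link, String author, List<String> categories) {
        return String.join(",",
                escape(title),
                escape(link),
                escape(author),
                escape(String.join("; ", categories)));
    }

    public static void main(String[] args) {
        // Sample values are invented; real values would come from the scraping loop.
        System.out.println("title,url,author,categories");
        System.out.println(toRow("Writing, Fast and Slow", "https://example.com/post",
                "Jane Doe", List.of("Writing", "Editing")));
    }
}
```

In the scraper, the println calls inside the article loop could be replaced by a call to toRow, with the resulting lines written to a file once all pages are done.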

    While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

    Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

    This allows scraping at scale without headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

    With the power of Proxies API combined with Java libraries like JSoup, you can scrape data at scale without getting blocked.
