Scraping Wikipedia in Java for Beginners

Dec 6, 2023 · 6 min read

Web scraping is the process of extracting data from websites. It can be useful for getting data that is not available through an API or that would take a long time to collect manually.

In this article, we'll walk through a full code example for scraping Wikipedia to get data on all the US presidents. Our use case will be to print out the number, name, term dates, party, election year, and vice president for each president.

This is the table we'll be scraping.

Importing Jsoup

First we import the Jsoup Java library, which we'll use to connect to and parse content from the Wikipedia page:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Jsoup handles the nitty-gritty of HTTP requests and HTML parsing for us. We just need to tell it which page to scrape.
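Jsoup is not part of the JDK, so it has to be added as a dependency first. With Maven, for example (the version shown is illustrative; use the latest release):

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
```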

Defining the URL

We define the Wikipedia URL we want to scrape. Specifically this is the page listing all US presidents:

String url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";

Setting a User Agent

Next we set a User-Agent header to simulate a real browser request. Many websites block obvious scrapers, so this makes our request look more legitimate:

String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";

Getting the HTML Document

Now we use Jsoup to connect to the URL and get the HTML document. We pass the user agent we defined:

Document document = Jsoup.connect(url).userAgent(userAgent).get();

The document contains all the HTML from the Wikipedia page.

Extracting the Presidents Table

Next we want to extract the presidents table.

Inspecting the page

When we inspect the page (right-click → Inspect in most browsers), we can see that the table has the classes "wikitable" and "sortable".

We use a CSS selector to find the table element with class "wikitable sortable":

Element table = document.select("table.wikitable.sortable").first();

Note that first() returns null when nothing matches the selector, so production code should check for that before going further.

We initialize an empty StringBuilder to hold the scraped data:

StringBuilder output = new StringBuilder();

Looping Through Table Rows

Now we loop through the rows of the table. We skip the first row since that is the header. For each row, we grab the data cells:

Elements rows = table.select("tr");

for (Element row : rows.subList(1, rows.size())) {

  Elements columns = row.select("td, th");

  // extract data from cells

}
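The subList(1, ...) call above is just the standard java.util.List idiom for skipping the first element, here the header row. A minimal, self-contained sketch with plain strings standing in for the table rows Jsoup would return:

```java
import java.util.List;

public class SkipHeaderDemo {
    public static void main(String[] args) {
        // Stand-ins for the <tr> elements Jsoup would return
        List<String> rows = List.of("header", "Washington", "Adams", "Jefferson");

        // subList(1, size) is a view of everything after the first element
        for (String row : rows.subList(1, rows.size())) {
            System.out.println(row);
        }
    }
}
```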

Inside the loop, we extract the text from the cells we care about - the number, name, term, party, etc. We append labels and values to the output.

String number = columns.get(0).text();

output.append("Number: ").append(number).append("\n");

Printing the Scraped Data

After the loop, we print out the full scraped president data!

System.out.println(output.toString());

And that's it! We've now written a full Wikipedia scraper to extract president data.

Key Takeaways

  • Use a library like Jsoup to handle HTTP requests and HTML parsing
  • Inspect the page to find the right CSS selectors
  • Loop through data rows and extract info piece by piece
  • Print or store scraped data
  • You could extend this scraper to get more data, export the data to JSON/CSV, store it in a database, and more!
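As a sketch of the CSV-export idea, here is a minimal stdlib-only example. The field names and rows are hypothetical, and the quoting helper is a bare-bones version of RFC 4180 escaping; a real project would likely use a CSV library such as opencsv:

```java
import java.util.List;

public class CsvSketch {
    // Quote a field and escape embedded quotes, RFC 4180 style
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // A couple of hypothetical scraped rows: number, name, party
        List<List<String>> rows = List.of(
                List.of("1", "George Washington", "Unaffiliated"),
                List.of("2", "John Adams", "Federalist"));

        StringBuilder csv = new StringBuilder("number,name,party\n");
        for (List<String> row : rows) {
            csv.append(String.join(",", row.stream().map(CsvSketch::quote).toList()))
               .append("\n");
        }
        System.out.print(csv);
    }
}
```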

    Full code below:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    import java.io.IOException;
    
    public class WikipediaScraper {
        public static void main(String[] args) {
            // Define the URL of the Wikipedia page
            String url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";
    
            try {
                // Define a user-agent header to simulate a browser request
                String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
    
                // Send an HTTP GET request to the URL with the headers
                Document document = Jsoup.connect(url).userAgent(userAgent).get();
    
                // Find the first table with the classes "wikitable" and "sortable"
                Element table = document.select("table.wikitable.sortable").first();
                if (table == null) {
                    System.err.println("Could not find the presidents table");
                    return;
                }
    
                // Initialize empty lists to store the table data
                StringBuilder output = new StringBuilder();
    
                // Iterate through the rows of the table, skipping the header row
                Elements rows = table.select("tr");
                for (Element row : rows.subList(1, rows.size())) {
                    Elements columns = row.select("td, th");

                    // Rows affected by rowspan/colspan can have fewer cells; skip them
                    if (columns.size() < 8) {
                        continue;
                    }
    
                    // Extract data from each column and append it to the output
                    String number = columns.get(0).text();
                    String name = columns.get(2).text();
                    String term = columns.get(3).text();
                    String party = columns.get(5).text();
                    String election = columns.get(6).text();
                    String vicePresident = columns.get(7).text();
    
                    output.append("President Data:\n");
                    output.append("Number: ").append(number).append("\n");
                    output.append("Name: ").append(name).append("\n");
                    output.append("Term: ").append(term).append("\n");
                    output.append("Party: ").append(party).append("\n");
                    output.append("Election: ").append(election).append("\n");
                    output.append("Vice President: ").append(vicePresident).append("\n\n");
                }
    
                // Print the scraped data for all presidents
                System.out.println(output.toString());
            } catch (IOException e) {
                System.err.println("Failed to retrieve the web page: " + e.getMessage());
            }
        }
    }

    In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell that the requests are coming from the same browser!
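A simple way to rotate user agents is to pick one at random for each request. The pool below is illustrative (and truncated); a real project would keep a larger, up-to-date list:

```java
import java.util.List;
import java.util.Random;

public class UserAgentRotator {
    // Illustrative pool; keep these current in a real project
    static final List<String> AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
            "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0");

    static final Random RNG = new Random();

    // Each request would call this before connecting, e.g.
    // Jsoup.connect(url).userAgent(nextUserAgent()).get()
    static String nextUserAgent() {
        return AGENTS.get(RNG.nextInt(AGENTS.size()));
    }

    public static void main(String[] args) {
        System.out.println(nextUserAgent());
    }
}
```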

    As you get a little more advanced, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by trying our rotating proxy and comparing notes. Integration takes only one line of code, so it is hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have solved the headache of IP blocks with this simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
