Extracting Structured Data by Scraping Wikipedia with Kotlin

Dec 6, 2023 · 6 min read

Wikipedia contains a wealth of valuable information spanning all topics imaginable. Much of this info is tucked away in tables, infoboxes, and other structured data on pages. As developers and data enthusiasts, having access to this data can enable all sorts of useful applications.

That's where web scraping comes into play. Web scraping refers to the technique of programmatically extracting data from websites using tools that parse and select information out of HTML. In this post, we'll walk through a hands-on example of scraping Wikipedia to pull together a dataset of US presidents.

The data we are after is the main table on Wikipedia's list of US presidents.

Setting the Stage with Jsoup

For this tutorial we'll use Jsoup, a Java library designed for working with real-world HTML and scraping data out of it.

First we'll define some imports to pull in the Jsoup functionality:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element

Jsoup contains tools for fetching web pages, traversing/selecting DOM elements, and extracting text or other data.

Making the Initial Request

Let's define the URL of the Wikipedia page we want to scrape:

val url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

This page contains a large structured table with data on each American president.

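To fetch the contents of that page, we use the Jsoup.connect() method, which returns a Connection object. Calling execute() on it sends the GET request and returns a Connection.Response that carries the status code, headers, and body:

val response = Jsoup.connect(url).execute()

One thing I always recommend is setting a User-Agent header. Some servers reject requests that arrive with a library's default user agent, so sending a browser-style string makes the request look like ordinary browser traffic:

val headers = mapOf("User-Agent" to "Mozilla/5.0 ...")

Jsoup.connect(url)
  .headers(headers)
  .ignoreHttpErrors(true) // return 4xx/5xx responses instead of throwing HttpStatusException
  .execute()

Checking for Success

We can then check the status code of the response to confirm it succeeded. By default Jsoup throws an HttpStatusException on error responses, so the ignoreHttpErrors(true) call above is what lets us inspect the status code ourselves:

if (response.statusCode() == 200) {
  // scraping logic here
} else {
  println("Failed, status code: ${response.statusCode()}")
}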

Server responses are full of helpful metadata. Get familiar with status codes to better handle errors.
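
To make the point about status codes concrete, here is a minimal sketch of how a scraper might decide what to do with a response. The FetchOutcome enum and classifyStatus function are hypothetical names for illustration, and the policy itself (retry on 429 and 5xx, give up on everything else) is an assumption rather than a rule:

```kotlin
// Hypothetical outcome categories for a scraper's error handling.
enum class FetchOutcome { SUCCESS, RETRY_LATER, GIVE_UP }

// Maps an HTTP status code to an action.
// Assumption: 429 (Too Many Requests) and 5xx are treated as temporary.
fun classifyStatus(code: Int): FetchOutcome = when (code) {
    in 200..299 -> FetchOutcome.SUCCESS
    429, in 500..599 -> FetchOutcome.RETRY_LATER
    else -> FetchOutcome.GIVE_UP
}

fun main() {
    for (code in listOf(200, 404, 429, 503)) {
        println("$code -> ${classifyStatus(code)}")
    }
}
```

In a real scraper you would pass response.statusCode() to a function like this and back off before retrying.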

Parsing the Page with Jsoup

Now that we have the response, we need to parse its body into a structured Document object that we can query. On the Connection.Response returned by execute(), that is a call to parse():

val soup: Document = response.parse()

This Document represents the entire parsed page, modeled after the DOM tree structure.

Selecting Elements

Inspecting the page

When we inspect the page, we can see that the table carries two classes, wikitable and sortable.

We can use CSS-style selectors to locate specific elements in that document:

val table: Element? = soup.select("table.wikitable.sortable").first()

This selects the first table element with the classes "wikitable" and "sortable".

Jsoup selectors are very powerful - read more about selector syntax.
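
If you want to experiment with selectors without hitting Wikipedia, you can parse a small HTML string instead. The sketch below assumes Jsoup is on the classpath; the firstColumnTexts function and the sample markup are made up for illustration and only mimic the shape of the real table:

```kotlin
import org.jsoup.Jsoup

// Hypothetical helper: returns the text of the first <td> in each row of the
// first table that has both the "wikitable" and "sortable" classes.
fun firstColumnTexts(html: String): List<String> {
    val doc = Jsoup.parse(html)
    val table = doc.selectFirst("table.wikitable.sortable") ?: return emptyList()
    return table.select("tr > td:first-child").map { it.text() }
}

fun main() {
    // Made-up sample markup, not copied from Wikipedia.
    val html = """
        <table class="wikitable sortable">
          <tr><th>No.</th><th>Name</th></tr>
          <tr><td>1</td><td>George Washington</td></tr>
          <tr><td>2</td><td>John Adams</td></tr>
        </table>
        <table class="wikitable"><tr><td>ignored</td></tr></table>
    """.trimIndent()

    println(firstColumnTexts(html)) // prints [1, 2]
}
```

The second table is skipped because it lacks the sortable class, and the header row contributes nothing because it contains th cells rather than td cells.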

Extracting the Data

Finally, we can iterate through the rows and cells of this table to extract our president data:

// drop(1) skips the header row
table?.select("tr")?.drop(1)?.forEach { row ->
  val columns = row.select("td, th")

  val rowData = columns.map { col ->
    col.text()
  }

  println(rowData) // prints the list of cell texts for this row
}

And there you have it - we now have the structured data scraped from Wikipedia! Jsoup handles all the heavy lifting.
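
Lists of strings are fine for printing, but a typed record is easier to work with downstream. Here is a minimal sketch of that step. The President class, the toPresident function, and the column positions are all assumptions for illustration (the real table's layout may differ), and the sample rows are hand-written stand-ins for scraped ones:

```kotlin
// Hypothetical record type for one scraped row.
data class President(val number: String, val name: String, val party: String)

// Converts a row of cell texts into a President, or null if the row is too short.
// Assumption: the column indexes are parameters because the real layout may differ.
fun toPresident(
    cells: List<String>,
    numberCol: Int = 0,
    nameCol: Int = 1,
    partyCol: Int = 2
): President? {
    val number = cells.getOrNull(numberCol) ?: return null
    val name = cells.getOrNull(nameCol) ?: return null
    val party = cells.getOrNull(partyCol) ?: return null
    return President(number, name, party)
}

fun main() {
    val rows = listOf(
        listOf("1", "George Washington", "Unaffiliated"),
        listOf("2", "John Adams") // short row, as can happen with merged cells
    )
    val presidents = rows.mapNotNull { toPresident(it) }
    println(presidents)
    // prints [President(number=1, name=George Washington, party=Unaffiliated)]
}
```

Returning null for short rows keeps one malformed row from crashing the whole run with an IndexOutOfBoundsException.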

Potential Challenges

There are always practical challenges with real-world scraping:

  • Handling cookies/sessions
  • Working around bot protections
  • Dealing with JavaScript rendering
  • Parsing complex page layouts

The example here sidesteps some of those issues, but they come up frequently. We have a whole separate guide on handling those scraper dilemmas.

Now that you have the foundations, you can scrape most semi-structured data off sites like Wikipedia. Always be responsible by adding throttling, caching, and obeying robots.txt.
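
Throttling and caching can be sketched in a few lines. The PoliteFetcher class below is a hypothetical helper, not part of Jsoup: it waits a minimum interval between network calls and remembers pages it has already fetched. The actual fetch is passed in as a function, so in a real scraper you would plug in your Jsoup call. It does not handle robots.txt, and the in-memory cache is an assumption that only suits short runs:

```kotlin
// Hypothetical helper: spaces network calls at least minIntervalMs apart
// and caches response bodies by URL for the lifetime of the object.
class PoliteFetcher(
    private val minIntervalMs: Long,
    private val fetch: (String) -> String, // the real network call, injected
    private val clock: () -> Long = { System.currentTimeMillis() },
    private val sleep: (Long) -> Unit = { Thread.sleep(it) }
) {
    private val cache = HashMap<String, String>()
    private var lastRequestAt: Long? = null

    fun get(url: String): String {
        cache[url]?.let { return it } // cache hit: no network call, no waiting

        val last = lastRequestAt
        if (last != null) {
            val wait = minIntervalMs - (clock() - last)
            if (wait > 0) sleep(wait)
        }

        val body = fetch(url)
        lastRequestAt = clock()
        cache[url] = body
        return body
    }
}

fun main() {
    var calls = 0
    // Stand-in fetch function so the example runs without network access.
    val fetcher = PoliteFetcher(minIntervalMs = 1000, fetch = { url -> calls++; "<html>$url</html>" })

    fetcher.get("https://example.org/a")
    fetcher.get("https://example.org/a") // served from the cache
    fetcher.get("https://example.org/b") // waits about a second first
    println("Network calls made: $calls") // prints Network calls made: 2
}
```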

Check out the full code for this Wikipedia scraper below. Hopefully this gives you a template to start extracting your own datasets!

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element

fun main() {
    // Define the URL of the Wikipedia page
    val url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

    // Define a user-agent header to simulate a browser request
    val headers = mapOf("User-Agent" to "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")

    // Send an HTTP GET request to the URL with the headers.
    // execute() returns a Connection.Response; ignoreHttpErrors(true) makes Jsoup
    // return 4xx/5xx responses instead of throwing HttpStatusException.
    val response = Jsoup.connect(url)
        .headers(headers)
        .ignoreHttpErrors(true)
        .execute()

    // Check if the request was successful (status code 200)
    if (response.statusCode() == 200) {
        // Parse the HTML content of the page using Jsoup
        val soup: Document = response.parse()

        // Find the table with the specified class names
        val table: Element? = soup.select("table.wikitable.sortable").first()

        // Initialize an empty list to store the table data
        val data: MutableList<List<String>> = mutableListOf()

        // Iterate through the rows of the table, skipping the header row
        table?.select("tr")?.drop(1)?.forEach { row ->
            val columns = row.select("td, th")

            // Extract the text of each cell and append the row to the data list
            val rowData = columns.map { col -> col.text() }
            data.add(rowData)
        }

        // Print the scraped data for all presidents
        for (presidentData in data) {
            // Rows with merged cells can be shorter than expected,
            // so fall back to "n/a" instead of indexing out of bounds
            fun cell(i: Int) = presidentData.getOrElse(i) { "n/a" }

            println("President Data:")
            println("Number: ${cell(0)}")
            println("Name: ${cell(2)}")
            println("Term: ${cell(3)}")
            println("Party: ${cell(5)}")
            println("Election: ${cell(6)}")
            println("Vice President: ${cell(7)}")
            println()
        }

    } else {
        println("Failed to retrieve the web page. Status code: ${response.statusCode()}")
    }
}

In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell that the requests come from the same browser!
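
A simple way to rotate is round-robin over a fixed list. The UserAgentRotator class below is a hypothetical helper, and the agent strings are deliberately fake placeholders; in practice you would fill the list with real browser User-Agent strings:

```kotlin
// Hypothetical helper: hands out User-Agent strings in round-robin order.
class UserAgentRotator(private val agents: List<String>) {
    init {
        require(agents.isNotEmpty()) { "need at least one User-Agent string" }
    }

    private var index = 0

    fun next(): String {
        val agent = agents[index]
        index = (index + 1) % agents.size // wrap around at the end of the list
        return agent
    }
}

fun main() {
    // Placeholder values, not real browser strings.
    val rotator = UserAgentRotator(listOf("ExampleBrowser/1.0", "ExampleBrowser/2.0"))

    repeat(3) {
        // In a scraper this value would go into the "User-Agent" header of each request.
        println(rotator.next())
    }
}
```

Run as written, this prints ExampleBrowser/1.0, then ExampleBrowser/2.0, then ExampleBrowser/1.0 again.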

As you get a little more advanced, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

Our rotating proxy server, Proxies API, provides a simple API for dealing with IP blocking problems.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation,
  • With our automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions),
  • With our automatic CAPTCHA solving technology,

hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API call like the one below, from any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
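
From Kotlin, the same call comes down to building that URL and handing it to your HTTP client. Here is a minimal sketch. The buildProxyUrl function is a hypothetical helper, and percent-encoding the target URL is my assumption based on ordinary query-string rules (the curl example above passes it unencoded), so check the service's documentation before relying on it:

```kotlin
import java.net.URLEncoder

// Hypothetical helper: builds the proxy request URL from the endpoint shown above.
// Assumption: the key and target URL are percent-encoded as query parameters.
fun buildProxyUrl(apiKey: String, targetUrl: String): String {
    val encodedKey = URLEncoder.encode(apiKey, "UTF-8")
    val encodedTarget = URLEncoder.encode(targetUrl, "UTF-8")
    return "http://api.proxiesapi.com/?key=$encodedKey&url=$encodedTarget"
}

fun main() {
    // API_KEY is a placeholder, as in the curl example.
    println(buildProxyUrl("API_KEY", "https://example.com"))
    // prints http://api.proxiesapi.com/?key=API_KEY&url=https%3A%2F%2Fexample.com
}
```

The resulting string can be passed to Jsoup.connect() in place of the original page URL.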

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
