Web Scraping Wikipedia in Scala

Dec 6, 2023 · 6 min read

Wikipedia is an incredibly useful source of structured data on almost any topic imaginable. Much of this data is presented in table format, making it perfect for scraping. Web scraping is the process of programmatically extracting data from websites. It allows you to harvest large amounts of data without tedious manual copying and pasting.

In this beginner's guide, we'll walk through a simple example of how to scrape Wikipedia using Scala and an HTML parsing library called Jsoup. Our goal will be to gather key data on all US presidents from Wikipedia and print it out.

This is the table we'll be scraping: the sortable list of US presidents near the top of the Wikipedia page.

The Key Steps

Here are the main steps we'll walk through:

  1. Importing libraries
  2. Defining the URL
  3. Setting the user agent string
  4. Sending the HTTP request
  5. Parsing the HTML
  6. Extracting the data
  7. Printing the scraped data

It may sound daunting right now, but web scraping is accessible even to coding beginners with the right guidance. Let's get started!

1. Importing Libraries

We first need to import the libraries that we'll use for scraping and parsing:

import org.jsoup.Jsoup
import scala.collection.JavaConverters._

Jsoup handles connecting to web pages and parsing HTML. The JavaConverters package helps convert between Java and Scala collections since Jsoup uses Java collections under the hood.
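Before these imports compile, Jsoup has to be on the classpath. With sbt, a dependency line like the one below works; the version shown is only an example, so check Maven Central for the latest release:

// build.sbt
libraryDependencies += "org.jsoup" % "jsoup" % "1.15.3"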

2. Defining the URL

Next, we store the Wikipedia URL containing president data in a variable:

val url = "<https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States>"

Hard-coding the URL like this is OK for one-off scraping. For larger projects, consider loading URLs from a file or database.
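For illustration, reading a list of URLs from a plain-text file (one URL per line) could look like this; the file name urls.txt is just a placeholder:

import scala.io.Source

// read one URL per line, closing the file when done
val source = Source.fromFile("urls.txt")
val urls = try source.getLines().toList finally source.close()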

3. Setting the User Agent

Websites can detect and block scraping bots. To mimic a real browser, we need to set a valid user agent header:

val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

This makes Wikipedia think the request comes from Chrome running on Windows 10. User agent strings can be tricky to get right, so we used a real-world Chrome string above.
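If you scrape more heavily, it can also help to vary the user agent between requests. A minimal sketch, assuming you maintain your own small pool of real browser strings:

import scala.util.Random

// a small hand-picked pool of real browser user agents (examples only)
val userAgents = Seq(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
)

// pick one at random for each request
val userAgent = userAgents(Random.nextInt(userAgents.length))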

4. Sending the HTTP Request

Now we can send an HTTP GET request for the URL. We pass the user agent in:

val doc = Jsoup.connect(url).userAgent(userAgent).get()

The get() call sends the request and fetches the page HTML, which is saved in the doc variable for parsing in the next step.
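Note that get() throws an exception if the request fails or times out, so anything beyond a quick script should handle that. A minimal sketch using a 10-second timeout and scala.util.Try:

import scala.util.{Try, Success, Failure}

val docTry = Try(Jsoup.connect(url).userAgent(userAgent).timeout(10000).get())

docTry match {
  case Success(page) => println("Fetched: " + page.title())
  case Failure(e)    => println("Request failed: " + e.getMessage)
}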

5. Parsing the HTML

Jsoup represents HTML documents as a nested Document Object Model (DOM) that we can query with CSS selectors, like jQuery.

Inspecting the page in the browser's developer tools, we can see that the table has the classes wikitable and sortable.

Let's grab the presidents table:

val table = doc.select("table.wikitable.sortable").first()

This finds the <table> tag with the matching CSS classes and extracts just the first match.

💡 Tip: Be as specific as possible with selectors to avoid grabbing unnecessary data.
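For example, once the table element is in hand, you can keep narrowing down with more specific selectors. The selectors below are illustrative only; check them against the live markup with your browser's inspector:

// every row in the presidents table
val rows = table.select("tr")

// header cells of the first row (assumes the table has at least one row)
val headers = table.select("tr").first().select("th")

// links inside body cells, e.g. the linked president names
val links = table.select("td a")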

6. Extracting the Data

First we create a mutable buffer to hold the rows, then iterate through the table rows, get each cell, and extract the text:

val data = scala.collection.mutable.ListBuffer.empty[List[String]]

for (row <- table.select("tr").asScala.drop(1)) { // skip the header row
  val columns = row.select("td,th").asScala.toList
  val row_data = columns.map(_.text())
  data += row_data
}

This skips the header row, gets the text from every cell in each subsequent row, and stores it in our data list for later use.
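Wikipedia cells often include footnote markers such as "[a]" or "[12]". If the cells you extract contain them, a simple regex pass over the collected rows is usually enough to clean things up; this is an optional sketch, not something this particular table is guaranteed to need:

// strip bracketed footnote markers like [a] or [12] from every cell
val cleaned = data.map(_.map(_.replaceAll("""\[\w+\]""", "").trim)).toList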

7. Printing the Scraped Data

Finally, we can neatly print out all president data:

for (president_data <- data) {
  println("Name: " + president_data(2))
  println("Term: " + president_data(3))
  // print more attributes...
}
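Because of merged cells and rowspans, some rows can end up with fewer columns than others, so a hard-coded index like president_data(3) may throw an IndexOutOfBoundsException. A defensive variant using lift, which returns an Option instead of failing, might look like this:

for (president_data <- data) {
  // lift returns None when the index is out of range
  println("Name: " + president_data.lift(2).getOrElse("n/a"))
  println("Term: " + president_data.lift(3).getOrElse("n/a"))
}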

And we have successfully scraped structured data from a Wikipedia table!

The full runnable code example is listed again below for reference:

import org.jsoup.Jsoup
import scala.collection.JavaConverters._

object WikipediaScraper {
  def main(args: Array[String]): Unit = {
    // Define the URL of the Wikipedia page
    val url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

    // Define a user-agent header to simulate a browser request
    val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

    // Send an HTTP GET request to the URL with the user-agent header
    val doc = Jsoup.connect(url).userAgent(userAgent).get()

    // Find the table with the specified class name
    val table = doc.select("table.wikitable.sortable").first()

    // Initialize an empty list buffer to store the table data
    val data = scala.collection.mutable.ListBuffer.empty[List[String]]

    // Iterate through the rows of the table
    for (row <- table.select("tr").asScala.drop(1)) { // Skip the header row
      val columns = row.select("td,th").asScala.toList

      // Extract data from each column and append it to the data list
      val row_data = columns.map(_.text())
      data += row_data
    }

    // Print the scraped data for all presidents
    for (president_data <- data) {
      println("President Data:")
      println("Number: " + president_data(0))
      println("Name: " + president_data(2))
      println("Term: " + president_data(3))
      println("Party: " + president_data(5))
      println("Election: " + president_data(6))
      println("Vice President: " + president_data(7))
      println()
    }
  }
}

Key Takeaways

The key concepts we learned:

  • Web scraping can extract useful data from sites like Wikipedia
  • Use libraries like Jsoup to parse HTML
  • Mimic browsers with fake user agent strings
  • Parse HTML elements with CSS selectors
  • Iterate through elements to extract data
  • Data cleaning may be needed for best results

With just these basics, you can start scraping many useful sites. To level up:

  • Scrape across multiple pages
  • Store data in databases or files (a CSV sketch follows this list)
  • Build GUIs around scrapers
  • Use advanced libraries like Selenium
  • In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same browser
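As a starting point for storing data in files, here is a minimal sketch that writes the scraped rows (the data buffer from the full example above) to a CSV file using java.io.PrintWriter; the file name presidents.csv is just a placeholder, and the quoting is the simplest thing that works:

import java.io.PrintWriter

val out = new PrintWriter("presidents.csv")
try {
  for (row <- data) {
    // wrap each cell in quotes and double any embedded quotes
    out.println(row.map(cell => "\"" + cell.replace("\"", "\"\"") + "\"").mkString(","))
  }
} finally {
  out.close()
}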

If we get a bit more advanced, you will realize that the server can simply block your IP, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with the current offer of 1000 free API calls, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API call, from any programming language:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API key here.
