Scraping Data from Wikipedia with Perl

Wikipedia contains a vast amount of structured data across millions of articles. Often, it can be useful to extract or "scrape" data from Wikipedia pages for use in other applications. In this article, I'll walk through a simple example of scraping tabular data from Wikipedia using Perl.

When Would You Want to Scrape Wikipedia Data?

A few examples where scraping Wikipedia data may be helpful:

Aggregating data from multiple Wikipedia articles into a structured dataset

Loading Wikipedia data into another application like a database or machine learning model

Creating visualizations or statistical analyses based on Wikipedia data

Monitoring Wikipedia for new data based on certain templates or infoboxes

Integrating dynamic Wikipedia data into an application

In short, any use case where you want to use the structured data within Wikipedia pages.

Scraping the Wikipedia Presidents Table

To make things concrete, we'll walk through a full code example of scraping the List of Presidents of the United States table from Wikipedia.

This table contains data like president number, name, term dates, political party, etc. Scraping it will allow us to extract and utilize this data in other applications.

Import Perl Modules

We'll use a couple of Perl modules to send HTTP requests and parse the returned HTML:

use LWP::UserAgent;
use HTML::TreeBuilder::XPath;

LWP::UserAgent - Sends HTTP requests to web servers

HTML::TreeBuilder::XPath - Parses HTML and allows querying DOM elements with XPath

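Make sure these modules are installed to follow along, for example with cpanm LWP::UserAgent HTML::TreeBuilder::XPath. Because the page is served over HTTPS, LWP also needs the LWP::Protocol::https module.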
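As a quick sanity check before running the scraper, the following sketch tries to load each module and reports any that are missing. The helper name missing_modules is made up for this illustration, and the cpanm hint assumes you use cpanminus; any CPAN client works.

```perl
use strict;
use warnings;

# Modules this tutorial relies on; LWP::Protocol::https handles https:// URLs.
my @required = qw(LWP::UserAgent LWP::Protocol::https HTML::TreeBuilder::XPath);

# Return the subset of module names that cannot be loaded.
# (Hypothetical helper, not part of any library.)
sub missing_modules {
    my @modules = @_;
    my @missing;
    for my $module (@modules) {
        # Convert Foo::Bar to Foo/Bar.pm so require can load it by path.
        (my $file = "$module.pm") =~ s{::}{/}g;
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules(@required);
if (@missing) {
    print "Missing modules: @missing\n";
    print "Install them with, for example: cpanm @missing\n";
} else {
    print "All required modules are available.\n";
}
```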

Define Wikipedia URL

We need to pass the URL of the Wikipedia page we want to scrape to the request. We'll define it as:

my $url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";

Create a User Agent String

We'll also set a User-Agent header that mimics a real browser. Some sites block requests that carry a library's default user agent, so this helps avoid being rejected as a scraper:

my $ua = LWP::UserAgent->new(
  agent => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
);

Send HTTP GET Request

We use the user agent to send a simple GET request to fetch the content of the Wikipedia URL:

my $response = $ua->get($url);

And we can check if the request succeeded with:

if ($response->is_success) {
  # Request succeeded, scrape content
} else {
  # Request failed, print error
  print "Request failed with status: " . $response->status_line . "\n";
}

Parse Returned HTML

If the request succeeds, $response holds the server's reply, and its decoded_content method returns the page's HTML as a decoded string. We can parse this using HTML::TreeBuilder::XPath:

my $content = $response->decoded_content;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

This gives us a DOM tree we can now query with XPath to find elements.

Locate Presidents Table

We want to extract the tabular data from the page.

Inspecting the page

When we inspect the page, we can see that the table has the classes wikitable and sortable.

We can use an XPath query to locate this table element:

my ($table) = $tree->findnodes('//table[@class="wikitable sortable"]');
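The exact match on @class only succeeds when the attribute is literally "wikitable sortable". If the page adds another class or reorders them, the query returns nothing. A more tolerant variant, sketched below against made-up sample markup, tests whether the space-separated class list contains wikitable, and stops with a clear message when no table is found. The sample HTML and the extra class name in it are assumptions for illustration only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup: the class attribute carries an extra class,
# so an exact match on "wikitable sortable" would find nothing.
my $html = <<'HTML';
<html><body>
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable jquery-tablesorter">
  <tr><th>No.</th><th>Name</th></tr>
  <tr><th>1</th><td>Example Name</td></tr>
</table>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Match any table whose space-separated class list includes "wikitable".
my $xpath = q{//table[contains(concat(' ', normalize-space(@class), ' '), ' wikitable ')]};
my ($table) = $tree->findnodes($xpath);

# findnodes returns an empty list when nothing matches, leaving $table undefined.
die "No wikitable found; the page layout may have changed\n" unless $table;

my $class = $table->attr('class');
print "Matched table with class: $class\n";
```

If the page has several wikitable tables, this picks the first one in document order, so you may still need a more specific query for other articles.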

Initialize Data Storage

Now that we've located the presidents table, we can loop through it and extract each row. We'll store the extracted data in an array of arrays:

my @data;

Each inner array will store a single president's data.

Loop Through Table Rows

We first find all the rows within the table:

my @rows = $table->findnodes('.//tr[position()>1]');

The XPath query skips the header row.

Then we iterate the rows:

for my $row (@rows) {

  # Extract and store data for this row

}

Extract Row Data

Within the row loop, we grab all the table cells with:

my @columns = $row->findnodes('.//td | .//th');

We can extract the text of each cell and trim leading and trailing whitespace (the /r modifier requires Perl 5.14 or later):

my @row_data = map { $_->as_text =~ s/^\s+|\s+$//gr } @columns;

And append this row's data to the array:

push @data, \@row_data;

So now @data is an array of arrays holding each president's data!
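Numeric indices are easy to mix up, so one option is to convert each row into a hash with named fields. The sketch below uses the column positions that this article's print statements rely on (0, 2, 3, 5, 6 and 7). The field names, the row_to_record helper, and the sample row are all made up for illustration, and the positions may need adjusting if the live table layout differs.

```perl
use strict;
use warnings;

# Column positions taken from the indices this article prints; adjust them
# if the live table layout differs.
my %column_index = (
    number         => 0,
    name           => 2,
    term           => 3,
    party          => 5,
    election       => 6,
    vice_president => 7,
);

# Turn one row (an array reference of cell strings) into a hash reference.
# (Hypothetical helper for this example.)
sub row_to_record {
    my ($row) = @_;
    my %record;
    for my $field (keys %column_index) {
        my $value = $row->[ $column_index{$field} ];
        # Rows with merged cells may be shorter; store an empty string then.
        $record{$field} = defined $value ? $value : '';
    }
    return \%record;
}

# Placeholder row for illustration only, not real scraped output.
my @sample_row = ('1', '', 'Example Name', '1900-1904', '', 'Example Party', '1900', 'Example Deputy');
my $record = row_to_record(\@sample_row);
print "$record->{number}: $record->{name} ($record->{party})\n";
```

In the scraper itself, you could apply it to every row with my @records = map { row_to_record($_) } @data;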

Print Scraped Data

To confirm it worked, we can print out the data:

for my $president_data (@data) {

  print "Number: " . $president_data->[0] . "\n";
  print "Name: " . $president_data->[2] . "\n";

  # Print more fields...

}

And we have successfully scraped the Wikipedia table!

The full code is included again down below.

What's Next?

With the president data extracted, you could now:

Save it to a file or database

Analyze it with statistics

Feed it into a machine learning model

Visualize certain fields over time

Set up scripts to regularly scrape and monitor for changes

The possibilities are endless!
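As one example of the first idea, here is a minimal sketch that writes rows to a CSV file using only core Perl. The helper names, the output file name presidents.csv, and the sample rows are assumptions for illustration. For anything beyond simple data, a dedicated module such as Text::CSV from CPAN is likely the more robust choice.

```perl
use strict;
use warnings;

# Quote one field for CSV: wrap in double quotes and double any embedded quotes.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Join one row (array reference) into a single CSV line.
sub csv_line {
    my ($row) = @_;
    return join(',', map { csv_field($_) } @$row) . "\n";
}

# Write all rows to the given path as UTF-8 and return the row count.
sub write_csv {
    my ($path, $rows) = @_;
    open(my $fh, '>:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
    print {$fh} csv_line($_) for @$rows;
    close($fh) or die "Cannot close $path: $!";
    return scalar @$rows;
}

# Placeholder rows standing in for the scraped @data.
my @data = (
    ['1', 'Example Name', 'Party "A"'],
    ['2', 'Another, Name', 'Party B'],
);
my $written = write_csv('presidents.csv', \@data);
print "Wrote $written rows to presidents.csv\n";
```

Quoting every field keeps commas and quotes inside cell text from breaking the column layout.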

What other interesting Wikipedia data would be useful for you to scrape? Let me know in the comments!

Full Wikipedia Scraping Code

Here is the complete code example again for reference:

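use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;

# The table contains non-ASCII characters (such as dashes in dates), so print as UTF-8
binmode(STDOUT, ':encoding(UTF-8)');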

# Define the URL of the Wikipedia page
my $url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";

# Define a user-agent header to simulate a browser request
my $ua = LWP::UserAgent->new(
    agent => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
);

# Send an HTTP GET request to the URL using the user agent defined above
my $response = $ua->get($url);

# Check if the request was successful (any 2xx status code)
if ($response->is_success) {
    my $content = $response->decoded_content;

    # Parse the HTML content of the page using HTML::TreeBuilder::XPath
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

    # Find the table with the specified class name
    my ($table) = $tree->findnodes('//table[@class="wikitable sortable"]');
    die "Could not find the presidents table\n" unless $table;

    # Initialize an empty array to store the table data
    my @data;

    # Iterate through the rows of the table
    my @rows = $table->findnodes('.//tr[position()>1]');  # Skip the header row
    for my $row (@rows) {
        # Extract data from each column and append it to the data array
        my @columns = $row->findnodes('.//td | .//th');
        my @row_data = map { $_->as_text =~ s/^\s+|\s+$//gr } @columns;
        push @data, \@row_data;
    }

    # Print the scraped data for all presidents
    for my $president_data (@data) {
        # Rows with merged cells may have fewer columns, so fall back to ''
        print("President Data:\n");
        print("Number: ", $president_data->[0] // '', "\n");
        print("Name: ", $president_data->[2] // '', "\n");
        print("Term: ", $president_data->[3] // '', "\n");
        print("Party: ", $president_data->[5] // '', "\n");
        print("Election: ", $president_data->[6] // '', "\n");
        print("Vice President: ", $president_data->[7] // '', "\n");
        print("\n");
    }
} else {
    print("Failed to retrieve the web page. Status: ", $response->status_line, "\n");
}

In more advanced implementations, you will even need to rotate the User-Agent string so the website can't tell it's the same browser!
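A minimal sketch of that idea picks a User-Agent string at random for each new LWP::UserAgent object. The strings in the list are illustrative placeholders and pick_user_agent is a made-up helper; a real list would need to be kept current.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Illustrative User-Agent strings; substitute current, real values.
my @user_agents = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15',
);

# Return one random entry from the list passed in.
# (Hypothetical helper for this example.)
sub pick_user_agent {
    my @choices = @_;
    return $choices[ int(rand(@choices)) ];
}

my $ua = LWP::UserAgent->new(timeout => 10);
# agent() sets the User-Agent header sent with every request from $ua.
$ua->agent(pick_user_agent(@user_agents));
print "Using User-Agent: ", $ua->agent, "\n";
```

To rotate per request instead of per object, call $ua->agent(pick_user_agent(@user_agents)) again before each $ua->get.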

Beyond that, the server can simply block your IP address, which defeats all your other tricks. This is a bummer, and it is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high-speed rotating proxies located all over the world,

with our automatic IP rotation,

with our automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions),

and with our automatic CAPTCHA-solving technology,

hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API call, as shown below, from any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
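In Perl, the same request URL could be built as in the sketch below. It assumes the key and url query parameters shown in the curl command, and proxied_url is a made-up helper name. The URI module does the percent-encoding so that a target URL with its own query string is passed through intact.

```perl
use strict;
use warnings;
use URI;

# Build the proxy request URL from the curl example; API_KEY is a placeholder.
# (Hypothetical helper for this example.)
sub proxied_url {
    my ($api_key, $target_url) = @_;
    my $uri = URI->new('http://api.proxiesapi.com/');
    # query_form percent-encodes the target URL so its own "?" and "&" survive.
    $uri->query_form(key => $api_key, url => $target_url);
    return $uri->as_string;
}

my $request_url = proxied_url('API_KEY', 'https://example.com');
print "$request_url\n";
```

The resulting string can then be passed to $ua->get in place of the Wikipedia URL.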

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
