Scraping Data from Wikipedia with Perl

Dec 6, 2023 · 7 min read

Wikipedia contains a vast amount of structured data across millions of articles. Often, it can be useful to extract or "scrape" data from Wikipedia pages for use in other applications. In this article, I'll walk through a simple example of scraping tabular data from Wikipedia using Perl.

When Would You Want to Scrape Wikipedia Data?

A few examples where scraping Wikipedia data may be helpful:

  • Aggregating data from multiple Wikipedia articles into a structured dataset
  • Loading Wikipedia data into another application like a database or machine learning model
  • Creating visualizations or statistical analyses based on Wikipedia data
  • Monitoring Wikipedia for new data based on certain templates or infoboxes
  • Integrating dynamic Wikipedia data into an application

In short: any use case where you want to utilize the structured data within Wikipedia pages.

    Scraping the Wikipedia Presidents Table

    To make things concrete, we'll walk through a full code example of scraping the List of Presidents of the United States table from Wikipedia.

    This table contains data like president number, name, term dates, political party, etc. Scraping it will allow us to extract and utilize this data in other applications.

    Import Perl Modules

    We'll use a couple of Perl modules to send HTTP requests and parse the returned HTML:

    use LWP::UserAgent;
    use HTML::TreeBuilder::XPath;
    
  • LWP::UserAgent - Sends HTTP requests to web servers
  • HTML::TreeBuilder::XPath - Parses HTML and allows querying DOM elements with XPath

    Make sure these modules are installed to follow along.
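    If they aren't installed yet, both are available from CPAN (for example via cpan LWP::UserAgent HTML::TreeBuilder::XPath, or with cpanm if you prefer).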

    Define Wikipedia URL

    We need to pass the URL of the Wikipedia page we want to scrape to the request. We'll define it as:

    my $url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";
    

    Create a User Agent String

    We'll also define a user agent header that mimics a real browser. This helps avoid the request blocking that some sites apply to obvious scrapers:

    my $ua = LWP::UserAgent->new(
      agent => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    );
    
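    Optionally, you can also pass a timeout to the constructor so a slow or unresponsive server doesn't hang the script (the 10-second value below is just an example):

    my $ua = LWP::UserAgent->new(
      agent   => "Mozilla/5.0 ...",  # same user agent string as above (truncated here)
      timeout => 10,                 # give up after 10 seconds (example value)
    );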

    Send HTTP GET Request

    We use the user agent to send a simple GET request to fetch the content of the Wikipedia URL:

    my $response = $ua->get($url);
    

    And we can check if the request succeeded with:

    if ($response->is_success) {
      # Request succeeded, scrape content
    } else {
      # Request failed, print error
      print "Request failed with status: " . $response->status_line . "\n";
    }
    
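    If you'd rather stop the script outright when the request fails, a die with the same status line works too; this is just a stylistic alternative to the print above:

    # Alternative: abort immediately on a failed request
    die "Request failed with status: " . $response->status_line . "\n"
      unless $response->is_success;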

    Parse Returned HTML

    If the request succeeds, we have the HTML content of the page saved in $response. We can parse this using HTML::TreeBuilder::XPath:

    my $content = $response->decoded_content;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    

    This gives us a DOM tree we can now query with XPath to find elements.
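
    As a quick sanity check that parsing worked, you can pull out a simple value first, for example the page heading. Note that the id="firstHeading" used below is an assumption about Wikipedia's current markup and may change:

    # Sanity check: print the article's <h1> heading (id="firstHeading" is assumed)
    my $heading = $tree->findvalue('//h1[@id="firstHeading"]');
    print "Page heading: $heading\n";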

    Locate Presidents Table

    We want to extract the tabular data from the page.

    When we inspect the page (for example with the browser's developer tools), we can see that the table has the classes wikitable and sortable.

    We can use an XPath query to locate this table element:

    my ($table) = $tree->findnodes('//table[@class="wikitable sortable"]');
    
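    Note that Wikipedia sometimes adds extra classes to its tables, so an exact @class match can come back empty. If that happens, a contains() query is a more forgiving alternative (shown here as a fallback sketch, not the query used in the rest of this article):

    # Fallback: match any table whose class attribute contains "wikitable"
    my ($table) = $tree->findnodes('//table[contains(@class, "wikitable")]');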

    Initialize Data Storage

    Now that we've located the presidents table, we can loop through it and extract each row. We'll store the extracted data in an array of arrays:

    my @data;
    

    Each inner array will store a single president's data.

    Loop Through Table Rows

    We first find all the rows within the table:

    my @rows = $table->findnodes('.//tr[position()>1]');
    

    The XPath query skips the header row.

    Then we iterate the rows:

    for my $row (@rows) {
    
      # Extract and store data for this row
    
    }
    

    Extract Row Data

    Within the row loop, we grab all the table cells with:

    my @columns = $row->findnodes('.//td | .//th');
    

    We then extract the text from each cell and trim leading and trailing whitespace:

    my @row_data = map { $_->as_text =~ s/^\s+|\s+$//gr } @columns;
    

    And append this row's data to the array:

    push @data, \@row_data;
    

    So now @data contains an array of arrays, with each president's data!
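
    If you want to eyeball the structure before picking out specific fields, Data::Dumper (a core module) is handy:

    use Data::Dumper;
    print Dumper($data[0]);  # dump the first scraped row to inspect the column order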

    Print Scraped Data

    To confirm it worked, we can print out the data:

    for my $president_data (@data) {
    
      print "Number: " . $president_data->[0] . "\n";
      print "Name: " . $president_data->[2] . "\n";
    
      # Print more fields...
    
    }
    

    And we have successfully scraped the Wikipedia table!

    The full code is included again below for reference.

    What's Next?

    With the president data extracted, you could now:

  • Save it to a file or database (see the CSV sketch after this list)
  • Analyze it with statistics
  • Feed it into a machine learning model
  • Visualize certain fields over time
  • Set up scripts to regularly scrape and monitor for changes

    The possibilities are endless!
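
    For example, here is a minimal sketch of saving the rows to a CSV file with the Text::CSV module (an extra CPAN dependency; the presidents.csv filename is just a placeholder):

    use Text::CSV;

    my $csv = Text::CSV->new({ binary => 1, eol => "\n" });
    open my $fh, '>', 'presidents.csv' or die "Cannot open presidents.csv: $!";
    $csv->print($fh, $_) for @data;  # write one CSV line per scraped row
    close $fh;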

    What other interesting Wikipedia data would be useful for you to scrape? Let me know in the comments!

    Full Wikipedia Scraping Code

    Here is the complete code example again for reference:

    use LWP::UserAgent;
    use HTML::TreeBuilder::XPath;
    
    # Define the URL of the Wikipedia page
    my $url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";
    
    # Define a user-agent header to simulate a browser request
    my $ua = LWP::UserAgent->new(
        agent => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    );
    
    # Send an HTTP GET request to the URL with the headers
    my $response = $ua->get($url);
    
    # Check if the request was successful (status code 200)
    if ($response->is_success) {
        my $content = $response->decoded_content;
    
        # Parse the HTML content of the page using HTML::TreeBuilder::XPath
        my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    
        # Find the table with the specified class name
        my ($table) = $tree->findnodes('//table[@class="wikitable sortable"]');
    
        # Initialize empty arrays to store the table data
        my @data;
    
        # Iterate through the rows of the table
        my @rows = $table->findnodes('.//tr[position()>1]');  # Skip the header row
        for my $row (@rows) {
            # Extract data from each column and append it to the data array
            my @columns = $row->findnodes('.//td | .//th');
            my @row_data = map { $_->as_text =~ s/^\s+|\s+$//gr } @columns;
            push @data, \@row_data;
        }
    
        # Print the scraped data for all presidents
        for my $president_data (@data) {
            print("President Data:\n");
        print("Number: ", $president_data->[0], "\n");
        print("Name: ", $president_data->[2], "\n");
        print("Term: ", $president_data->[3], "\n");
        print("Party: ", $president_data->[5], "\n");
        print("Election: ", $president_data->[6], "\n");
        print("Vice President: ", $president_data->[7], "\n");
            print("\n");
        }
    } else {
        print("Failed to retrieve the web page. Status code:", $response->status_line, "\n");
    }

    In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same client making every request!

    Go a little further, though, and you will find that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can often make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with our current offer of 1000 free API calls, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
