Scraping Data from Wikipedia with PHP

Dec 6, 2023 · 7 min read

Web scraping is the process of extracting data from websites automatically. It can be useful for getting data off the web and into a format you can analyze or use programmatically.

In this article, we'll walk through an example of scraping Wikipedia to get data on all the Presidents of the United States.

Why Scrape Wikipedia?

Wikipedia contains structured data in tables that cover an incredibly wide range of topics. Scraping Wikipedia can be useful for research projects, data analysis, aggregating facts for quizzes or games, and more. The data is free to use and constantly updated by the Wikipedia community.

For our example, we'll scrape the table on this page to get data on each president like their name, term start and end dates, party, etc.

Prerequisites

To follow along, you'll need:

  • PHP installed on your machine
  • The cURL extension enabled in PHP (many distributions enable it by default; run php -m to list the loaded extensions)
  • The DOM extension, which provides the DOMDocument and DOMXPath classes we'll use to parse the HTML

    Note: If you don't already have a development environment set up, the easiest route is a package like XAMPP, which includes PHP, the Apache server, and everything else needed to run PHP scripts on your local computer.
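
    Before going further, it can help to confirm that the extensions this tutorial relies on are actually loaded. The sketch below uses PHP's built-in extension_loaded function; the checkExtensions helper name is our own and not part of PHP.

```php
<?php
// Hypothetical helper: report whether each named extension is loaded.
function checkExtensions(array $names): array
{
    $status = [];
    foreach ($names as $name) {
        $status[$name] = extension_loaded($name);
    }
    return $status;
}

// The tutorial needs cURL for the request and DOM/libxml for parsing.
foreach (checkExtensions(['curl', 'dom', 'libxml']) as $name => $loaded) {
    echo str_pad($name, 8) . ($loaded ? 'loaded' : 'MISSING') . "\n";
}
```

    If any line reports MISSING, enable that extension in php.ini or install it through your package manager before running the scraper.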

    Scraping the President Data

    Let's walk through the script line-by-line to understand how it works:

    Define the URL

    We start by defining the URL of the Wikipedia page we want to scrape:

    $url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";
    

    Set a User-Agent Header

    Many sites try to detect and block scraping bots, so we simulate a real browser request by setting a user-agent header:

    $headers = [
        "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)..."
    ];
    

    This makes the request look as if it came from a person using the Chrome browser on Windows.

    Pro Tip: You can get real user agent strings from your browser's developer tools network tab.

    Initialize cURL

    Next we initialize a cURL session, passing in the URL to fetch:

    $ch = curl_init($url);
    

    This only creates a cURL handle for the URL. No request is sent until we call curl_exec later on.

    Set cURL Options

    We configure some options to get the content of the page returned as a string:

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    

    The user agent header is also passed here.
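
    Beyond these two options, a scraper usually benefits from a few defensive settings. Here is a minimal sketch, assuming you want redirects followed and a bounded wait time. buildHandle is a hypothetical helper name, and the timeout values are arbitrary choices rather than recommendations.

```php
<?php
// Hypothetical helper: create a cURL handle with a few defensive defaults.
function buildHandle(string $url, array $headers)
{
    $ch = curl_init($url);
    if ($ch === false) {
        throw new RuntimeException("Could not initialize cURL for $url");
    }

    $ok = curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_HTTPHEADER     => $headers, // same header array as in the article
        CURLOPT_FOLLOWLOCATION => true,     // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,        // but give up after five hops
        CURLOPT_CONNECTTIMEOUT => 10,       // seconds allowed for connecting
        CURLOPT_TIMEOUT        => 30,       // seconds allowed for the whole transfer
        CURLOPT_ENCODING       => '',       // accept any compression cURL supports
    ]);
    if (!$ok) {
        throw new RuntimeException('One of the cURL options could not be set');
    }

    return $ch;
}

// Nothing is sent here; the request only happens when curl_exec($ch) is called.
$ch = buildHandle(
    'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States',
    ['User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)...']
);
```

    curl_setopt_array sets several options in one call and returns false as soon as one of them cannot be set, which is why the sketch checks its return value.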

    Send Request & Get Response

    To execute the request and get the response, we simply run:

    $response = curl_exec($ch);
    

    Then we can check that it was successful:

    if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
      // parsing logic here...
    }
    

    A 200 status code means everything went well.
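
    Checking the status code alone misses one case: when the transfer itself fails (DNS error, timeout, refused connection), curl_exec returns false and there is no meaningful status code to inspect. Below is a hedged sketch that handles both cases. fetchPage is a hypothetical helper name and the timeouts are arbitrary choices.

```php
<?php
// Hypothetical helper: fetch a URL and throw if anything goes wrong.
function fetchPage(string $url, array $headers = []): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    $response = curl_exec($ch);
    $error    = curl_error($ch);                       // empty string when no error occurred
    $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 when no HTTP response arrived

    // Transport-level failure: no HTTP response at all.
    if ($response === false) {
        throw new RuntimeException("Transfer failed: $error");
    }
    // HTTP-level failure: the server answered, but not with 200.
    if ($status !== 200) {
        throw new RuntimeException("Unexpected HTTP status: $status");
    }

    return $response;
}

// Usage (performs a real network request):
// try {
//     $html = fetchPage($url, $headers);
// } catch (RuntimeException $e) {
//     echo $e->getMessage() . "\n";
// }
```

    Throwing an exception keeps the parsing code free of nested if blocks; the caller decides whether to retry, log, or stop.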

    Parse Response HTML

    With the HTML content, we can now parse out the data we want.

    First we load it into a DOMDocument object:

    $dom = new DOMDocument();
    @$dom->loadHTML($response);
    

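    The @ operator suppresses the warnings that loadHTML emits for markup it does not recognize. Wikipedia serves HTML5, while the libxml parser behind DOMDocument was written for HTML 4, so elements such as <nav> or <section> can trigger warnings even when the page itself is fine.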
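
    An alternative to @ is to have libxml collect parse errors internally, which silences the warnings without hiding unrelated errors. The sketch below uses PHP's libxml_use_internal_errors and libxml_clear_errors functions. loadHtmlQuietly is a hypothetical helper name, and the HTML string is a made-up stand-in for the downloaded page.

```php
<?php
// Hypothetical helper: parse HTML without emitting libxml warnings.
function loadHtmlQuietly(string $html): DOMDocument
{
    // Collect parse errors internally; remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected errors and restore the previous behaviour.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// <section> is an HTML5 element that older libxml versions complain about.
$dom = loadHtmlQuietly('<html><body><section><p id="intro">Hello</p></section></body></html>');
echo $dom->getElementsByTagName('p')->item(0)->textContent . "\n";
```

    If you want to inspect the problems instead of discarding them, libxml_get_errors returns the collected list before it is cleared.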

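    Insider Tip: DOMDocument exposes DOM methods such as getElementById and getElementsByTagName, and DOMXPath adds XPath queries on top. The DOMDocument class has no querySelector method, so XPath takes the place of CSS selectors here. It's very powerful for scraping!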

    We then use XPath to select the specific table we want - the one with class wikitable sortable:

    $xpath = new DOMXPath($dom);
    $table = $xpath->query('//table[@class="wikitable sortable"]')->item(0);
    

    This grabs the first matching table element.
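
    Note that @class="wikitable sortable" only matches when the class attribute is exactly that string, and item(0) returns null when nothing matches. A more tolerant variation, sketched under our own assumptions, matches a single class token wherever it appears in the attribute. findFirstTableByClass is a hypothetical helper name, and the HTML is a made-up sample.

```php
<?php
// Hypothetical helper: first <table> whose class list contains $class, or null.
// $class is assumed to be a plain token without quotes or spaces.
function findFirstTableByClass(DOMDocument $dom, string $class): ?DOMElement
{
    $xpath = new DOMXPath($dom);
    // Pad the class attribute with spaces so " wikitable " only matches whole tokens.
    $expr = sprintf(
        '//table[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $node = $xpath->query($expr)->item(0);

    return $node instanceof DOMElement ? $node : null;
}

$html = '<table class="infobox"><tr><td>skip</td></tr></table>'
      . '<table class="wikitable sortable extra"><tr><td>keep</td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$table = findFirstTableByClass($dom, 'wikitable');
echo $table === null ? "no table found\n" : trim($table->textContent) . "\n";
// prints "keep"
```

    Checking for null before using the result avoids a fatal error if Wikipedia changes the table's markup.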

    Extract Table Data

    With the table node, we can loop through rows and cells to save the data:

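    $data = [];
    $rows = $table->getElementsByTagName('tr');
    
    foreach ($rows as $row) {
      $rowData = [];
      // Header cells (th) first, then data cells (td), as in the full script
      foreach ($row->getElementsByTagName('th') as $col) {
        $rowData[] = $col->textContent;
      }
      foreach ($row->getElementsByTagName('td') as $col) {
        $rowData[] = $col->textContent;
      }
      if (!empty($rowData)) {
        $data[] = $rowData;
      }
    }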
    

    We add all rows to the $data array, ending up with a nice 2D array containing all presidents' data!
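
    The loop above reads cells with getElementsByTagName, which returns every descendant with that tag name. A variation, sketched below under our own assumptions, selects the th and td children of each row together with an XPath union so the cells come back in document order, and collapses stray whitespace in each cell. extractRows is a hypothetical helper name, and the table is a small hand-written sample rather than the real Wikipedia markup.

```php
<?php
// Hypothetical helper: turn a table element into a 2D array of cleaned cell text.
function extractRows(DOMElement $table): array
{
    $xpath = new DOMXPath($table->ownerDocument);
    $rows = [];

    foreach ($xpath->query('.//tr', $table) as $tr) {
        $cells = [];
        // "./th | ./td" selects direct child cells in document order.
        foreach ($xpath->query('./th | ./td', $tr) as $cell) {
            // Collapse newlines and repeated spaces into single spaces.
            $cells[] = trim(preg_replace('/\s+/', ' ', $cell->textContent));
        }
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }

    return $rows;
}

$html = <<<'HTML'
<table>
  <tr><th>No.</th><th>Name</th><th>Party</th></tr>
  <tr><th>1</th><td>George
      Washington</td><td>Unaffiliated</td></tr>
  <tr><th>2</th><td>John Adams</td><td>Federalist</td></tr>
</table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$table = $dom->getElementsByTagName('table')->item(0);

foreach (extractRows($table) as $row) {
    echo implode(' | ', $row) . "\n";
}
```

    The first row returned is the header row, so skip it or use it as column names depending on what you need.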

    Output Scraped Data

    Finally, we can print the scraped info or save it to CSV, JSON, etc:

    // Print the scraped data for all presidents
    foreach ($data as $presidentData) {
        echo "President Data:\n";
        // ?? '' avoids "undefined array key" warnings on the header row and on short rows
        echo "Number: " . ($presidentData[0] ?? '') . "\n";
        echo "Name: " . ($presidentData[2] ?? '') . "\n";
        echo "Term: " . ($presidentData[3] ?? '') . "\n";
        echo "Party: " . ($presidentData[5] ?? '') . "\n";
        echo "Election: " . ($presidentData[6] ?? '') . "\n";
        echo "Vice President: " . ($presidentData[7] ?? '') . "\n\n";
    }
    

    And we've successfully scraped the Wikipedia table!
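
    For the CSV and JSON options mentioned above, PHP's built-in fputcsv and json_encode functions are enough. The sketch below assumes PHP 7.4 or later (for the empty escape argument to fputcsv). rowsToCsv and rowsToJson are hypothetical helper names, and the two sample rows stand in for the scraped $data array.

```php
<?php
// Hypothetical helper: serialize a 2D array as CSV text.
function rowsToCsv(array $rows): string
{
    // Write into an in-memory stream so fputcsv handles all the quoting.
    $stream = fopen('php://temp', 'r+');
    foreach ($rows as $row) {
        // Explicit delimiter, enclosure, and (empty) escape character.
        fputcsv($stream, $row, ',', '"', '');
    }
    rewind($stream);
    $csv = stream_get_contents($stream);
    fclose($stream);

    return $csv;
}

// Hypothetical helper: serialize a 2D array as readable JSON.
function rowsToJson(array $rows): string
{
    return json_encode(
        $rows,
        JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR
    );
}

// Sample rows standing in for the scraped data.
$data = [
    ['1', 'George Washington', 'Unaffiliated'],
    ['2', 'John Adams', 'Federalist'],
];

echo rowsToCsv($data);
echo rowsToJson($data) . "\n";
```

    To write files instead of printing, pass the returned strings to file_put_contents, for example with file names of your choosing such as presidents.csv and presidents.json.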

    Full Script

    Here is the full script putting all the pieces together:

    <?php
    // Define the URL of the Wikipedia page
    $url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";
    
    // Define a user-agent header to simulate a browser request
    $headers = [
        "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    ];
    
    // Initialize cURL session
    $ch = curl_init($url);
    
    // Set cURL options
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    
    // Send the HTTP GET request
    $response = curl_exec($ch);
    
    // Check if the request was successful (status code 200)
    if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
        // Create a DOMDocument object and load the HTML content
        $dom = new DOMDocument();
        @$dom->loadHTML($response); // '@' to suppress HTML parsing warnings
    
        // Find the table with the specified class name
        $xpath = new DOMXPath($dom);
        $table = $xpath->query('//table[@class="wikitable sortable"]')->item(0);
    
        // Initialize empty arrays to store the table data
        $data = [];
    
        // Iterate through the rows of the table
        $rows = $table->getElementsByTagName('tr');
        foreach ($rows as $row) {
            $columns = $row->getElementsByTagName('td');
            $headerColumns = $row->getElementsByTagName('th');
            $rowData = [];
    
            foreach ($headerColumns as $col) {
                $rowData[] = $col->textContent;
            }
    
            foreach ($columns as $col) {
                $rowData[] = $col->textContent;
            }
    
            if (!empty($rowData)) {
                $data[] = $rowData;
            }
        }
    
        // Print the scraped data for all presidents
        foreach ($data as $presidentData) {
            echo "President Data:\n";
            echo "Number: " . $presidentData[0] . "\n";
            echo "Name: " . $presidentData[2] . "\n";
            echo "Term: " . $presidentData[3] . "\n";
            echo "Party: " . $presidentData[5] . "\n";
            echo "Election: " . $presidentData[6] . "\n";
            echo "Vice President: " . $presidentData[7] . "\n\n";
        }
    } else {
        echo "Failed to retrieve the web page. Status code: " . curl_getinfo($ch, CURLINFO_HTTP_CODE) . "\n";
    }
    
    // Close cURL session
    curl_close($ch);
    ?>
    

    Challenges & Next Steps

    Some challenges you may run into:

  • Sites blocking scraping with captchas or IP bans
  • Complex HTML and CSS selectors needed to extract data
  • Saving data in useful formats like JSON or CSV

    This example just scratches the surface of web scraping in PHP. Some ideas for next steps:

  • Scrape data from other Wikipedia templates and tables
  • Build a script to aggregate facts or definitions for a game or research
  • Automatically save scraped data to a database or API
  • Write wrappers around cURL to simplify scraping flows

    More advanced implementations may also need to rotate the User-Agent string so the website can't tell that every request comes from the same browser.
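
    Rotating the User-Agent can be as simple as picking a random entry from a pool for each request. This is a minimal sketch: the pool below is illustrative (the first string is the one used in the full script, the other two are examples of the general format), and randomUserAgentHeader is a hypothetical helper name.

```php
<?php
// Illustrative pool; replace with strings copied from real, current browsers.
const USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0',
];

// Hypothetical helper: build a User-Agent header line from a random pool entry.
function randomUserAgentHeader(array $pool = USER_AGENTS): string
{
    return 'User-Agent: ' . $pool[array_rand($pool)];
}

// Build a fresh header array before each request.
$headers = [randomUserAgentHeader()];
echo $headers[0] . "\n";
```

    The resulting $headers array can be passed to CURLOPT_HTTPHEADER exactly like the fixed header in the full script.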

    Go a little further and you will find that the server can simply block your IP address, regardless of your other tricks. This is a bummer, and it is where many web crawling projects fail.

    Overcoming IP Blocks

    A private rotating proxy service like Proxies API can often make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with our running offer of 1,000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.

    Our rotating proxy server, Proxies API, provides a simple API designed to take care of IP blocking problems:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole service is accessible through a simple API call like the one below, from any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
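
    In PHP, the same request URL can be assembled with http_build_query, which percent-encodes the target URL so that any query string it carries does not get mixed up with the API's own parameters. This is a sketch under assumptions: the endpoint and the key and url parameter names come from the curl command above, that the service accepts a percent-encoded url value is our assumption, and buildProxyUrl is a hypothetical helper name.

```php
<?php
// Hypothetical helper: build the Proxies API request URL for a target page.
function buildProxyUrl(string $apiKey, string $targetUrl): string
{
    // http_build_query percent-encodes each value.
    $query = http_build_query(['key' => $apiKey, 'url' => $targetUrl], '', '&');

    return 'http://api.proxiesapi.com/?' . $query;
}

echo buildProxyUrl('API_KEY', 'https://example.com') . "\n";
// prints: http://api.proxiesapi.com/?key=API_KEY&url=https%3A%2F%2Fexample.com
```

    The returned string can be passed to curl_init in place of the Wikipedia URL; the rest of the script stays the same.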

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
