Scraping Wikipedia Tables with R

Dec 6, 2023 · 7 min read

Wikipedia is a gold mine containing extensive information on just about every topic imaginable. As one of the most popular websites globally, it contains structured data that can be incredibly useful for analysis if extracted properly. This is where web scraping comes in.

In this article, we will walk through a complete example of how to scrape data from Wikipedia pages using R. We will extract information on all the Presidents of the United States and print it out.

Here's a peek at the key things you'll learn:

  • Making HTTP requests to fetch webpages
  • Parsing HTML content using rvest
  • Extracting tables and data using XPath queries
  • Handling errors and edge cases with care
  • Structuring and working with scraped data
  • And much more!

    By the end, you'll have hands-on experience with the end-to-end process.

    The best way to learn web scraping is by getting our hands dirty with some code. So without further ado, let's get scraping!

    Step 1: Import Libraries

    We will leverage a few handy R libraries that make scraping very easy:

    library(httr) # for sending HTTP requests to get webpages
    library(rvest) # for parsing and extracting HTML content
    library(xml2) # for wrangling XML/HTML
    

    Let's go through the purpose of each:

    httr: Provides useful functions for creating and sending HTTP requests to fetch resources like HTML pages. We don't want to deal with HTTP at a low level, so this abstracts it away.

    rvest: Built on top of xml2 and httr, this provides very useful tools for parsing, selecting, and extracting content from the HTML and XML documents we fetch. Our best friend for scraping!

    xml2: Useful for wrangling and processing XML/HTML documents once extracted.

    So in a nutshell:

  • httr fetches the HTML page for us
  • rvest allows us to extract the exact pieces we want
  • xml2 helps us work with the extracted content

    This combination is very powerful!

    Step 2: Define the URL

    We need to pass a URL into httr to actually fetch the webpage. Let's define the URL of the Wikipedia page we want to scrape:

    url <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
    

    Specifically, we will be scraping the List of Presidents page which contains tables with plenty of structured data on all US presidents.

    Step 3: Create a User-Agent Header

    Websites can identify who is sending requests by checking the User-Agent - a text header that contains info about the software/client making the request.

    To mimic a real browser:

    headers <- c("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
    

    This makes the request look as if it comes from a Chrome browser. Some servers reject requests that arrive with a missing or default library User-Agent, so setting one explicitly lowers the chance of being blocked.
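httr also ships a user_agent() helper, which some readers may find tidier than building the header vector by hand. Here is a minimal sketch; the agent string and contact address are placeholders invented for illustration, not values from this article:

```r
library(httr)

# Placeholder identity string (an assumption for this sketch); swap in
# whatever User-Agent you actually want to send.
ua <- user_agent("my-wiki-scraper/0.1 (contact: you@example.com)")

# user_agent() returns a request configuration object. It can be passed
# straight to GET(), e.g. GET(url, ua), in place of add_headers().
print(class(ua))
```

Either approach ends up setting the same header, so this is purely a matter of taste.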

    Step 4: Send HTTP Request

    Now we can fetch the page by sending a GET request:

    response <- GET(url, add_headers(headers))
    

    This will return an HTTP response object containing the status code, headers, and most importantly - the HTML content!
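Network calls can fail before any response comes back (timeouts, DNS problems, refused connections). One way to guard against that is sketched below. The helper name fetch_page and its arguments are hypothetical, made up for this example; GET(), user_agent() and timeout() are regular httr functions:

```r
library(httr)

# Hypothetical helper: fetch a URL with a timeout and return NULL instead of
# stopping the script when the request fails at the network level.
fetch_page <- function(url, ua_string, seconds = 10) {
  tryCatch(
    GET(url, user_agent(ua_string), timeout(seconds)),
    error = function(e) {
      message("Request failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Possible usage (not run here, since it needs network access):
# response <- fetch_page(url, "Mozilla/5.0 ...")
# if (is.null(response)) stop("Could not reach the page")
```

Note that this only catches transport errors. A 404 or 500 still comes back as a normal response object, so the status check in the next step is still needed.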

    Step 5: Check if Request Succeeded

    It's good practice to ensure the request was successful before trying to extract data.

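    if (status_code(response) == 200) {
    
      # Success! Extract data
    
    } else {
    
      cat("Failed to retrieve page. Status code:", status_code(response), "\n")
    
    }
    

    We simply check whether the status code is 200 (OK). If not, we print the status code that came back. Note that http_status() does not return a code field, and its category value is capitalized ("Success"), so comparing it against "success" would never match.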
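To get a feel for what httr reports for different status codes, http_status() can be called on a bare number, with no network access needed. A small sketch:

```r
library(httr)

# http_status() accepts a plain status code as well as a response object.
ok <- http_status(200)
missing <- http_status(404)

# Each result is a list with category, reason and message fields.
cat(ok$category, "-", ok$reason, "\n")
cat(missing$category, "-", missing$reason, "\n")
```

The category strings are capitalized, which is worth remembering if you compare against them in an if statement.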

    Step 6: Parse the HTML

    Since the request succeeded, we can parse the HTML using read_html():

    webpage <- read_html(response)
    

    This parses the raw HTML into an xml document that rvest can now query!
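read_html() also accepts a literal HTML string, which is handy for experimenting offline. The sketch below uses a tiny hand-written page (not Wikipedia's real markup) to show what comes back:

```r
library(rvest)

# A tiny hand-written page standing in for the downloaded HTML.
html <- "<html><body><h1>Presidents</h1><p>Sample page</p></body></html>"
doc <- read_html(html)

# The parsed result is an xml_document that rvest functions can query.
print(class(doc))

# Grab the text of the first <h1> element.
heading <- html_text(html_node(doc, "h1"))
print(heading)
```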

    Step 7: Extract the Table

    On the Wikipedia page, all president data sits within a table marked by class="wikitable sortable".

    Inspecting the page

    When we inspect the page, we can see that the table has the classes wikitable and sortable.

    We can use an XPath query to extract just this table node:

    table <- html_node(webpage, xpath="//table[contains(@class, 'wikitable sortable')]")
    

    This says - find the first table tag whose class attribute contains the text "wikitable sortable".
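The same selection can be tried on a small, made-up snippet. This sketch assumes nothing about the real page beyond the class names mentioned above, and it also shows an equivalent CSS selector for comparison:

```r
library(rvest)

# Two sample tables; only the second carries the classes we care about.
html <- '
<table class="infobox"><tr><td>ignore me</td></tr></table>
<table class="wikitable sortable"><tr><th>No.</th></tr><tr><td>1</td></tr></table>
'
doc <- read_html(html)

# XPath: class attribute must contain the exact substring "wikitable sortable".
by_xpath <- html_node(doc, xpath = "//table[contains(@class, 'wikitable sortable')]")

# CSS: element must carry both classes.
by_css <- html_node(doc, "table.wikitable.sortable")

print(html_attr(by_xpath, "class"))
print(html_attr(by_css, "class"))
```

The CSS form matches the two classes in any order, while the XPath contains() test needs that exact substring, so the CSS version may be a little more forgiving if the page's markup changes.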

    Step 8: Extract Table Rows

    We can now grab all the row nodes within this table with:

    rows <- html_nodes(table, "tr")
    

    This selects all table row elements we want to extract.

    Note: The first row contains the headers, so keep that in mind whenever you work with the row nodes directly.
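If you ever want to walk the rows by hand instead of converting the whole table, something along these lines should work. The sample table is invented for illustration:

```r
library(rvest)

html <- '<table>
<tr><th>No.</th><th>Name</th></tr>
<tr><td>1</td><td>George Washington</td></tr>
<tr><td>2</td><td>John Adams</td></tr>
</table>'
tbl <- html_node(read_html(html), "table")

# One node per <tr>: the header row plus the data rows.
rows <- html_nodes(tbl, "tr")
print(length(rows))

# Pull the cell text out of each row, treating <th> and <td> alike.
cells <- lapply(rows, function(row) {
  html_text(html_nodes(row, "th, td"), trim = TRUE)
})
print(cells[[2]])
```

Here cells[[1]] holds the header labels and the remaining elements hold one character vector per data row.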

    Step 9: Convert Table to Data Frame

    To automatically parse the table into a data frame:

    data <- html_table(table, fill = TRUE)
    

    This uses the html_table function to parse the table node and return a data frame. We pass the table node itself, since there is no nested table inside it to select. html_table uses the header row for the column names, so there is no need to slice it off ourselves.
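To see how html_table() treats the header row, here is a small self-contained sketch on a made-up table with the same class names:

```r
library(rvest)

html <- '<table class="wikitable sortable">
<tr><th>No.</th><th>Name</th></tr>
<tr><td>1</td><td>George Washington</td></tr>
<tr><td>2</td><td>John Adams</td></tr>
</table>'
tbl <- html_node(read_html(html), "table")

df <- html_table(tbl)

print(names(df))  # the header row became the column names
print(nrow(df))   # only the data rows remain
```

Because the header is consumed as column names, the resulting data frame has one row per data row in the HTML.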

    Step 10: Print Extracted Data

    Finally, we can iterate through the rows and print president data:

    for (i in seq_len(nrow(data))) {
    
      cat("Number:", data[[i, 1]], "\n")
      cat("Name:", data[[i, 3]], "\n")
      cat("Term:", data[[i, 4]], "\n")
      # ...
    
    }
    

    And we have successfully scraped Wikipedia in R!
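Text scraped from Wikipedia cells often carries footnote markers such as [a] or [12]. A minimal sketch of stripping them with base R follows; the sample values are made up for illustration and are not real scraped output:

```r
# Sample values standing in for scraped cells.
scraped <- data.frame(
  Name = c("George Washington[a]", "John Adams[12]"),
  stringsAsFactors = FALSE
)

# Drop any bracketed footnote marker, then trim stray whitespace.
scraped$Name <- trimws(gsub("\\[[^]]*\\]", "", scraped$Name))
print(scraped$Name)
```

The same gsub() call could be applied to any character column of the scraped data frame before further analysis.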

    Full code:

    library(httr)
    library(rvest)
    library(xml2)
    
    # Define the URL of the Wikipedia page
    url <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
    
    # Define a user-agent header to simulate a browser request
    headers <- c("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
    
    # Send an HTTP GET request to the URL with the headers
    response <- GET(url, add_headers(headers))
    
    # Check if the request was successful (status code 200)
    if (status_code(response) == 200) {
        # Parse the HTML content of the page
        webpage <- read_html(response)
    
        # Find the table with the specified class name
        table <- html_node(webpage, xpath = "//table[contains(@class, 'wikitable sortable')]")
    
        # Convert the table to a data frame; html_table() uses the header row for column names
        data <- html_table(table, fill = TRUE)
    
        # Print the scraped data for all presidents
        # (column positions depend on the page's current table layout)
        for (i in seq_len(nrow(data))) {
            cat("President Data:\n")
            cat("Number:", data[[i, 1]], "\n")
            cat("Name:", data[[i, 3]], "\n")
            cat("Term:", data[[i, 4]], "\n")
            cat("Party:", data[[i, 6]], "\n")
            cat("Election:", data[[i, 7]], "\n")
            cat("Vice President:", data[[i, 8]], "\n\n")
        }
    
    } else {
        cat("Failed to retrieve the web page. Status code:", status_code(response), "\n")
    }

    Key Takeaways

    Let's recap what we learned:

  • Import libraries like rvest and httr for easy scraping
  • Construct the URL pointing to the Wikipedia page
  • Create a user-agent header to mimic a browser
  • Send a GET request with the URL and headers
  • Check if the request succeeded before parsing
  • Use XPath queries to extract specific HTML nodes
  • Convert HTML tables into data frames for structured data
  • Print and process the scraped data as needed

    In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same browser!
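Rotating the User-Agent can be as simple as picking a string at random for each request. A rough sketch; the pool below is illustrative only (the first entry is the string used earlier in this article, the others are sample values), and pick_user_agent is a hypothetical helper name:

```r
# Illustrative pool of User-Agent strings; replace with current, real
# browser strings for actual use.
ua_pool <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)

# Hypothetical helper: choose one string at random from the pool.
pick_user_agent <- function(pool = ua_pool) sample(pool, 1)

# Build the header vector the same way as in Step 3.
headers <- c("User-Agent" = pick_user_agent())
```

Calling pick_user_agent() before each request gives every request a randomly chosen string from the pool.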

    Going one step further, you will find that a server can simply block your IP address, regardless of all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly, with:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA-solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
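In R, the same call could be put together roughly as follows. This is only a sketch: API_KEY is a placeholder, the endpoint and parameter names are copied from the curl example above, and it assumes the service accepts a percent-encoded url parameter, which modify_url() produces:

```r
library(httr)

# Build the request URL from the endpoint and query parameters shown in the
# curl example. API_KEY is a placeholder, not a working key.
proxied_url <- modify_url(
  "http://api.proxiesapi.com/",
  query = list(key = "API_KEY", url = "https://example.com")
)
print(proxied_url)

# The result can then be fetched like any other page (not run here):
# response <- GET(proxied_url)
```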

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
