Wikipedia is a gold mine containing extensive information on just about every topic imaginable. As one of the most popular websites globally, it contains structured data that can be incredibly useful for analysis if extracted properly. This is where web scraping comes in.

In this article, we will walk through a complete example of how to scrape data from Wikipedia pages using R. We will extract information on all the Presidents of the United States and print it out.

This is the table we are talking about

Here's a peek at the key things you'll learn:

Making HTTP requests to fetch webpages

Parsing HTML content using rvest

Extracting tables and data using XPath queries

Handling errors and edge cases with care

Structuring and working with scraped data

And much more! By the end, you'll have hands-on experience with the end-to-end process.

The best way to learn web scraping is by getting our hands dirty with some code. So without further ado, let's get scraping!

Step 1: Import Libraries

We will leverage a few handy R libraries that make scraping very easy:

library(httr) # for sending HTTP requests to get webpages
library(rvest) # for parsing and extracting HTML content
library(xml2) # for wrangling XML/HTML

Let's go through the purpose of each:

httr: Provides useful functions for creating and sending HTTP requests to fetch resources like HTML pages. We don't want to deal with HTTP at a low level, so this abstracts it away.

rvest: Built on top of xml2 and httr, this provides tools for parsing, selecting, and extracting content from the HTML and XML documents we fetch. Our best friend for scraping!

xml2: Useful for wrangling and processing XML/HTML documents once extracted.

So in a nutshell:

httr fetches the HTML page for us

rvest allows us to extract the exact pieces we want

xml2 helps us work with the extracted content

This combination is very powerful!

Step 2: Define the URL

We need to pass a URL into httr to actually fetch the webpage. Let's define the URL of the Wikipedia page we want to scrape:

url <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

Specifically, we will be scraping the List of Presidents page which contains tables with plenty of structured data on all US presidents.
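
If you later want to scrape other Wikipedia pages, the URL can be built from the page title. The sketch below uses a hypothetical helper, wiki_url (not part of any package), and assumes the usual English Wikipedia pattern where spaces in the title become underscores:

```r
# Hypothetical helper: build an English Wikipedia URL from a page title.
# Assumes the title needs no escaping beyond spaces becoming underscores.
wiki_url <- function(title) {
  paste0("https://en.wikipedia.org/wiki/", gsub(" ", "_", title, fixed = TRUE))
}

url <- wiki_url("List of presidents of the United States")
print(url)
```

Titles containing other special characters would need proper URL encoding, which this sketch does not attempt.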

Step 3: Create a User-Agent Header

Websites can identify who is sending requests by checking the User-Agent - a text header that contains info about the software/client making the request.

To mimic a real browser:

headers <- c("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")

This makes the request look like it comes from a Chrome browser. Doing this helps avoid blocks, since Wikipedia may reject requests from clients it cannot identify!
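
As a small sketch of how the header is packaged for httr: add_headers() accepts a named character vector through its .headers argument, and the result can be inspected before any request is sent. The variable names ua_string and header_config are illustrative, and reading the $headers element relies on how httr currently stores the configuration.

```r
library(httr)

ua_string <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
headers <- c("User-Agent" = ua_string)

# Wrap the named vector in an httr request configuration
header_config <- add_headers(.headers = headers)

# Inspect what would be sent; no network access happens here
print(header_config$headers)
```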

Step 4: Send HTTP Request

Now we can fetch the page by sending a GET request:

response <- GET(url, add_headers(headers))

This will return an HTTP response object containing the status code, headers, and most importantly - the HTML content!
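
Network requests can fail for reasons unrelated to the page itself (no connection, slow server), so it may help to wrap the call. The following is a minimal sketch, not a definitive implementation: fetch_page and req_headers are hypothetical names, and the 10-second timeout is an arbitrary assumption.

```r
library(httr)

# Hypothetical wrapper around GET(): adds a timeout and turns
# connection failures into NULL instead of stopping the script.
fetch_page <- function(url, req_headers, seconds = 10) {
  tryCatch(
    GET(url, add_headers(.headers = req_headers), timeout(seconds)),
    error = function(e) {
      message("Request failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Only send the request in an interactive session, since it needs network access
if (interactive()) {
  req_headers <- c("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
  response <- fetch_page(
    "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States",
    req_headers
  )
  if (!is.null(response)) {
    print(status_code(response))                 # numeric HTTP status
    print(headers(response)[["content-type"]])   # media type reported by the server
  }
}
```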

Step 5: Check if Request Succeeded

It's good practice to ensure the request was successful before trying to extract data.

if (!http_error(response)) {
  # Success! Extract data
} else {
  cat("Failed to retrieve page. Status code:", status_code(response), "\n")
}

We use http_error() to check that the status code is below 400 (a normal page load returns 200). If the request failed, we print the status code with status_code().
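
To get a feel for how httr classifies status codes, you can pass plain numbers to http_status() without making a request. The sketch below also defines is_success, a hypothetical helper for a stricter 2xx-only check. Note that httr capitalizes the category names.

```r
library(httr)

# http_status() accepts a numeric status code as well as a response object,
# so the categories can be explored offline.
ok <- http_status(200)
not_found <- http_status(404)

print(ok$category)         # "Success"
print(not_found$category)  # "Client error"

# Hypothetical helper: TRUE only for 2xx status codes
is_success <- function(code) code >= 200 && code < 300

print(is_success(200))
print(is_success(404))
```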

Step 6: Parse the HTML

Since the request succeeded, we can parse the HTML using read_html():

webpage <- read_html(response)

This parses the raw HTML into an xml document that rvest can now query!
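
read_html() also accepts a string of HTML, which is handy for experimenting offline. In this sketch the markup is a tiny stand-in, not the real Wikipedia page:

```r
library(rvest)

# A tiny stand-in for the downloaded page (not the real Wikipedia markup)
html_input <- "<html><head><title>Demo page</title></head><body><p>Hello, scraper!</p></body></html>"

doc <- read_html(html_input)

# The parsed result is an xml2 document that rvest can query
print(class(doc))

page_title <- html_text(html_node(doc, "title"))
paragraph <- html_text(html_node(doc, "p"))
print(page_title)
print(paragraph)
```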

Step 7: Extract the Table

On the Wikipedia page, all president data sits within a table marked by class="wikitable sortable".

Inspecting the page

When we inspect the page, we can see that the table has two classes: wikitable and sortable.

We can use an XPath query to extract just this table node:

table <- html_node(webpage, xpath="//table[contains(@class, 'wikitable sortable')]")

This says: find the first table element whose class attribute contains the text "wikitable sortable".
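
The XPath query matches on a substring of the class attribute, so it depends on the two class names appearing in that exact order. As a sketch on stand-in markup (the ids and contents are invented for illustration), here is the same lookup with XPath and with a CSS selector, which matches both classes in any order:

```r
library(rvest)

# Stand-in markup with two tables; only the second carries both classes
page <- read_html('
  <table id="layout" class="infobox"><tr><td>ignored</td></tr></table>
  <table id="presidents" class="wikitable sortable"><tr><th>Name</th></tr></table>
')

# XPath: substring match on the class attribute (order and spacing matter)
by_xpath <- html_node(page, xpath = "//table[contains(@class, 'wikitable sortable')]")

# CSS: requires both classes, regardless of their order in the attribute
by_css <- html_node(page, "table.wikitable.sortable")

print(html_attr(by_xpath, "id"))
print(html_attr(by_css, "id"))
```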

Step 8: Extract Table Rows

We can now grab all the row nodes within this table with:

rows <- html_nodes(table, "tr")

This selects all table row elements we want to extract.

Note: The first row contains the headers, so we'll want to skip that when extracting the data.
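
To see what the row nodes look like, here is a sketch on a small stand-in table (the names in it are placeholders, not real data). It separates the header row from the data rows by hand:

```r
library(rvest)

tbl <- html_node(read_html('
  <table class="wikitable sortable">
    <tr><th>Number</th><th>Name</th></tr>
    <tr><td>1</td><td>Example Person A</td></tr>
    <tr><td>2</td><td>Example Person B</td></tr>
  </table>
'), "table")

rows <- html_nodes(tbl, "tr")
print(length(rows))  # a node set is counted with length(), not nrow()

# Header cells live in the first row; data cells in the remaining rows
header_cells <- html_text(html_nodes(rows[[1]], "th"), trim = TRUE)
data_rows <- rows[-1]
first_data <- html_text(html_nodes(data_rows[[1]], "td"), trim = TRUE)

print(header_cells)
print(first_data)
```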

Step 9: Convert Table to Data Frame

To automatically parse the table into a data frame:

data <- html_table(table, fill = TRUE)

This uses the html_table function to parse the table node and return a data frame. There is no need to slice off the first row: html_table uses the header row for the column names, so every remaining row is data.
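
A quick way to check how html_table() treats the header row is to run it on a small stand-in table (placeholder names, not real data):

```r
library(rvest)

tbl <- html_node(read_html('
  <table class="wikitable sortable">
    <tr><th>Number</th><th>Name</th></tr>
    <tr><td>1</td><td>Example Person A</td></tr>
    <tr><td>2</td><td>Example Person B</td></tr>
  </table>
'), "table")

df <- html_table(tbl)

# The <th> row becomes the column names, leaving only data rows
print(names(df))
print(nrow(df))
print(df)
```

Three tr elements go in and two data rows come out, because the header row is consumed as column names.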

Step 10: Print Extracted Data

Finally, we can iterate through the rows and print president data:

for (i in seq_len(nrow(data))) {
  cat("Number:", data[[i, 1]], "\n")
  cat("Name:", data[[i, 3]], "\n")
  cat("Term:", data[[i, 4]], "\n")
  # ...
}
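
Once the data is in a data frame, you are not limited to printing it row by row. The sketch below uses a placeholder data frame (invented values standing in for the scraped table) to show vectorized formatting and saving to CSV. The column names and the output file name are assumptions.

```r
# Placeholder data frame standing in for the scraped table
data <- data.frame(
  Number = c(1, 2),
  Name = c("Example Person A", "Example Person B"),
  Term = c("2001-2005", "2005-2009"),
  stringsAsFactors = FALSE
)

# Build one formatted line per row without an explicit loop
formatted <- sprintf("%d. %s (%s)", as.integer(data$Number), data$Name, data$Term)
cat(formatted, sep = "\n")

# Save the data frame for later analysis
out_file <- file.path(tempdir(), "presidents.csv")
write.csv(data, out_file, row.names = FALSE)
```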

And we have successfully scraped Wikipedia in R!

Full code:

library(httr)
library(rvest)
library(xml2)

# Define the URL of the Wikipedia page
url <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

# Define a user-agent header to simulate a browser request
headers <- c("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")

# Send an HTTP GET request to the URL with the headers
response <- GET(url, add_headers(headers))

# Check if the request was successful (no HTTP error status)
if (!http_error(response)) {
    # Parse the HTML content of the page
    webpage <- read_html(response)

    # Find the table with the specified class name
    table <- html_node(webpage, xpath = "//table[contains(@class, 'wikitable sortable')]")

    # Extract the table rows (useful for row-by-row inspection)
    rows <- html_nodes(table, "tr")

    # Convert the table to a data frame; the header row becomes the column names
    data <- html_table(table, fill = TRUE)

    # Print the scraped data for all presidents
    for (i in seq_len(nrow(data))) {
        cat("President Data:\n")
        cat("Number:", data[[i, 1]], "\n")
        cat("Name:", data[[i, 3]], "\n")
        cat("Term:", data[[i, 4]], "\n")
        cat("Party:", data[[i, 6]], "\n")
        cat("Election:", data[[i, 7]], "\n")
        cat("Vice President:", data[[i, 8]], "\n\n")
    }

} else {
    cat("Failed to retrieve the web page. Status code:", status_code(response), "\n")
}

Key Takeaways

Let's recap what we learned:

Import libraries like rvest and httr for easy scraping

Construct the URL pointing to the Wikipedia page

Create a user-agent header to mimic a browser

Send a GET request with the URL and headers

Check if the request succeeded before parsing

Use XPath queries to extract specific HTML nodes

Convert HTML tables into data frames for structured data

Print and process the scraped data as needed

In more advanced implementations, you will even need to rotate the User-Agent string so the website can't tell it's the same browser!
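
A minimal sketch of User-Agent rotation, assuming a small hand-maintained pool: the strings below are illustrative examples, and random_headers is a hypothetical helper rather than part of httr or rvest.

```r
# Illustrative pool of User-Agent strings; swap in ones that suit your project
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)

# Hypothetical helper: pick one User-Agent at random for each request
random_headers <- function(pool = user_agents) {
  c("User-Agent" = sample(pool, 1))
}

headers <- random_headers()
print(headers)
```

Calling random_headers() before each request gives a named vector that can be passed to add_headers(.headers = headers).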

As you get a little more advanced, you will realize that the server can simply block your IP, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high speed rotating proxies located all over the world,

with our automatic IP rotation,

with our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions),

and with our automatic CAPTCHA solving technology,

hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
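
For R users, the same call can be sketched with httr. This is an illustration under assumptions: API_KEY is a placeholder exactly as in the curl example, and the target URL is passed as a query parameter, which modify_url() URL-encodes.

```r
library(httr)

# Build the same request URL as the curl example; API_KEY is a placeholder
api_url <- modify_url(
  "http://api.proxiesapi.com/",
  query = list(key = "API_KEY", url = "https://example.com")
)
print(api_url)

# Only send the request interactively, since it needs network access and a real key
if (interactive()) {
  response <- GET(api_url)
  print(status_code(response))
}
```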

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
