Web Scraping Wikipedia Data in Go

Web scraping is the process of automatically collecting structured data from websites. It can be useful for getting data off the web and into a format you can work with in applications. In this tutorial, we'll walk through scraping a Wikipedia table with Golang.

Specifically, we'll scrape the table listing all the presidents of the United States from this page:

https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States

This is the table we are talking about

Here's what our final program will do:

Send a request to fetch the Wikipedia page HTML

Parse the HTML using Go's goquery library to find the presidents table

Extract data like name, term, party for each president

Print out the structured data to the console

This provides a blueprint for scraping and structuring data from any web page with tables.

Let's get started!

First we import the packages we'll need:

import (
    "fmt"
    "log"
    "net/http"
    "github.com/PuerkitoBio/goquery"
)

fmt provides printing functions

log helps with logging errors

net/http sends HTTP requests

goquery parses and queries HTML documents

Next we'll define the URL of the Wikipedia page we want to scrape:

url := "<https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States>"

Now we need to make the HTTP request. Web servers can identify automated scrapers by the lack of headers. So we'll simulate a real browser's headers:

headers := map[string]string{
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
}

This will fool the server into thinking a real browser is making the request.

We'll create a new HTTP client and build a GET request with our headers:

client := &http.Client{}

req, err := http.NewRequest("GET", url, nil)

if err != nil {
    log.Fatal(err)
}

for key, value := range headers {
    req.Header.Set(key, value)
}

Now we can send the request and get the response:

response, err := client.Do(req)

if err != nil {
    log.Fatal(err)
}

defer response.Body.Close()

We close the response body when done to prevent resource leaks.

Let's check if the request succeeded with a 200 status code:

if response.StatusCode == 200 {
    // Parsing logic here
} else {
    fmt.Println("Failed to retrieve page")
}

If successful, we can parse the HTML using goquery, which allows jQuery-style element selection.

First we load the response HTML into a goquery document:

doc, err := goquery.NewDocumentFromReader(response.Body)
if err != nil {
    log.Fatal(err)
}

We can now search for elements by class, id, tag name etc. Let's find the presidents table:

Inspecting the page

When we inspect the page we can see that the table has a class called wikitable and sortable

table := doc.Find("table.wikitable.sortable")

We initialize a slice to store our scraped data:

data := [][]string{}

Then we iterate through the table rows, skipping the header:

table.Find("tr").Each(func(rowIdx int, row *goquery.Selection) {

    if rowIdx == 0 {
        return
    }

    // extract row data here

})

Inside this, we can extract and store each cell's text:

rowData := []string{}

row.Find("td").Each(func(colIdx int, col *goquery.Selection) {
    rowData = append(rowData, col.Text())
})

data = append(data, rowData)

Finally, we can print out the scraped president data:

for _, presidentData := range data {

    fmt.Println("Name: ", presidentData[2])
    fmt.Println("Term: ", presidentData[3])
    // etc

}

And we have a working Wikipedia scraper in Go!

Some ideas for next steps:

Save scraped data to a CSV

Improve selector specificity

Add caching for faster repeat scrapes

Scrape concurrent pages with goroutines

Use proxies to avoid IP bans

Schedule periodic scrapes

package main

import (
    "fmt"
    "log"
    "net/http"
    "github.com/PuerkitoBio/goquery"
)

func main() {
    // Define the URL of the Wikipedia page
    url := "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

    // Define a user-agent header to simulate a browser request
    headers := map[string]string{
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    }

    // Create an HTTP client with custom headers
    client := &http.Client{}
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatal(err)
    }
    for key, value := range headers {
        req.Header.Set(key, value)
    }

    // Send an HTTP GET request to the URL with the headers
    response, err := client.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer response.Body.Close()

    // Check if the request was successful (status code 200)
    if response.StatusCode == 200 {
        // Parse the HTML content of the page using goquery
        doc, err := goquery.NewDocumentFromReader(response.Body)
        if err != nil {
            log.Fatal(err)
        }

        // Find the table with the specified class name
        table := doc.Find("table.wikitable.sortable")

        // Initialize empty slice to store the table data
        data := [][]string{}

        // Iterate through the rows of the table
        table.Find("tr").Each(func(rowIdx int, row *goquery.Selection) {
            // Skip the header row
            if rowIdx == 0 {
                return
            }

            // Extract data from each column and append it to the data slice
            rowData := []string{}
            row.Find("th,td").Each(func(colIdx int, col *goquery.Selection) {
                rowData = append(rowData, col.Text())
            })
            data = append(data, rowData)
        })

        // Print the scraped data for all presidents
        for _, presidentData := range data {
            fmt.Println("President Data:")
            fmt.Println("Number:", presidentData[0])
            fmt.Println("Name:", presidentData[2])
            fmt.Println("Term:", presidentData[3])
            fmt.Println("Party:", presidentData[5])
            fmt.Println("Election:", presidentData[6])
            fmt.Println("Vice President:", presidentData[7])
            fmt.Println()
        }
    } else {
        fmt.Println("Failed to retrieve the web page. Status code:", response.StatusCode)
    }
}

In more advanced implementations you will need to even rotate the User-Agent string so the website cant tell its the same browser!

If we get a little bit more advanced, you will realize that the server can simply block your IP ignoring all your other tricks. This is a bummer and this is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can most of the time make the difference between a successful and headache-free web scraping project which gets the job done consistently and one that never really works.

Plus with the 1000 free API calls running an offer, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration to its hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high speed rotating proxies located all over the world,

With our automatic IP rotation

With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)

With our automatic CAPTCHA solving technology,

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.

Web Scraping Wikipedia Data in Go

Browse by tags:

Browse by language:

The easiest way to do Web Scraping

Web Scraping Wikipedia Data in Go

The easiest way to do Web Scraping

Don't leave just yet!