Web Scraping Wikipedia Data in Go

Dec 6, 2023 · 6 min read

Web scraping is the process of automatically collecting structured data from websites. It can be useful for getting data off the web and into a format you can work with in applications. In this tutorial, we'll walk through scraping a Wikipedia table with Golang.

Specifically, we'll scrape the table listing all the presidents of the United States from this page:

https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States

This is the table we are talking about

Here's what our final program will do:

  • Send a request to fetch the Wikipedia page HTML
  • Parse the HTML with the goquery library to find the presidents table
  • Extract data like name, term, party for each president
  • Print out the structured data to the console

    This provides a blueprint for scraping and structuring data from any web page with tables.

    Let's get started!

    First we import the packages we'll need:

    import (
        "fmt"
        "log"
        "net/http"
        "github.com/PuerkitoBio/goquery"
    )
    
  • fmt provides printing functions
  • log helps with logging errors
  • net/http sends HTTP requests
  • goquery parses and queries HTML documents

    Next we'll define the URL of the Wikipedia page we want to scrape:

    url := "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
    

    Now we need to make the HTTP request. Some web servers flag automated clients by their missing or default headers (Go's HTTP client identifies itself as Go-http-client unless told otherwise), so we'll send a browser-like User-Agent header:

    headers := map[string]string{
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    }
    

    With this header set, the request looks to the server like it came from a regular browser.

    We'll create a new HTTP client and build a GET request with our headers:

    client := &http.Client{}
    
    req, err := http.NewRequest("GET", url, nil)
    
    if err != nil {
        log.Fatal(err)
    }
    
    for key, value := range headers {
        req.Header.Set(key, value)
    }
    

    Now we can send the request and get the response:

    response, err := client.Do(req)
    
    if err != nil {
        log.Fatal(err)
    }
    
    defer response.Body.Close()
    

    We close the response body when done to prevent resource leaks.
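To check that a custom header really reaches the server, the request can be pointed at a throwaway local server instead of Wikipedia. The sketch below uses the standard library's net/http/httptest for that. The helper name echoUserAgent and the 10-second timeout are assumptions of this sketch, not part of the tutorial's program.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"time"
)

// echoUserAgent (hypothetical helper) starts a local test server that replies
// with the User-Agent it received, sends one GET request carrying the given
// value, and returns what the server saw.
func echoUserAgent(userAgent string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Header.Get("User-Agent"))
	}))
	defer srv.Close()

	// The timeout is an addition of this sketch; a zero-value client has none.
	client := &http.Client{Timeout: 10 * time.Second}

	req, err := http.NewRequest(http.MethodGet, srv.URL, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", userAgent)

	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	seen, err := echoUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("server saw:", seen)
}
```

Setting a timeout is worth considering for real scrapes too, since a client without one can wait indefinitely on a stalled connection.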

    Let's check if the request succeeded with a 200 status code:

    if response.StatusCode == 200 {
        // Parsing logic here
    } else {
        fmt.Println("Failed to retrieve page")
    }
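An alternative to nesting the parsing logic inside the if branch is to return an error as soon as the status is not 200. The following is a minimal sketch of that style, exercised against a local test server. The names fetchPage and fetchFromStub are hypothetical and do not appear in the tutorial's program.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchPage returns the response body, or an error for any non-200 status.
func fetchPage(client *http.Client, url string) ([]byte, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// fetchFromStub (hypothetical helper) serves a fixed status code from a local
// test server and runs fetchPage against it.
func fetchFromStub(status int) ([]byte, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
		fmt.Fprint(w, "<html></html>")
	}))
	defer srv.Close()
	return fetchPage(srv.Client(), srv.URL)
}

func main() {
	if _, err := fetchFromStub(http.StatusNotFound); err != nil {
		fmt.Println("error:", err)
	}
	body, err := fetchFromStub(http.StatusOK)
	fmt.Println(len(body), err)
}
```

Returning early keeps the parsing code at one indentation level and lets the caller decide whether a failed fetch should stop the program.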
    

    If successful, we can parse the HTML using goquery, which allows jQuery-style element selection.

    First we load the response HTML into a goquery document:

    doc, err := goquery.NewDocumentFromReader(response.Body)
    if err != nil {
        log.Fatal(err)
    }
    

    We can now search for elements by class, id, tag name etc. Let's find the presidents table:

    Inspecting the page

    Inspecting the page in the browser's developer tools shows that the table carries two classes, wikitable and sortable, so we can select it with a combined class selector:

    table := doc.Find("table.wikitable.sortable")
    

    We initialize a slice to store our scraped data:

    data := [][]string{}
    

    Then we iterate through the table rows, skipping the header:

    table.Find("tr").Each(func(rowIdx int, row *goquery.Selection) {
    
        if rowIdx == 0 {
            return
        }
    
        // extract row data here
    
    })
    

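    Inside the callback, we extract and store each cell's text. We select both th and td because some cells in a row may be marked up as row headers (th) rather than data cells (td); selecting both keeps the column positions stable:

    rowData := []string{}
    
    row.Find("th,td").Each(func(colIdx int, col *goquery.Selection) {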
        rowData = append(rowData, col.Text())
    })
    
    data = append(data, rowData)
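The text returned for a cell is raw: on Wikipedia it often includes footnote markers such as [a] and stray newlines. Below is a small sketch of a cleanup step using only the standard library. The helper name cleanCell and the sample strings are illustrative assumptions, not output captured from the page.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// footnoteRe matches bracketed markers such as [a] or [12].
var footnoteRe = regexp.MustCompile(`\[[^\]]*\]`)

// cleanCell strips footnote markers and collapses runs of whitespace
// (including newlines) into single spaces.
func cleanCell(s string) string {
	s = footnoteRe.ReplaceAllString(s, "")
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	// Sample inputs made up for illustration.
	fmt.Printf("%q\n", cleanCell("  George Washington[a]\n"))
	fmt.Printf("%q\n", cleanCell("April 30, 1789\n–\nMarch 4, 1797"))
}
```

In the scraping loop, this would be applied as cleanCell(col.Text()) before appending to rowData.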
    

    Finally, we can print out the scraped president data:

    for _, presidentData := range data {
        // Skip rows that have too few cells to index safely
        if len(presidentData) < 4 {
            continue
        }

        fmt.Println("Name:", presidentData[2])
        fmt.Println("Term:", presidentData[3])
        // etc
    }
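Bare indexes like 2 and 3 are easy to get wrong when the table layout changes. One option is to name the column positions and convert each row into a small struct, rejecting rows that are too short. This is a sketch under the assumption that the name and term sit at the indexes used above. The President type, the constants, and toPresident are hypothetical names.

```go
package main

import "fmt"

// President holds the two fields printed in the loop above.
type President struct {
	Name string
	Term string
}

// Column positions follow the indexes used in the tutorial; they depend on
// the table's current layout and may need adjusting.
const (
	colName = 2
	colTerm = 3
)

// toPresident converts one scraped row, reporting false when the row has too
// few cells to contain both columns.
func toPresident(row []string) (President, bool) {
	if len(row) <= colTerm {
		return President{}, false
	}
	return President{Name: row[colName], Term: row[colTerm]}, true
}

func main() {
	// Sample rows made up for illustration.
	rows := [][]string{
		{"1", "", "George Washington", "1789-1797"},
		{"short row"},
	}
	for _, row := range rows {
		if p, ok := toPresident(row); ok {
			fmt.Printf("%s (%s)\n", p.Name, p.Term)
		}
	}
}
```

The boolean return keeps the decision about malformed rows (skip, log, or fail) with the caller.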
    

    And we have a working Wikipedia scraper in Go!

    Some ideas for next steps:

  • Save scraped data to a CSV
  • Improve selector specificity
  • Add caching for faster repeat scrapes
  • Scrape multiple pages concurrently with goroutines
  • Use proxies to avoid IP bans
  • Schedule periodic scrapes
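As an illustration of the first idea in the list above, the scraped rows can be written out with the standard library's encoding/csv package. This is a minimal sketch: the helper name toCSV, the header names, and the sample row are assumptions made for the example.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// toCSV renders a header and rows as CSV text. Passing an *os.File to
// csv.NewWriter instead of a strings.Builder would write the same output
// to disk.
func toCSV(header []string, rows [][]string) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	if err := w.Write(header); err != nil {
		return "", err
	}
	// WriteAll writes every record and then flushes the writer.
	if err := w.WriteAll(rows); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Sample row made up for illustration.
	out, err := toCSV(
		[]string{"Number", "Name", "Term start"},
		[][]string{{"1", "George Washington", "April 30, 1789"}},
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(out)
}
```

The csv writer quotes fields that contain commas, which matters here because dates in the term column include them.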

    Here is the complete program:

    package main
    
    import (
        "fmt"
        "log"
        "net/http"
        "github.com/PuerkitoBio/goquery"
    )
    
    func main() {
        // Define the URL of the Wikipedia page
        url := "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
    
        // Define a user-agent header to simulate a browser request
        headers := map[string]string{
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        }
    
        // Create an HTTP client with custom headers
        client := &http.Client{}
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            log.Fatal(err)
        }
        for key, value := range headers {
            req.Header.Set(key, value)
        }
    
        // Send an HTTP GET request to the URL with the headers
        response, err := client.Do(req)
        if err != nil {
            log.Fatal(err)
        }
        defer response.Body.Close()
    
        // Check if the request was successful (status code 200)
        if response.StatusCode == 200 {
            // Parse the HTML content of the page using goquery
            doc, err := goquery.NewDocumentFromReader(response.Body)
            if err != nil {
                log.Fatal(err)
            }
    
            // Find the table with the specified class name
            table := doc.Find("table.wikitable.sortable")
    
            // Initialize empty slice to store the table data
            data := [][]string{}
    
            // Iterate through the rows of the table
            table.Find("tr").Each(func(rowIdx int, row *goquery.Selection) {
                // Skip the header row
                if rowIdx == 0 {
                    return
                }
    
                // Extract data from each column and append it to the data slice
                rowData := []string{}
                row.Find("th,td").Each(func(colIdx int, col *goquery.Selection) {
                    rowData = append(rowData, col.Text())
                })
                data = append(data, rowData)
            })
    
            // Print the scraped data for all presidents
            for _, presidentData := range data {
                // Skip rows with too few cells to avoid an index-out-of-range panic
                if len(presidentData) < 8 {
                    continue
                }
                fmt.Println("President Data:")
                fmt.Println("Number:", presidentData[0])
                fmt.Println("Name:", presidentData[2])
                fmt.Println("Term:", presidentData[3])
                fmt.Println("Party:", presidentData[5])
                fmt.Println("Election:", presidentData[6])
                fmt.Println("Vice President:", presidentData[7])
                fmt.Println()
            }
        } else {
            fmt.Println("Failed to retrieve the web page. Status code:", response.StatusCode)
        }
    }

    In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell that the requests come from the same browser.
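A simple way to rotate the header is to cycle through a fixed list of strings, one per request. The sketch below shows a round-robin rotator that is safe to share between goroutines. The type uaRotator and the placeholder agent strings are hypothetical; a real list would hold genuine browser User-Agent values.

```go
package main

import (
	"fmt"
	"sync"
)

// uaRotator hands out User-Agent strings in round-robin order.
type uaRotator struct {
	mu     sync.Mutex
	agents []string
	next   int
}

func newUARotator(agents ...string) *uaRotator {
	return &uaRotator{agents: agents}
}

// Next returns the next User-Agent in the cycle, or "" if the list is empty.
func (r *uaRotator) Next() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.agents) == 0 {
		return ""
	}
	ua := r.agents[r.next]
	r.next = (r.next + 1) % len(r.agents)
	return ua
}

func main() {
	// Placeholder values for illustration only.
	rot := newUARotator("example-agent-a/1.0", "example-agent-b/1.0")
	for i := 0; i < 3; i++ {
		fmt.Println(rot.Next())
	}
}
```

In the scraper, the fixed header would be replaced with req.Header.Set("User-Agent", rot.Next()) before each request.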

    At a more advanced level, you will find that the server can simply block your IP address, regardless of your other tricks. This is where most web crawling projects fail.

    Overcoming IP Blocks

    A private rotating proxy service such as Proxies API can often make the difference between a scraping project that gets the job done consistently and one that never really works.

    With 1,000 free API calls on offer, you have almost nothing to lose by trying our rotating proxy and comparing notes. Integration takes a single line, so it is hardly disruptive.

    Our rotating proxy server, Proxies API, provides a simple API built to deal with IP blocking. It offers:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA solving

    Hundreds of our customers have solved the headache of IP blocks with this simple API.

    The service is called through a single HTTP endpoint, as shown below, so it can be used from any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
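In Go, the same endpoint can be assembled with net/url so that the target address is escaped properly inside the query string. This is a sketch only: the helper name proxiedURL is hypothetical, and it assumes the service accepts a percent-encoded url parameter, which is the standard form for query strings but is not confirmed by the curl example.

```go
package main

import (
	"fmt"
	"net/url"
)

// proxiedURL builds the endpoint from the curl example for a given key and
// target page. Values.Encode escapes the target and sorts parameters by name.
func proxiedURL(apiKey, target string) string {
	q := url.Values{}
	q.Set("key", apiKey)
	q.Set("url", target)
	return "http://api.proxiesapi.com/?" + q.Encode()
}

func main() {
	// API_KEY is the same placeholder used in the curl example.
	fmt.Println(proxiedURL("API_KEY", "https://example.com"))
}
```

The resulting string can be passed to http.NewRequest in place of the Wikipedia URL.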

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
