Scraping Hacker News with Go

Jan 21, 2024 · 8 min read

In this article, we will be walking through Go code that scrapes articles from the Hacker News homepage and extracts key data points about each article. The goals are to:

  1. Explain how the code works from start to finish, focusing on areas that beginners typically find confusing
  2. Describe in detail how each data field is extracted from the HTML using Go's goquery library
  3. Provide the full code at the end as a runnable reference

This will teach you core concepts around web scraping with Go that you can apply to build scrapers for other sites.

Prerequisites

To run this code, you need:

  • Go installed and set up properly on your system
  • The goquery library, installed with go get github.com/PuerkitoBio/goquery

With those in place, you can jump right into the explanations and example code below.

    Step-by-Step Walkthrough

    Below I will walk through exactly what this code does, section by section. The focus is on specifics around interacting with HTML and extracting data, since that tends to trip up Go beginners.

    Imports

    We start by importing the necessary packages:

    import (
      "fmt"
      "net/http"
      "strings"

      "github.com/PuerkitoBio/goquery"
    )
    
  • "fmt" is for printing output
  • "net/http" is used to make the HTTP request
  • "strings" is used for cleaning up some extracted text
  • "github.com/PuerkitoBio/goquery" is the goquery library for parsing and querying HTML

    Making the HTTP Request

    Next we define the URL we want to scrape and make the HTTP GET request:

    // Define the URL
    url := "https://news.ycombinator.com/"
    
    // Send HTTP GET request
    res, err := http.Get(url)
    
    // Handle errors
    if err != nil {
      fmt.Println("Failed to retrieve page:", err)
      return
    }
    
    defer res.Body.Close()
    

    We use Go's built-in http.Get() to send the request. This returns an http.Response that we can check for status codes and errors before proceeding.

    We also make sure to defer closing the response body to avoid resource leaks.

    Parsing the HTML with goquery

    Once we have the page content, we can parse it using goquery, a Go library that brings jQuery-style selectors to HTML parsing:

    if res.StatusCode == 200 {
    
      // Parse HTML using goquery
      doc, err := goquery.NewDocumentFromReader(res.Body)
    
      if err != nil {
        fmt.Println("Failed to parse HTML:", err)
        return
      }
    
      // Find all rows
      rows := doc.Find("tr")
    
      // ...rest of scraping logic
    }
    

    Assuming a 200 OK status code, we load the HTML into a goquery Document. This lets us query the parsed HTML with jQuery-style CSS selectors.

    Inspecting the page

    If you inspect the page, you can see that each item is housed in a tr element with the class athing.

    For example, doc.Find("tr") grabs all table row elements from the page.

    Extracting Article Data

    Now this is where beginners tend to get lost - how to actually extract each data field from the HTML using goquery.

    I will walk through this section slowly, explaining how every piece of data is selected:

    // Initialize variables
    var currentArticle *goquery.Selection
    var currentRowType string
    
    // Iterate through rows
    rows.Each(func(i int, row *goquery.Selection) {
    
      // Get the class attribute to identify the row type
      class, _ := row.Attr("class")
    
      // Logic for identifying article vs detail rows
      if class == "athing" {
    
        currentArticle = row
        currentRowType = "article"
    
      } else if currentRowType == "article" {
    
        // This is a details row
    
        // Extract data only if currentArticle is set
        if currentArticle != nil {
    
          // -- Article title
          titleElem := currentArticle.Find("span.title")
          articleTitle := titleElem.Text()
    
          // -- Article URL
          articleURL, _ := titleElem.Find("a").Attr("href")
    
          // -- Points, author, comments
          subtext := row.Find("td.subtext")
          points := subtext.Find("span.score").Text()
          author := subtext.Find("a.hnuser").Text()
    
          commentsElem := subtext.Find("a:contains('comments')")
          comments := strings.TrimSpace(commentsElem.Text())
    
          // -- Timestamp
          timestamp := subtext.Find("span.age").AttrOr("title", "")
    
          // Print everything
          fmt.Println("Title:", articleTitle)
          fmt.Println("URL:", articleURL)
          fmt.Println("Points:", points)
          fmt.Println("Author:", author)
          fmt.Println("Timestamp:", timestamp)
          fmt.Println("Comments:", comments)
          fmt.Println(strings.Repeat("-", 50))
    
        }
    
        // Reset so the next athing row is handled fresh
        currentArticle = nil
        currentRowType = ""
    
      }
    
    })
    

    Let's break this down field-by-field:

    Article Title

    titleElem := currentArticle.Find("span.title")
    articleTitle := titleElem.Text()
    

    Article titles live in a span element with the class title inside the athing row.

    We use .Find() to grab that element from currentArticle, then .Text() to extract the text content.

    Article URL

    articleURL, _ := titleElem.Find("a").Attr("href")
    

    The title text itself links to the article. So we find the anchor inside titleElem, and grab the href attribute which contains the URL.

    Points

    Points are inside a span element with the class score in the subtext row. We grab its text and print it.

    Author

    The author anchor has a class hnuser. We take the text of this element.

    Comments

    Interesting selector here! We look for any anchor that contains the text comments. This avoids having to hardcode a class. We then clean up whitespace and print.

    Timestamp

    The timestamp is stored in a title attribute rather than text. We use .AttrOr() to attempt to extract this, defaulting to empty string if missing.

    And that covers all the data extraction logic!

    The key ideas are:

  • Use row variables to iterate through articles
  • Identify article vs detail rows through their classes
  • Use Find() and Text() to extract text elements
  • Use Attr() and AttrOr() to extract attributes
  • Combine classes, element types, and text content to build unique selectors

    With these building blocks, you can query elements in powerful ways.

    Full Code

    For reference, here is the complete code:

    package main
    
    import (
        "fmt"
        "net/http"
        "strings"
    
        "github.com/PuerkitoBio/goquery"
    )
    
    func main() {
        // Define the URL of the Hacker News homepage
        url := "https://news.ycombinator.com/"
    
        // Send a GET request to the URL
        res, err := http.Get(url)
        if err != nil {
            fmt.Println("Failed to retrieve the page:", err)
            return
        }
        defer res.Body.Close()
    
        // Check if the request was successful (status code 200)
        if res.StatusCode == 200 {
            // Parse the HTML content of the page using goquery
            doc, err := goquery.NewDocumentFromReader(res.Body)
            if err != nil {
                fmt.Println("Failed to parse HTML:", err)
                return
            }
    
            // Find all rows in the table
            rows := doc.Find("tr")
    
            // Initialize variables to keep track of the current article and row type
            var currentArticle *goquery.Selection
            var currentRowType string
    
            // Iterate through the rows to scrape articles
            rows.Each(func(i int, row *goquery.Selection) {
                class, _ := row.Attr("class")
                style, _ := row.Attr("style")
    
                if class == "athing" {
                    // This is an article row
                    currentArticle = row
                    currentRowType = "article"
                } else if currentRowType == "article" {
                    // This is the details row
                    if currentArticle != nil {
                        // Extract information from the current article and details row
                        titleElem := currentArticle.Find("span.title")
                        articleTitle := titleElem.Text()
                        articleURL, _ := titleElem.Find("a").Attr("href")
    
                        subtext := row.Find("td.subtext")
                        points := subtext.Find("span.score").Text()
                        author := subtext.Find("a.hnuser").Text()
                        timestamp := subtext.Find("span.age").AttrOr("title", "")
                        commentsElem := subtext.Find("a:contains('comments')")
                        comments := strings.TrimSpace(commentsElem.Text())
    
                        // Print the extracted information
                        fmt.Println("Title:", articleTitle)
                        fmt.Println("URL:", articleURL)
                        fmt.Println("Points:", points)
                        fmt.Println("Author:", author)
                        fmt.Println("Timestamp:", timestamp)
                        fmt.Println("Comments:", comments)
                        fmt.Println(strings.Repeat("-", 50))  // Separating articles
                    }
    
                    // Reset the current article and row type
                    currentArticle = nil
                    currentRowType = ""
                } else if style == "height:5px" {
                    // This is the spacer row, skip it
                    return
                }
            })
        } else {
            fmt.Println("Failed to retrieve the page. Status code:", res.StatusCode)
        }
    }

    This works well as a learning exercise, but it is easy to see that a scraper like this is prone to getting blocked, since every request comes from a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have used this simple API to solve the headache of IP blocks.

    The whole thing can be accessed through a simple API from any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can simply fetch the data and parse it in any language like Node or PHP, or with any framework like Scrapy or Nutch, by calling the API URL with rendering enabled.

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
