Scraping Hacker News with Go

Jan 21, 2024 · 8 min read

In this article, we will be walking through Go code that scrapes articles from the Hacker News homepage and extracts key data points about each article. The goals are to:

  1. Explain how the code works from start to finish, focusing on areas that beginners typically find confusing
  2. Describe in detail how each data field is extracted from the HTML using Go's goquery library
  3. Provide the full code at the end as a runnable reference

This will teach you core concepts around web scraping with Go that you can apply to build scrapers for other sites.

Prerequisites

To run this code, you need:

  • Go installed and set up properly on your system
  • The goquery library, installed with go get github.com/PuerkitoBio/goquery

With those in place, you can jump right into the explanations and example code below.

    Step-by-Step Walkthrough

    Below I will walk through exactly what this code does, section by section. The focus is on specifics around interacting with HTML and extracting data, since that tends to trip up Go beginners.

    Imports

    We start by importing the necessary packages:

    import (
      "fmt"
      "net/http"
      "strings"

      "github.com/PuerkitoBio/goquery"
    )
    
  • "fmt" is for printing output
  • "net/http" is used to make the HTTP request
  • "strings" is used for cleaning up some extracted text
  • "github.com/PuerkitoBio/goquery" is the goquery library for parsing and querying HTML

    Making the HTTP Request

    Next we define the URL we want to scrape and make the HTTP GET request:

    // Define the URL
    url := "https://news.ycombinator.com/"
    
    // Send HTTP GET request
    res, err := http.Get(url)
    
    // Handle errors
    if err != nil {
      fmt.Println("Failed to retrieve page:", err)
      return
    }
    
    defer res.Body.Close()
    

    We use Go's built-in http.Get() to send the request. This returns an http.Response that we can check for status codes and errors before proceeding.

    We also make sure to defer closing the response body to avoid resource leaks.

    Parsing the HTML with goquery

    Once we have the page content, we can parse it using goquery, a Go library that brings jQuery-style selectors to HTML parsing:

    if res.StatusCode == 200 {
    
      // Parse HTML using goquery
      doc, err := goquery.NewDocumentFromReader(res.Body)
    
      if err != nil {
        fmt.Println("Failed to parse HTML:", err)
        return
      }
    
      // Find all rows
      rows := doc.Find("tr")
    
      // ...rest of scraping logic
    }
    

    Assuming a 200 OK status code, we load the HTML into a goquery Document. This lets us query the parsed HTML with jQuery-style CSS selectors.

    Inspecting the page

    If you inspect the page, you can see that each item is housed in a tr element with the class athing.

    For example, doc.Find("tr") grabs all table row elements from the page.

    Extracting Article Data

    Now this is where beginners tend to get lost - how to actually extract each data field from the HTML using goquery.

    I will walk through this section slowly, explaining how every piece of data is selected:

    // Initialize variables
    var currentArticle *goquery.Selection
    var currentRowType string
    
    // Iterate through rows
    rows.Each(func(i int, row *goquery.Selection) {
    
      // Get the class attribute to identify the row type
      class, _ := row.Attr("class")
    
      // Logic for identifying article vs detail rows
      if class == "athing" {
    
        currentArticle = row
        currentRowType = "article"
    
      } else if currentRowType == "article" {
    
        // This is a details row
    
        // Extract data only if currentArticle is set
        if currentArticle != nil {
    
          // -- Article title
          titleElem := currentArticle.Find("span.title")
          articleTitle := titleElem.Text()
    
          // -- Article URL
          articleURL, _ := titleElem.Find("a").Attr("href")
    
          // -- Points, author, comments
          subtext := row.Find("td.subtext")
          points := subtext.Find("span.score").Text()
          author := subtext.Find("a.hnuser").Text()
    
          commentsElem := subtext.Find("a:contains('comments')")
          comments := strings.TrimSpace(commentsElem.Text())
    
          // -- Timestamp
          timestamp := subtext.Find("span.age").AttrOr("title", "")
    
          // Print everything
          fmt.Println("Title:", articleTitle)
          fmt.Println("URL:", articleURL)
          fmt.Println("Points:", points)
          fmt.Println("Author:", author)
          fmt.Println("Timestamp:", timestamp)
          fmt.Println("Comments:", comments)
          fmt.Println(strings.Repeat("-", 50))
    
        }
    
        // Reset so the next athing row is handled fresh
        currentArticle = nil
        currentRowType = ""
    
      }
    
    })
    

    Let's break this down field-by-field:

    Article Title

    titleElem := currentArticle.Find("span.title")
    articleTitle := titleElem.Text()
    

    Article titles live in a span element with the class title inside the athing row.

    We use .Find() to grab that element from currentArticle, then .Text() to extract the text content.

    Article URL

    articleURL, _ := titleElem.Find("a").Attr("href")
    

    The title text itself links to the article. So we find the anchor inside titleElem, and grab the href attribute which contains the URL.

    Points

    Points are inside a span element with the class score in the subtext row. We grab its text and print it.

    Author

    The author anchor has a class hnuser. We take the text of this element.

    Comments

    Interesting selector here! We look for any anchor that contains the text comments. This avoids having to hardcode a class. We then clean up whitespace and print.

    Timestamp

    The timestamp is stored in a title attribute rather than text. We use .AttrOr() to attempt to extract this, defaulting to empty string if missing.

    And that covers all the data extraction logic!

    The key ideas are:

  • Use row variables to iterate through articles
  • Identify article vs detail rows through their classes
  • Use Find() and Text() to extract text elements
  • Use Attr() and AttrOr() to extract attributes
  • Combine classes, element types, and text content to build unique selectors

    With these building blocks, you can query elements in powerful ways.

    Full Code

    For reference, here is the complete code:

    package main
    
    import (
        "fmt"
        "net/http"
        "strings"
    
        "github.com/PuerkitoBio/goquery"
    )
    
    func main() {
        // Define the URL of the Hacker News homepage
        url := "https://news.ycombinator.com/"
    
        // Send a GET request to the URL
        res, err := http.Get(url)
        if err != nil {
            fmt.Println("Failed to retrieve the page:", err)
            return
        }
        defer res.Body.Close()
    
        // Check if the request was successful (status code 200)
        if res.StatusCode == 200 {
            // Parse the HTML content of the page using goquery
            doc, err := goquery.NewDocumentFromReader(res.Body)
            if err != nil {
                fmt.Println("Failed to parse HTML:", err)
                return
            }
    
            // Find all rows in the table
            rows := doc.Find("tr")
    
            // Initialize variables to keep track of the current article and row type
            var currentArticle *goquery.Selection
            var currentRowType string
    
            // Iterate through the rows to scrape articles
            rows.Each(func(i int, row *goquery.Selection) {
                class, _ := row.Attr("class")
                style, _ := row.Attr("style")
    
                if class == "athing" {
                    // This is an article row
                    currentArticle = row
                    currentRowType = "article"
                } else if currentRowType == "article" {
                    // This is the details row
                    if currentArticle != nil {
                        // Extract information from the current article and details row
                        titleElem := currentArticle.Find("span.title")
                        articleTitle := titleElem.Text()
                        articleURL, _ := titleElem.Find("a").Attr("href")
    
                        subtext := row.Find("td.subtext")
                        points := subtext.Find("span.score").Text()
                        author := subtext.Find("a.hnuser").Text()
                        timestamp := subtext.Find("span.age").AttrOr("title", "")
                        commentsElem := subtext.Find("a:contains('comments')")
                        comments := strings.TrimSpace(commentsElem.Text())
    
                        // Print the extracted information
                        fmt.Println("Title:", articleTitle)
                        fmt.Println("URL:", articleURL)
                        fmt.Println("Points:", points)
                        fmt.Println("Author:", author)
                        fmt.Println("Timestamp:", timestamp)
                        fmt.Println("Comments:", comments)
                        fmt.Println(strings.Repeat("-", 50))  // Separating articles
                    }
    
                    // Reset the current article and row type
                    currentArticle = nil
                    currentRowType = ""
                } else if style == "height:5px" {
                    // This is the spacer row, skip it
                    return
                }
            })
        } else {
            fmt.Println("Failed to retrieve the page. Status code:", res.StatusCode)
        }
    }

    This works well as a learning exercise, but it is easy to see that a scraper like this is prone to getting blocked, since every request comes from a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have used this simple API to solve the headache of IP blocks.

    The whole thing can be accessed through a simple API from any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can simply fetch the data and parse it in any language like Node or PHP, or with any framework like Scrapy or Nutch, by calling the API URL with rendering enabled.

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
