Web Scraping New York Times News Headlines in Go

Dec 6, 2023 · 6 min read

Web scraping is the process of automatically extracting data from websites using code. It allows you to harvest and analyze content from the web on a large scale. Go is a great language for writing web scrapers thanks to its fast performance and concise syntax.

In this tutorial, we'll walk through a simple Go program to scrape news article headlines and links from the New York Times homepage. Along the way, we'll learn web scraping concepts that apply to many projects regardless of language or site.

Overview

Here's what our scraper will do:

  1. Send a GET request to retrieve the NYT homepage
  2. Parse the HTML content
  3. Use Go's goquery library to find all article containers
  4. Extract the headline and link from each
  5. Print the scraped data

Now let's break it down section-by-section!

Imports

We import three packages:

  • fmt - prints output
  • net/http - makes HTTP requests
  • goquery - parses HTML (install it with go get github.com/PuerkitoBio/goquery)

    import (
        "fmt"
        "net/http"

        "github.com/PuerkitoBio/goquery"
    )

Struct to Store Scraped Data

We define a struct called Article to hold the data we want to scrape - the title and link:

    type Article struct {
        Title string
        Link  string
    }

Main Function

The entry point of execution is the main() function:

    func main() {

    }

All web scraping logic will go inside here.

Constructing the HTTP Request

To make a request, we need a URL and a user agent header:

    url := "https://www.nytimes.com/"
    userAgent := "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

We use a real browser's user agent instead of Go's default so the request looks like it comes from an ordinary visitor.

Next, we create a client, build the GET request, and execute it:

    client := &http.Client{}

    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        panic(err)
    }
    req.Header.Set("User-Agent", userAgent)

    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

We handle any errors and defer closing the response body so it is released when we're done.

Parsing the HTML

The website content lives in resp.Body as raw HTML. We use goquery to parse it into a navigable document structure:

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        panic(err)
    }

Inspecting the Page

We now inspect the page with Chrome's developer tools to see how the HTML is structured. You can see that the articles are contained inside section tags with the class story-wrapper.

Extracting Data

goquery allows jQuery-style element selection and traversal. We find all article containers, loop through them, and pull out what we need:

    var articles []Article
    doc.Find("section.story-wrapper").Each(func(i int, s *goquery.Selection) {
        title := s.Find("h3.indicate-hover").Text()
        link, _ := s.Find("a.css-9mylee").Attr("href")

        article := Article{Title: title, Link: link}
        articles = append(articles, article)
    })

Here we use the class selectors of key elements to target the titles and links. Note that Attr returns two values - the attribute and a boolean indicating whether it exists - so we discard the second with _.

Printing Output

Finally, we can print or store the scraped data:

    for _, a := range articles {
        fmt.Println(a.Title)
        fmt.Println(a.Link)
    }

And we have a working scraper!

Challenges You May Encounter

There are a few common issues to look out for with web scrapers:

  • HTTP errors from connectivity issues or being blocked
  • Changes to page layouts and selectors causing breaks
  • Going too fast and getting rate limited

Practical solutions include:

  • Robust error handling
  • Regularly re-testing scrapers
  • Implementing throttling mechanisms

With diligence, these can be overcome.

Next Steps

Some ideas for building on this:

  • Scrape additional data like subtitles, authors, etc.
  • Save scraped data to a file or database
  • Expand to scrape multiple sites
  • Generalize scrapers based on site templates
  • Schedule periodic scrapes

Web scraping is a learn-by-doing skill. Experiment with different sites and data points to grow!

Full Code

For reference, here is the complete program:

    package main
    
    import (
        "fmt"
        "net/http"
    
        "github.com/PuerkitoBio/goquery"
    )
    
    // Article struct to store title and link 
    type Article struct {
        Title string
        Link string
    }
    
    func main() {
    
        // URL and user agent
        url := "https://www.nytimes.com/"
        userAgent := "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    
        // Create HTTP client
        client := &http.Client{}
    
        // Build GET request
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            panic(err)
        }
    
        // Set user agent 
        req.Header.Set("User-Agent", userAgent)
    
        // Send request
        resp, err := client.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
    
        // Check status code  
        if resp.StatusCode != 200 {
            fmt.Println("Failed to retrieve web page. Status code:", resp.StatusCode)
            return
        }
    
        // Parse response HTML
        doc, err := goquery.NewDocumentFromReader(resp.Body)
        if err != nil {
            panic(err) 
        }
    
        // Find articles
        var articles []Article
        doc.Find("section.story-wrapper").Each(func(i int, s *goquery.Selection) {
    
            // Extract title and link
            title := s.Find("h3.indicate-hover").Text() 
            link, _ := s.Find("a.css-9mylee").Attr("href")
    
            // Create article
            article := Article{Title: title, Link: link}
    
            // Append article to results 
            articles = append(articles, article) 
        })
    
        // Print articles
        for _, a := range articles {
            fmt.Println(a.Title)
            fmt.Println(a.Link)  
        }
    
    }

In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell the requests come from the same browser.

If you get a little more advanced, you will realize that the server can simply block your IP, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA solving technology

Hundreds of our customers have solved the headache of IP blocks with this simple API, which can be accessed from any programming language:

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
