Web Scraping New York Times News Headlines in Go

Dec 6, 2023 · 6 min read

Web scraping is the process of automatically extracting data from websites using code. It allows you to harvest and analyze content from the web on a large scale. Go is a great language for writing web scrapers thanks to its fast performance and concise syntax.

In this tutorial, we'll walk through a simple Go program to scrape news article headlines and links from the New York Times homepage. Along the way, we'll learn web scraping concepts that apply to many projects regardless of language or site.

Overview

Here's what our scraper will do:

  1. Send a GET request to retrieve the NYT homepage
  2. Parse the HTML content
  3. Use Go's goquery library to find all article containers
  4. Extract the headline and link from each
  5. Print the scraped data

Now let's break it down section-by-section!

Imports

We import three packages:

  • fmt - prints output
  • net/http - makes HTTP requests
  • goquery - parses HTML (install it with go get github.com/PuerkitoBio/goquery)

    import (
        "fmt"
        "net/http"

        "github.com/PuerkitoBio/goquery"
    )

Struct to Store Scraped Data

We define a struct called Article to hold the data we want to scrape - the title and link:

    type Article struct {
        Title string
        Link  string
    }

Main Function

The entry point of execution is the main() function:

    func main() {

    }

All web scraping logic will go inside here.

Constructing the HTTP Request

To make a request, we need a URL and a user agent header:

    url := "https://www.nytimes.com/"
    userAgent := "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

We use a real browser's user agent instead of Go's default so the request looks like it comes from an ordinary visitor.

Next, we create a client, build the GET request, and execute it:

    client := &http.Client{}

    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        panic(err)
    }
    req.Header.Set("User-Agent", userAgent)

    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

We handle any errors and defer closing the response body so it is released when we're done.

Parsing the HTML

The website content lives in resp.Body as raw HTML. We use goquery to parse it into a navigable document structure:

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        panic(err)
    }

Inspecting the Page

We now inspect the page with Chrome's developer tools to see how the HTML is structured. You can see that the articles are contained inside section tags with the class story-wrapper.

Extracting Data

goquery allows jQuery-style element selection and traversal. We find all article containers, loop through them, and pull out what we need:

    var articles []Article
    doc.Find("section.story-wrapper").Each(func(i int, s *goquery.Selection) {
        title := s.Find("h3.indicate-hover").Text()
        link, _ := s.Find("a.css-9mylee").Attr("href")

        article := Article{Title: title, Link: link}
        articles = append(articles, article)
    })

Here we use the class selectors of key elements to target the titles and links. Note that Attr returns two values - the attribute and a boolean indicating whether it exists - so we discard the second with _.

Printing Output

Finally, we can print or store the scraped data:

    for _, a := range articles {
        fmt.Println(a.Title)
        fmt.Println(a.Link)
    }

And we have a working scraper!

Challenges You May Encounter

There are a few common issues to look out for with web scrapers:

  • HTTP errors from connectivity issues or being blocked
  • Changes to page layouts and selectors causing breaks
  • Going too fast and getting rate limited

Practical solutions include:

  • Robust error handling
  • Regularly re-testing scrapers
  • Implementing throttling mechanisms

With diligence, these can be overcome.

Next Steps

Some ideas for building on this:

  • Scrape additional data like subtitles, authors, etc.
  • Save scraped data to a file or database
  • Expand to scrape multiple sites
  • Generalize scrapers based on site templates
  • Schedule periodic scrapes

Web scraping is a learn-by-doing skill. Experiment with different sites and data points to grow!

Full Code

For reference, here is the complete program:

    package main
    
    import (
        "fmt"
        "net/http"
    
        "github.com/PuerkitoBio/goquery"
    )
    
    // Article struct to store title and link 
    type Article struct {
        Title string
        Link string
    }
    
    func main() {
    
        // URL and user agent
        url := "https://www.nytimes.com/"
        userAgent := "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    
        // Create HTTP client
        client := &http.Client{}
    
        // Build GET request
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            panic(err)
        }
    
        // Set user agent 
        req.Header.Set("User-Agent", userAgent)
    
        // Send request
        resp, err := client.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
    
        // Check status code  
        if resp.StatusCode != 200 {
            fmt.Println("Failed to retrieve web page. Status code:", resp.StatusCode)
            return
        }
    
        // Parse response HTML
        doc, err := goquery.NewDocumentFromReader(resp.Body)
        if err != nil {
            panic(err) 
        }
    
        // Find articles
        var articles []Article
        doc.Find("section.story-wrapper").Each(func(i int, s *goquery.Selection) {
    
            // Extract title and link
            title := s.Find("h3.indicate-hover").Text() 
            link, _ := s.Find("a.css-9mylee").Attr("href")
    
            // Create article
            article := Article{Title: title, Link: link}
    
            // Append article to results 
            articles = append(articles, article) 
        })
    
        // Print articles
        for _, a := range articles {
            fmt.Println(a.Title)
            fmt.Println(a.Link)  
        }
    
    }

In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell the requests come from the same browser.

If you get a little more advanced, you will realize that the server can simply block your IP, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA solving technology

Hundreds of our customers have solved the headache of IP blocks with this simple API, which can be accessed from any programming language:

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
