How to Scrape Reddit Posts in Go

Jan 9, 2024 · 7 min read

This article explains in detail how to write a program in Go that downloads a Reddit page and extracts information about posts. It's aimed at beginners who want a step-by-step guide to web scraping Reddit with Go.

The page we are scraping is the Reddit front page at https://www.reddit.com.

Prerequisites

To follow along, you'll need:

  • Go installed on your machine
  • The goquery package, which can be installed by running:

    go get github.com/PuerkitoBio/goquery
    

    Overview

    Here's a quick overview of what the program does:

    1. Defines the Reddit URL to download and a User-Agent string
    2. Makes a GET request to the URL using the User-Agent
    3. Checks if the request succeeded (200 status code)
    4. Saves the HTML content from the response to a file
    5. Parses the HTML content using goquery
    6. Finds div elements with certain classes using a selector
    7. Loops through the elements, extracting various attributes
    8. Prints the extracted data - permalinks, comments count, etc.

    Now let's walk through exactly how it works.

    Importing Packages

    We import several packages that provide the functionality we need:

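    import (
      "fmt"
      "io"
      "net/http"
      "os"
      "strings"

      "github.com/PuerkitoBio/goquery"
    )
    
  • fmt - printing formatted output
  • io - reading the response body (io.ReadAll replaces ioutil.ReadAll, deprecated since Go 1.16)
  • net/http - making HTTP requests
  • os - writing the HTML file (os.WriteFile replaces ioutil.WriteFile, deprecated since Go 1.16)
  • strings - wrapping the HTML string in a reader for goquery
  • goquery - HTML parsing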

    Defining Variables

    Next we set some key variables:

    // URL of Reddit page
    redditURL := "https://www.reddit.com"
    
    // User agent header
    userAgent := "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    

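    The user agent mimics a desktop Chrome browser. Without it, Go's HTTP client identifies itself as Go-http-client, and some websites block requests that look like scraping bots. Setting this header makes our request look like a normal browser visit.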
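
    To see what the header actually does, here is a minimal sketch that sends a request to a local test server and prints the User-Agent the server received. The helper names fetchWithUserAgent and newEchoServer are made up for this illustration, and the local server merely stands in for a real website.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchWithUserAgent sends a GET request carrying the given User-Agent
// header and returns the response body as a string.
func fetchWithUserAgent(client *http.Client, url, userAgent string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", userAgent)

	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// newEchoServer starts a local test server (hypothetical stand-in for a
// website) that replies with the User-Agent header it received.
func newEchoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.UserAgent())
	}))
}

func main() {
	srv := newEchoServer()
	defer srv.Close()

	seen, err := fetchWithUserAgent(srv.Client(), srv.URL, "Mozilla/5.0 (example)")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("server saw:", seen)
}
```

    Commenting out the req.Header.Set line and running it again shows the default value that Go would otherwise send.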

    Making the GET Request

    We make a GET request to the Reddit URL using the defined user agent:

    // HTTP client
    client := &http.Client{}
    
    req, err := http.NewRequest("GET", redditURL, nil)
    
    if err != nil {
      // ... handle error
    }
    
    req.Header.Set("User-Agent", userAgent)
    
    resp, err := client.Do(req)
    
    if err != nil {
     // ... handle error
    }
    

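    We create an HTTP client, build a GET request for the URL, set the User-Agent header on it, and send it with client.Do, which returns the response. Once a response comes back, its body must be closed when we are done with it; the full code does this with defer resp.Body.Close().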

    Checking Status

    It's important to check if the request succeeded before trying to read the response:

    if resp.StatusCode == 200 {
    
      // Request succeeded!
      // Proceed to parse response...
    
    } else {
    
      // Handle error
      fmt.Printf("Failed, status code %d\n", resp.StatusCode)
    }
    

    A status code of 200 means success. This script treats any other code as a failure, for example 403 or 429 when the request is blocked or rate-limited.

    Saving HTML

    We can save the HTML content to a file for inspection:

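    bodyBytes, err := io.ReadAll(resp.Body)
    if err != nil {
      // ... handle error
    }
    
    htmlContent := string(bodyBytes)
    
    filename := "reddit_page.html"
    
    // bodyBytes is already a []byte, so it can be written directly
    err = os.WriteFile(filename, bodyBytes, 0644)
    if err != nil {
      // ... handle error
    }
    
    fmt.Printf("Saved to %s\n", filename)
    

    This reads the response body into a byte slice, keeps a string copy for parsing, and writes the bytes to a file. io.ReadAll and os.WriteFile are the replacements for ioutil.ReadAll and ioutil.WriteFile, which have been deprecated since Go 1.16.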

    Parsing HTML with goquery

    Now we have the HTML content, we can parse it and extract data using the goquery package.

    First we load it into a goquery document:

    reader := strings.NewReader(htmlContent)
    
    doc, err := goquery.NewDocumentFromReader(reader)
    

    This parses the HTML string into a queryable document.

    Selecting Elements

    Inspecting the elements

    Inspecting the HTML in Chrome shows that each post is rendered as a shreddit-post element that carries a set of classes specific to posts.

    goquery allows us to use CSS selectors to find elements, just like jQuery.

    Let's go through this complex selector step-by-step:

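    doc.Find("div.block.relative.cursor-pointer.bg-neutral-background.focus-within\\:bg-neutral-background-hover.hover\\:bg-neutral-background-hover.xs\\:rounded-\\[16px\\].p-md.my-2xs.nd\\:visible")
    

    Several of these class names contain colons and square brackets. Those characters have their own meaning in a CSS selector (a colon starts a pseudo-class, a bracket starts an attribute selector), so they must be escaped with a backslash, written as \\ inside a Go string literal. Left unescaped, the selector is invalid and goquery matches nothing.

    Breaking it down:

  • div.block - Find div elements with class block
  • .relative - Also match class relative
  • .cursor-pointer - and class cursor-pointer
  • .bg-neutral-background - and class bg-neutral-background
  • .focus-within\:bg-neutral-background-hover - and the class literally named focus-within:bg-neutral-background-hover (in the page's styles it sets the background while focus is within the card)
  • .hover\:bg-neutral-background-hover - and the class named hover:bg-neutral-background-hover (the background while hovered)
  • .xs\:rounded-\[16px\] - and the class named xs:rounded-[16px] (rounded corners on extra small screens)
  • .p-md - and class p-md (padding size md)
  • .my-2xs - and class my-2xs (vertical top/bottom margin size 2xs)
  • .nd\:visible - and the class named nd:visible; the colon is part of the class name, not a visibility filter

    This selects "card" div elements that carry every one of the listed classes. The selector only tests which class names are present; it does not check whether an element is currently hovered, focused, or visible.

    Why go through each part? Selectors can get complex, so understanding what each piece matches is important.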
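
    Class names that contain a colon or square brackets cannot be pasted into a selector as they are, because those characters have their own meaning in CSS and need a backslash in front of them. The sketch below builds a selector string from raw class names. The helper names escapeClass and classSelector are hypothetical, and only the three characters that occur in this article's class names are handled.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeClass backslash-escapes the characters that would otherwise end a
// class name inside a CSS selector: ':' starts a pseudo-class and '[' starts
// an attribute selector.
func escapeClass(class string) string {
	r := strings.NewReplacer(":", `\:`, "[", `\[`, "]", `\]`)
	return r.Replace(class)
}

// classSelector builds "tag.class1.class2..." from raw class names.
func classSelector(tag string, classes ...string) string {
	var b strings.Builder
	b.WriteString(tag)
	for _, c := range classes {
		b.WriteString(".")
		b.WriteString(escapeClass(c))
	}
	return b.String()
}

func main() {
	sel := classSelector("div", "block", "hover:bg-neutral-background-hover", "xs:rounded-[16px]")
	fmt.Println(sel)
	// Prints: div.block.hover\:bg-neutral-background-hover.xs\:rounded-\[16px\]
}
```

    The resulting string can be passed to doc.Find directly, which avoids hand-writing doubled backslashes in a Go string literal.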

    Looping Through Elements

    We can now loop through the matching elements and extract data from each:

    doc.Find("div.block...").Each(func(i int, s *goquery.Selection) {
    
      // Extract data from current element as s
    
    })
    

    The .Each method calls the function for each element found, passing in the loop index and the current selection.

    Let's look at what we extract from each card element:

    Extracting the permalink

    We get the permalink URL using the permalink attribute:

    permalink, _ := s.Attr("permalink")
    

    Getting the content URL

    The content URL is stored in content-href:

    contentHref, _ := s.Attr("content-href")
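
    Both values come back as plain strings. If a permalink turns out to be a site-relative path such as /r/videos/comments/..., which is an assumption here and not something the article confirms, it can be resolved against the base URL with the standard net/url package. The helper name absoluteURL is made up for this sketch.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteURL resolves href against base. An href that is already absolute
// is returned unchanged.
func absoluteURL(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parsing base URL: %w", err)
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", fmt.Errorf("parsing href: %w", err)
	}
	return b.ResolveReference(ref).String(), nil
}

func main() {
	full, err := absoluteURL("https://www.reddit.com", "/r/videos/comments/xxxxxxxx/")
	if err != nil {
		fmt.Println("failed:", err)
		return
	}
	fmt.Println(full)
}
```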
    

    Fetching comment count

    The number of comments is in comment-count:

    commentCount, _ := s.Attr("comment-count")
    

    Scraping the post title

    We can query within the current selection to get the title element, then extract its text:

    postTitle := s.Find("div[slot=title]").Text()
    

    This selects the div with slot="title" within s and gets the element's inner text.
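
    Text returns the text exactly as it appears in the markup, so a title that is indented or wrapped in the HTML comes back with that whitespace included. A minimal sketch of one way to tidy it, using only the standard library (cleanText is a hypothetical helper name):

```go
package main

import (
	"fmt"
	"strings"
)

// cleanText trims s and collapses every run of whitespace, including
// newlines, into a single space.
func cleanText(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	raw := "\n      My favorite\n      cat video   \n"
	fmt.Printf("%q\n", cleanText(raw)) // "My favorite cat video"
}
```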

    Getting author and score

    The author name and score are read from attributes in the same way:

    author, _ := s.Attr("author")
    score, _ := s.Attr("score")
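
    Attr returns every value as a string, plus a boolean that reports whether the attribute was present. The snippets above discard that boolean. If the score or comment count is needed as a number, a sketch along these lines converts it while accounting for missing or malformed values. parseCount is a made-up helper name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCount converts an attribute value such as "5823" to an int.
// exists is the second value returned by Attr. A missing or malformed
// value yields 0 and false.
func parseCount(value string, exists bool) (int, bool) {
	if !exists {
		return 0, false
	}
	n, err := strconv.Atoi(strings.TrimSpace(value))
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	fmt.Println(parseCount("15338", true)) // 15338 true
	fmt.Println(parseCount("", false))     // 0 false
	fmt.Println(parseCount("n/a", true))   // 0 false
}
```

    With goquery this would be called as parseCount(s.Attr("score")), since Go passes both return values of Attr straight through.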
    

    Printing the data

    Finally, we print out each extracted attribute, for example:

    Permalink: https://www.reddit.com/r/videos/comments/xxxxxxxx
    
    Content Href: https://v.redd.it/yyyyyyyyy
    
    Comment Count: 5823
    
    Post Title: My favorite cat video
    
    Author: cool_dude94
    
    Score: 15338
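
    One way to keep the printing step tidy is to collect the extracted values in a struct and format them in a single place. The sketch below is an optional variation on the script, with a hypothetical Post type and formatPost function. All fields stay strings because that is what the attributes return.

```go
package main

import "fmt"

// Post groups the values extracted from one post element.
type Post struct {
	Permalink    string
	ContentHref  string
	CommentCount string
	Title        string
	Author       string
	Score        string
}

// formatPost renders a post using the same labels the script prints.
func formatPost(p Post) string {
	return fmt.Sprintf(
		"Permalink: %s\nContent Href: %s\nComment Count: %s\nPost Title: %s\nAuthor: %s\nScore: %s\n",
		p.Permalink, p.ContentHref, p.CommentCount, p.Title, p.Author, p.Score,
	)
}

func main() {
	p := Post{
		Permalink:    "https://www.reddit.com/r/videos/comments/xxxxxxxx",
		ContentHref:  "https://v.redd.it/yyyyyyyyy",
		CommentCount: "5823",
		Title:        "My favorite cat video",
		Author:       "cool_dude94",
		Score:        "15338",
	}
	fmt.Print(formatPost(p))
}
```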
    

    That covers the key parts of how this Reddit scraping script works.

    Full Code

    Here is the complete code to scrape Reddit posts in Go covered in this article:

    package main
    
    import (
        "fmt"
        "io"
        "net/http"
        "os"
        "strings"
    
        "github.com/PuerkitoBio/goquery"
    )
    
    func main() {
        // Define the Reddit URL you want to download
        redditURL := "https://www.reddit.com"
    
        // Define a User-Agent header
        userAgent := "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    
        // Create an HTTP client and a GET request carrying the User-Agent header
        client := &http.Client{}
        req, err := http.NewRequest("GET", redditURL, nil)
        if err != nil {
            fmt.Printf("Failed to create a request: %v\n", err)
            return
        }
        req.Header.Set("User-Agent", userAgent)
    
        // Send the GET request to the URL with the User-Agent header
        resp, err := client.Do(req)
        if err != nil {
            fmt.Printf("Failed to send GET request: %v\n", err)
            return
        }
        defer resp.Body.Close()
    
        // Check if the request was successful (status code 200)
        if resp.StatusCode == 200 {
            // Read the HTML content of the page
            // (io.ReadAll replaces the deprecated ioutil.ReadAll)
            bodyBytes, err := io.ReadAll(resp.Body)
            if err != nil {
                fmt.Printf("Failed to read response body: %v\n", err)
                return
            }
            htmlContent := string(bodyBytes)
    
            // Specify the filename to save the HTML content
            filename := "reddit_page.html"
    
            // Save the HTML content to a file
            // (os.WriteFile replaces the deprecated ioutil.WriteFile)
            err = os.WriteFile(filename, bodyBytes, 0644)
            if err != nil {
                fmt.Printf("Failed to save HTML content to a file: %v\n", err)
            } else {
                fmt.Printf("Reddit page saved to %s\n", filename)
            }
    
            // Parse the HTML content
            reader := strings.NewReader(htmlContent)
            doc, err := goquery.NewDocumentFromReader(reader)
            if err != nil {
                fmt.Printf("Failed to parse HTML content: %v\n", err)
                return
            }
    
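            // Find all blocks with the specified tag and classes.
            // Colons and brackets inside class names are escaped with a backslash;
            // unescaped, the selector is invalid and matches nothing.
            doc.Find("div.block.relative.cursor-pointer.bg-neutral-background.focus-within\\:bg-neutral-background-hover.hover\\:bg-neutral-background-hover.xs\\:rounded-\\[16px\\].p-md.my-2xs.nd\\:visible").Each(func(i int, s *goquery.Selection) {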
                permalink, _ := s.Attr("permalink")
                contentHref, _ := s.Attr("content-href")
                commentCount, _ := s.Attr("comment-count")
                postTitle := s.Find("div[slot=title]").Text()
                author, _ := s.Attr("author")
                score, _ := s.Attr("score")
    
                // Print the extracted information for each block
                fmt.Println("Permalink:", permalink)
                fmt.Println("Content Href:", contentHref)
                fmt.Println("Comment Count:", commentCount)
                fmt.Println("Post Title:", postTitle)
                fmt.Println("Author:", author)
                fmt.Println("Score:", score)
                fmt.Println()
            })
        } else {
            fmt.Printf("Failed to download Reddit page (status code %d)\n", resp.StatusCode)
        }
    }
    

    By breaking down each step, I hope it's clearer how we:

  • Make a request and handle the response
  • Use goquery to parse and query HTML
  • Extract attributes, text and more from elements
  • Loop through multiple matching elements