The Ultimate Goquery Cheatsheet

Oct 31, 2023 · 6 min read

Goquery is a Go library that provides jQuery-style DOM manipulation. It makes it easy to parse and extract data from HTML documents using a syntax similar to jQuery.

Getting Started

Import the goquery package:

import "github.com/PuerkitoBio/goquery"

Load HTML from a string:

doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlString))

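Or load from a file (goquery has no file-specific constructor, so open the file and pass it as a reader):

f, err := os.Open(filename) // check err, then defer f.Close()
doc, err := goquery.NewDocumentFromReader(f)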

Check for errors when loading the document.

Selection

Select elements similar to jQuery:

doc.Find(".class")
doc.Find("#id")
doc.Find("div")

Chaining:

doc.Find(".outer").Find(".inner")

Select by index (zero-based):

doc.Find("div").Eq(1) // the second div

Select parent/children:

doc.Find(".inner").Parent()
doc.Find("div").Children()

Filter selection:

doc.Find(".text").Filter(".inner-text")

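Find vs FindSelection:

Find takes a selector string and returns a new Selection holding every matching descendant of the elements in the current Selection. Called on the document, it searches the whole tree.

FindSelection takes an existing *Selection instead of a selector string and keeps only those of its elements that are descendants of the current Selection.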

Filter functions:

You can define a predicate function to perform custom filtering:

hasText := func(i int, s *goquery.Selection) bool {
  return s.Text() == "Some text"
}

doc.Find("div").FilterFunction(hasText)

Traversing

Traverse to siblings:

doc.Find(".inner").Next() // next sibling
doc.Find(".inner").Prev() // previous sibling

Traverse up and down:

doc.Find(".inner").Parent() // parent
doc.Find(".outer").Children() // children

Contents:

Get all child nodes, including text and comment nodes:

doc.Find("div").Contents()

Slice:

Get a sub-range of the selection by index (start inclusive, end exclusive):

doc.Find("li").Slice(2, 5) // items at indexes 2, 3 and 4

Manipulation

Get/set text:

doc.Find("h1").Text() // get
doc.Find("h1").SetText("New header") // set

Get/set HTML:

doc.Find("div").Html() // get inner HTML of the first match; returns (string, error)
doc.Find("div").SetHtml(`<span>New content</span>`) // set

Add/remove classes:

doc.Find(".outer").AddClass("container") // add class
doc.Find(".inner").RemoveClass("highlighted") // remove class

Empty:

Remove all child nodes:

doc.Find("ul").Empty()

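Append/Prepend:

Insert HTML inside each selected element, as its last or first child. Append and Prepend without the Html suffix take a selector and move the matching existing elements instead:

doc.Find("ul").AppendHtml("<li>New</li>")
doc.Find("ul").PrependHtml("<li>New</li>")

Wrap/Unwrap:

Wrap each selected element in a new parent element:

doc.Find("span").WrapHtml("<div></div>")

Remove the parent of each selected element, leaving the element in its place:

doc.Find("span").Unwrap()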

Attributes

Get an attribute value:

href, exists := doc.Find("a").Attr("href") // exists is false if the attribute is absent

Set an attribute value:

doc.Find("a").SetAttr("href", "new-url")

Remove an attribute:

doc.Find("table").RemoveAttr("width")

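Get all attributes of the first matched element (goquery has no Attributes method, so read them from the underlying html.Node):

attrs := doc.Find("div").Get(0).Attr // []html.Attribute; Get panics on an empty selection

Data:

Get custom data attributes (goquery has no Data method, so read the data-* attribute directly):

val, exists := doc.Find("div").Attr("data-myattr")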

Iteration

Iterate through selections:

doc.Find(".inner").Each(func(i int, s *goquery.Selection) {
  // do something with s
})

Helper iteration methods:

doc.Find(".inner").EachWithBreak(func(i int, s *goquery.Selection) bool {
  return false // false stops the iteration, true continues
})

texts := doc.Find(".inner").Map(func(i int, s *goquery.Selection) string {
  return s.Text() // collected into the returned []string
})

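Index loop:

Iterate a selection with a plain for loop (a Selection is not a slice, so it cannot be ranged over directly):

sel := doc.Find("li")
for i := 0; i < sel.Length(); i++ {
  fmt.Println(sel.Eq(i).Text()) // sel.Eq(i) is a *goquery.Selection
}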

Utilities

Serialize the first matched element's inner HTML (goquery.OuterHtml(sel) includes the element itself):

inner, err := doc.Find(".outer").Html()

Check if selection contains element:

doc.Find(".container").Has(".button").Length() > 0

Get number of elements in selection:

doc.Find(".items").Length()

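Clone:

Clone document:

newDoc := goquery.CloneDocument(doc)

Parse:

Build a document from an already parsed node tree:

root, err := html.Parse(r) // golang.org/x/net/html; r is any io.Reader
doc := goquery.NewDocumentFromNode(root)

Is/End:

Test a selection against a selector, or step back to the previous selection in a chain:

doc.Find("div").Is("div") // true when at least one element matches
doc.Find("ul").Find("li").End() // back to the ul selection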

Common Use Cases

Web Scraping:

Extract data from pages:

doc.Find(".titles").Each(func(i int, s *goquery.Selection) {
  title := s.Text()
  fmt.Println(title)
})

Parse HTML:

Process HTML documents:

doc.Find("a[rel='nofollow']").Each(func(i int, s *goquery.Selection) {
  s.Remove() // clean up HTML
})

Make Changes:

Modify HTML pages:

doc.Find("img").Each(func(i int, s *goquery.Selection) {
  s.SetAttr("src", newSrc) // set new img src
})

Remote HTML:

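Use with HTTP requests (NewDocumentFromResponse is deprecated; pass the response body to NewDocumentFromReader):

res, err := http.Get(url) // check err and res.StatusCode before parsing
defer res.Body.Close()
doc, err := goquery.NewDocumentFromReader(res.Body)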

Selection Strategies

Targeting Elements:

Unique IDs:

doc.Find("#header")

Known classes:

doc.Find(".product-listing")

Nested selections:

doc.Find("#container").Find(".row .product")

Dynamic Content:

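Re-parse after JavaScript (goquery does not run scripts, so render the page elsewhere first):

doc, _ = goquery.NewDocumentFromReader(browser.HTML()) // browser is a placeholder for a headless-browser client whose HTML() returns an io.Reader

Wait for element to appear:

sel := doc.Find(".loaded")
for sel.Length() == 0 {
   time.Sleep(1 * time.Second)
   doc = getNewDoc() // placeholder: re-fetch and re-parse the page
   sel = doc.Find(".loaded")
}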

Ads and Popups:

Remove unwanted elements:

doc.Find(".ad-banner").Remove()

Blocking:

Throttle requests:

time.Sleep(2 * time.Second) // slow down

Rotate user agents:

uas := []string{
  "Mozilla/5.0...",
  "Chrome/87.0...", // Go requires the trailing comma here
}

// cycle through uas
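
Cycling through the list could look like the round-robin sketch below. The uaRotator type, the newRequest helper and the user agent strings are placeholders invented for the example; substitute real browser user agents.

```go
package main

import (
	"fmt"
	"net/http"
)

// uaRotator hands out user agents in round-robin order.
// It is not safe for concurrent use; guard it with a mutex if shared.
// agents must be non-empty.
type uaRotator struct {
	agents []string
	next   int
}

// Next returns the next user agent and advances the rotation.
func (r *uaRotator) Next() string {
	ua := r.agents[r.next%len(r.agents)]
	r.next++
	return ua
}

// newRequest builds a GET request that carries the next user agent.
func newRequest(r *uaRotator, url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", r.Next())
	return req, nil
}

func main() {
	// Placeholder strings, not real user agents.
	r := &uaRotator{agents: []string{"example-agent-a/1.0", "example-agent-b/1.0"}}
	for i := 0; i < 3; i++ {
		req, err := newRequest(r, "https://example.com/")
		if err != nil {
			panic(err)
		}
		fmt.Println(req.Header.Get("User-Agent")) // a, b, then a again
	}
}
```

Send each request with an http.Client's Do method, then pass the response body to goquery.NewDocumentFromReader as usual.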

JavaScript Content:

Use browser automation:

doc, _ := goquery.NewDocumentFromReader(browser.HTML()) // browser: placeholder headless-browser client

Use API if available:

data := getAPIJSON() // placeholder; assumed to return []byte holding an HTML fragment
doc, _ := goquery.NewDocumentFromReader(bytes.NewReader(data))

Tips and Tricks

  • Use a struct to store extracted data
  • Check Length() before indexing into a selection; Get(0) and Nodes[0] panic when it is empty
  • Prefix class names to avoid conflicts
  • Reuse selections stored in variables
  • Do not mutate a document from several goroutines without synchronization
  • Use a helper library like gocache for caching
  • goquery has no built-in debug logging; print Length() or goquery.OuterHtml(sel) to see what a selector matched
  • Always check for errors when loading documents

FAQ

When to use goquery vs standard packages?

Use goquery for jQuery-style HTML querying and manipulation. Use standard packages such as encoding/xml for XML parsing and html/template for creating HTML output.

What are some alternatives to goquery?

colly for scraping, gopherjs+jQuery for client side DOM manipulation.

How to test and validate selections?

Use Is() to test a selection against a selector, Length() to check size, Each() to iterate.

How to mock documents for testing?

Use goquery.NewDocumentFromReader() with strings.NewReader() to load test HTML.

Summary

  • Load HTML with NewDocumentFromReader (NewDocumentFromResponse is deprecated).
  • Use Find, FindSelection and traversal methods to select nodes.
  • Manipulate the DOM tree and extract data.
  • Iterate through selections with Each, Map etc.
  • Use helper methods like Text, Attr, Length to get information.
  • Select strategically and handle dynamic content.

Goquery brings the power of jQuery to Go for easy HTML manipulation and extraction. With Go's speed and concurrency, goquery is great for web scraping and building web apps.
