The Ultimate JSoup Kotlin Cheatsheet

Oct 31, 2023 · 3 min read

JSoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data from HTML documents using DOM traversal and CSS selectors.
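As a quick taste, here is a minimal sketch that parses an HTML string and pulls data out with CSS selectors (the markup and class names are made up for illustration):

```kotlin
import org.jsoup.Jsoup

fun main() {
    val html = """<div class="post"><h1>Hello</h1><a href="/about">About</a></div>"""
    val doc = Jsoup.parse(html)

    // selectFirst returns the first match, or null if nothing matches
    val title = doc.selectFirst(".post h1")?.text()
    println(title) // Hello

    // pull an attribute value from a matched element
    val link = doc.selectFirst("a")?.attr("href")
    println(link) // /about
}
```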

Getting Started

Add dependency:

implementation("org.jsoup:jsoup:1.15.3")

Parse HTML:

val html = "<html>...</html>"
val doc = Jsoup.parse(html)

Select elements:

val elements = doc.select(".content")

Extract text:

val text = doc.body().text()

Selecting Elements

By CSS query:

doc.select(".main")

By tag:

doc.getElementsByTag("img")

By id:

doc.getElementById("header")

By attribute:

doc.getElementsByAttribute("href")

Custom filters:

doc.select(".text").filter { it.text().length > 10 }

Traversing

Navigate up:

element.parent()

Navigate down:

element.children()

Sideways:

element.nextElementSibling()
element.previousElementSibling()
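Putting the traversal calls together on a tiny document (the markup is illustrative):

```kotlin
import org.jsoup.Jsoup

fun main() {
    val doc = Jsoup.parse("<ul><li>a</li><li>b</li><li>c</li></ul>")
    val second = doc.select("li")[1] // the middle <li>

    println(second.parent()?.tagName())              // ul
    println(second.previousElementSibling()?.text()) // a
    println(second.nextElementSibling()?.text())     // c
    println(second.children().size)                  // 0 (no child elements)
}
```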

Manipulation

Set text:

element.text("new text")

Set HTML:

element.html("<span>new html</span>")

Add class:

element.addClass("highlighted")

Remove class:

element.removeClass("highlighted")

Remove element:

element.remove()

Attributes

Get attribute:

val href = element.attr("href")

Set attribute:

element.attr("href", "link")

Remove attribute:

element.removeAttr("class")
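A common gotcha with attributes: `href` and `src` values are often relative. If you pass a base URI when parsing, `absUrl` resolves them to absolute URLs (the URL below is just an example):

```kotlin
import org.jsoup.Jsoup

fun main() {
    val html = """<a href="/contact">Contact</a>"""
    val doc = Jsoup.parse(html, "https://example.com/") // second arg is the base URI
    val link = doc.selectFirst("a")!!

    println(link.attr("href"))   // /contact
    println(link.absUrl("href")) // https://example.com/contact
}
```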

Examples

Extract text from paragraphs:

doc.select("p").forEach {
  println(it.text())
}

Extract links:

doc.select("a[href]").forEach {
  println(it.attr("href"))
}

Change image src:

doc.select("img").forEach {
  it.attr("src", "new.png")
}

Validation

jsoup does not validate against a DTD, but its parser can track parse errors:

val parser = Parser.htmlParser().setTrackErrors(10)
val doc = Jsoup.parse(html, "", parser)
if (parser.errors.isNotEmpty()) {
  // handle errors
}

Advanced Usage

Background fetching (jsoup's get() is blocking, so run it off the main thread, e.g. inside a coroutine):

val doc = withContext(Dispatchers.IO) {
  Jsoup.connect(url).get()
}
// process doc

Custom headers:

val headers = mapOf("Authorization" to "token")
val doc = Jsoup.connect(url).headers(headers).get()

More Element Selection Examples

By element:

doc.getElementsByTag("div")

By ID and class:

doc.getElementById("header")
doc.getElementsByClass("article")

Combinators:

doc.select(".article td")

Attribute value:

doc.select("[width=500]")
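Beyond exact matches, jsoup supports prefix, suffix, contains, and regex attribute selectors, which are handy on real pages (the markup below is illustrative):

```kotlin
import org.jsoup.Jsoup

fun main() {
    val doc = Jsoup.parse("""<a href="https://example.com">x</a> <img src="logo.png">""")

    println(doc.select("a[href^=https]").size)    // starts-with: 1
    println(doc.select("img[src$=.png]").size)    // ends-with: 1
    println(doc.select("[src~=\\w+\\.png]").size) // regex match: 1
}
```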

Navigating the DOM

Go up:

element.parent()
element.closest(".content")

Sideways:

element.nextSibling()     // any node, including text and comments
element.previousSibling() // use next/previousElementSibling() for elements only

All ancestors:

element.parents()

Modifying the Document

Add element:

doc.body().appendChild(newElement)

Remove element:

element.remove()

Set attribute:

element.attr("href", "link")

Set inline style (jsoup has no CSS engine; write the style attribute):

element.attr("style", "color: red")

Validation

Inspect tracked parse errors:

val parser = Parser.htmlParser().setTrackErrors(100)
val doc = Jsoup.parse(html, "", parser)
if (parser.errors.isEmpty()) {
  println("Document parsed without errors")
} else {
  parser.errors.forEach { println(it) }
}

Configure parser settings:

val settings = ParseSettings(true, true) // preserve tag and attribute case
val caseSensitiveParser = Parser.htmlParser().settings(settings)

Output cleaned HTML

Clean the HTML and output it after making changes:

val cleanedHtml = doc.html()
// Output to file, network, etc.
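Note that doc.html() just serializes the document as-is. If by "clean" you mean sanitizing untrusted HTML, jsoup has a dedicated API for that: `Jsoup.clean` with a `Safelist` (called `Whitelist` in older versions):

```kotlin
import org.jsoup.Jsoup
import org.jsoup.safety.Safelist

fun main() {
    val dirty = "<p>Hi <script>alert('x')</script><b>there</b></p>"
    // Safelist.basic() keeps simple text formatting tags; script is stripped
    val clean = Jsoup.clean(dirty, Safelist.basic())
    println(clean)
}
```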

Comments and CDATA

Get comments (comments are nodes, not elements, so collect them from child nodes):

val comments = doc.getAllElements().flatMap { it.childNodes() }.filterIsInstance<Comment>()

Get CDATA sections (these only survive when parsing with the XML parser):

val cdata = doc.getAllElements().flatMap { it.childNodes() }.filterIsInstance<CDataNode>()

Working with forms

Get form by ID:

val form = doc.getElementById("login-form")!! // throws if the form is missing

Get input by name:

val usernameInput = form.select("input[name=username]")

Set input value:

usernameInput.`val`("myuser") // `val` is a Kotlin keyword, so backticks are needed
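For full form workflows, jsoup's FormElement can assemble the key/value pairs a browser would submit. A sketch (the form markup and field names are made up):

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.FormElement

fun main() {
    val html = """
        <form id="login-form" action="/login" method="post">
          <input name="username" value="">
          <input name="password" value="">
        </form>"""
    val doc = Jsoup.parse(html, "https://example.com/")
    val form = doc.selectFirst("#login-form") as FormElement

    form.select("input[name=username]").`val`("myuser")
    form.select("input[name=password]").`val`("s3cret")

    // formData() collects the fields that would be submitted
    form.formData().forEach { println("${it.key()}=${it.value()}") }
    // form.submit().post() would actually send the request
}
```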

Multi-threaded scraping

Scrape in multiple threads:

val urls = listOf("url1", "url2")

val exec = Executors.newFixedThreadPool(10)

val futures = urls.map { url ->
  exec.submit(Callable { Jsoup.connect(url).get() }) // parse in parallel
}
val docs = futures.map { it.get() } // wait for all results

exec.shutdown()

Efficient selection

Cache selections:

val headers = doc.select("#headers").first() // cache in variable

Avoid re-parsing:

doc.select(".item").remove() // doesn't re-parse entire doc

Parser configuration

Custom parser settings:

val parser = Parser.htmlParser()
  .setTrackErrors(10) // number of errors to track

Timeouts are set on the connection, not the parser:

Jsoup.connect(url).timeout(10 * 1000) // 10 second timeout
