The Ultimate JSoup Scala Cheatsheet

Oct 31, 2023 · 3 min read

JSoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data from HTML documents using DOM traversal and CSS selectors.
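
To get oriented, here is a minimal end-to-end sketch (the HTML string and selector are just placeholders):

import org.jsoup._
import org.jsoup.nodes.Document

val html = """<html><body><div class="content">Hello</div></body></html>"""
val doc: Document = Jsoup.parse(html)
println(doc.select("div.content").text()) // prints: Hello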

Getting Started

Import JSoup:

import org.jsoup._
import org.jsoup.nodes.{Document, Element}
import org.jsoup.select.Elements

Parse HTML:

val doc: Document = Jsoup.parse(html)

Select elements:

val elements: Elements = doc.select("div.class")

Extract text:

val text = doc.body().text()

Update text:

doc.body().text("New text")

Selecting Elements

By CSS query:

doc.select("div.content")

By tag:

doc.getElementsByTag("img")

By id:

doc.getElementById("header")

By attribute:

doc.getElementsByAttribute("href")

Custom filters:

doc.select(".txt").filter(el => el.text.length > 10)

Traversing

Navigate to parent:

element.parent()

Navigate to children:

element.children()

Sideways to siblings:

element.nextElementSibling()
element.previousElementSibling()

Manipulation

Set text:

element.text("new text")

Set HTML:

element.html("<span>new html</span>")

Add class:

element.addClass("highlighted")

Remove class:

element.removeClass("highlighted")

Remove element:

element.remove()

Attributes

Get attribute:

val href = element.attr("href")

Set attribute:

element.attr("href", "link.html")

Remove attribute:

element.removeAttr("class")

Get all attributes:

val attrs = element.attributes()

Examples

Extract text:

doc.select("p").forEach(p => {
  println(p.text())
})

Extract links:

doc.select("a[href]").forEach(a => {
  val href = a.attr("href")
  println(href)
})

Change image src:

doc.select("img").forEach(img => {
  img.attr("src", "new-img.jpg")
})

Validation

Check that HTML contains only whitelisted content (jsoup's built-in validation is whitelist-based):

import org.jsoup.safety.Whitelist

val isValid = Jsoup.isValid(bodyHtml, Whitelist.basic())
if (!isValid) {
  // handle invalid input
}

Connection Settings

Custom user-agent:

val connection = Jsoup.connect(url).userAgent("Bot")

Custom headers:

connection.headers(Map("Auth" -> "token").asJava) // asJava from scala.jdk.CollectionConverters._

Timeout:

connection.timeout(10*1000) // 10 seconds
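
Putting the connection settings together (the URL and header value are placeholders):

val doc = Jsoup.connect("https://example.com")
  .userAgent("Bot")
  .header("Auth", "token")
  .timeout(10 * 1000)
  .get()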

Advanced Usage

Async fetching (jsoup's calls block, so wrap them in a Future):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

Future(Jsoup.connect(url).get()).onComplete {
  case Success(doc) => // handle result
  case Failure(e)   => // handle error
}

Multi-threading:

// process pages concurrently
// (on Scala 2.13+, `.par` needs the scala-parallel-collections module and
// `import scala.collection.parallel.CollectionConverters._`)
docs.par.foreach(doc => {
  // extract data
})

Common Use Cases

Extract all links:

doc.select("a[href]").forEach(a -> {
    println(a.attr("href"));
})

Extract text from paragraphs:

doc.select("p").forEach(p -> {
    println(p.text());
})

Extract images:

doc.select("img").forEach(img -> {
    String src = img.attr("src");
    // download image from src
})

Submit a form:

val res: Connection.Response = Jsoup.connect(url)
  .data("username", "example")
  .data("password", "secret")
  .method(Connection.Method.POST)
  .execute()

Log in and maintain session:

val con: Connection = Jsoup.connect(url)
val res: Connection.Response = con.execute()

val cookies = res.cookies() // java.util.Map[String, String]

val doc: Document = Jsoup.connect(url2)
  .cookies(cookies)
  .get()

Tips and Best Practices

  • Use connection pooling for efficiency when fetching multiple pages
  • Avoid parsing the full document if you only need a small section
  • Cache parsed documents that are needed more than once (see the sketch below)
  • Process pages concurrently using multi-threading
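
A minimal caching sketch (the mutable map and fetch helper here are illustrative, not part of jsoup):

import scala.collection.mutable

// cache parsed documents by URL so each page is fetched and parsed at most once
val cache = mutable.Map.empty[String, Document]

def fetchCached(url: String): Document =
  cache.getOrElseUpdate(url, Jsoup.connect(url).get())
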
Advanced Topics

Custom request handling (execute the request yourself and inspect the response before parsing):

val con = Jsoup.connect(url)
val res = con.execute()
// handle status, headers, cookies, etc.
if (res.statusCode() == 200) {
  val doc = res.parse()
}

Control network settings:

con.timeout(5000)
con.proxy("webproxy", 8080)

JSoup on Android:

import org.jsoup.parser.Parser

val doc = Jsoup.parse(string, "", Parser.xmlParser())

Integrate with JSON:

// jsoup has no JSON binding of its own; collect extracted fields into a plain
// structure and serialize it with the JSON library of your choice
// (requires scala.jdk.CollectionConverters._ for asScala)
val record = Map(
  "title" -> doc.title(),
  "links" -> doc.select("a[href]").asScala.map(_.attr("href")).toList
)

Output cleaned HTML:

val cleanHtml = Jsoup.clean(dirtyHtml, baseUri, Whitelist.basic())

Sanitize untrusted input:

val safe = Jsoup.clean(unsafe, Whitelist.none())

Validate against DTDs/schemas:

// jsoup does not validate against DTDs or schemas; its built-in check is
// whitelist-based (see the Validation section above). Use an external
// validator for full document validation.
val ok = Jsoup.isValid(bodyHtml, Whitelist.relaxed())
