The Ultimate JSoup Scala Cheatsheet

Oct 31, 2023 · 3 min read

JSoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data from HTML documents using DOM traversal and CSS selectors.
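
To get oriented, here is a minimal end-to-end sketch (the HTML string and selector are just placeholders):

import org.jsoup._
import org.jsoup.nodes.Document

val html = """<html><body><div class="content">Hello</div></body></html>"""
val doc: Document = Jsoup.parse(html)
println(doc.select("div.content").text()) // prints: Hello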

Getting Started

Import JSoup:

import org.jsoup._
import org.jsoup.nodes.{Document, Element}
import org.jsoup.select.Elements

Parse HTML:

val doc: Document = Jsoup.parse(html)

Select elements:

val elements: Elements = doc.select("div.class")

Extract text:

val text = doc.body().text()

Update text:

doc.body().text("New text")

Selecting Elements

By CSS query:

doc.select("div.content")

By tag:

doc.getElementsByTag("img")

By id:

doc.getElementById("header")

By attribute:

doc.getElementsByAttribute("href")

Custom filters:

doc.select(".txt").filter(el => el.text.length > 10)

Traversing

Navigate to parent:

element.parent()

Navigate to children:

element.children()

Sideways to siblings:

element.nextElementSibling()
element.previousElementSibling()

Manipulation

Set text:

element.text("new text")

Set HTML:

element.html("<span>new html</span>")

Add class:

element.addClass("highlighted")

Remove class:

element.removeClass("highlighted")

Remove element:

element.remove()

Attributes

Get attribute:

val href = element.attr("href")

Set attribute:

element.attr("href", "link.html")

Remove attribute:

element.removeAttr("class")

Get all attributes:

val attrs = element.attributes()

Examples

Extract text:

doc.select("p").forEach(p => {
  println(p.text())
})

Extract links:

doc.select("a[href]").forEach(a => {
  val href = a.attr("href")
  println(href)
})

Change image src:

doc.select("img").forEach(img => {
  img.attr("src", "new-img.jpg")
})

Validation

Check that HTML contains only whitelisted content (jsoup's built-in validation is whitelist-based):

import org.jsoup.safety.Whitelist

val isValid = Jsoup.isValid(bodyHtml, Whitelist.basic())
if (!isValid) {
  // handle invalid input
}

Connection Settings

Custom user-agent:

val connection = Jsoup.connect(url).userAgent("Bot")

Custom headers:

connection.headers(Map("Auth" -> "token").asJava) // asJava from scala.jdk.CollectionConverters._

Timeout:

connection.timeout(10*1000) // 10 seconds
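
Putting the connection settings together (the URL and header value are placeholders):

val doc = Jsoup.connect("https://example.com")
  .userAgent("Bot")
  .header("Auth", "token")
  .timeout(10 * 1000)
  .get()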

Advanced Usage

Async fetching (jsoup's calls block, so wrap them in a Future):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

Future(Jsoup.connect(url).get()).onComplete {
  case Success(doc) => // handle result
  case Failure(e)   => // handle error
}

Multi-threading:

// process pages concurrently
// (on Scala 2.13+, `.par` needs the scala-parallel-collections module and
// `import scala.collection.parallel.CollectionConverters._`)
docs.par.foreach(doc => {
  // extract data
})

Common Use Cases

Extract all links:

doc.select("a[href]").forEach(a -> {
    println(a.attr("href"));
})

Extract text from paragraphs:

doc.select("p").forEach(p -> {
    println(p.text());
})

Extract images:

doc.select("img").forEach(img -> {
    String src = img.attr("src");
    // download image from src
})

Submit a form:

val res: Connection.Response = Jsoup.connect(url)
  .data("username", "example")
  .data("password", "secret")
  .method(Connection.Method.POST)
  .execute()

Log in and maintain session:

val con: Connection = Jsoup.connect(url)
val res: Connection.Response = con.execute()

val cookies = res.cookies() // java.util.Map[String, String]

val doc: Document = Jsoup.connect(url2)
  .cookies(cookies)
  .get()

Tips and Best Practices

  • Use connection pooling for efficiency when fetching multiple pages
  • Avoid parsing the full document if you only need a small section
  • Cache parsed documents that are needed more than once (see the sketch below)
  • Process pages concurrently using multi-threading
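
A minimal caching sketch (the mutable map and fetch helper here are illustrative, not part of jsoup):

import scala.collection.mutable

// cache parsed documents by URL so each page is fetched and parsed at most once
val cache = mutable.Map.empty[String, Document]

def fetchCached(url: String): Document =
  cache.getOrElseUpdate(url, Jsoup.connect(url).get())
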
Advanced Topics

Custom request handling (execute the request yourself and inspect the response before parsing):

val con = Jsoup.connect(url)
val res = con.execute()
// handle status, headers, cookies, etc.
if (res.statusCode() == 200) {
  val doc = res.parse()
}

Control network settings:

con.timeout(5000)
con.proxy("webproxy", 8080)

JSoup on Android:

import org.jsoup.parser.Parser

val doc = Jsoup.parse(string, "", Parser.xmlParser())

Integrate with JSON:

// jsoup has no JSON binding of its own; collect extracted fields into a plain
// structure and serialize it with the JSON library of your choice
// (requires scala.jdk.CollectionConverters._ for asScala)
val record = Map(
  "title" -> doc.title(),
  "links" -> doc.select("a[href]").asScala.map(_.attr("href")).toList
)

Output cleaned HTML:

val cleanHtml = Jsoup.clean(dirtyHtml, baseUri, Whitelist.basic())

Sanitize untrusted input:

val safe = Jsoup.clean(unsafe, Whitelist.none())

Validate against DTDs/schemas:

// jsoup does not validate against DTDs or schemas; its built-in check is
// whitelist-based (see the Validation section above). Use an external
// validator for full document validation.
val ok = Jsoup.isValid(bodyHtml, Whitelist.relaxed())
