The Ultimate Floki Cheatsheet for Elixir

Oct 31, 2023 · 4 min read

Floki makes it easy to parse and query HTML documents in Elixir. It uses CSS selectors and tree traversal for HTML manipulation.

Getting Started

Add dependency:

def deps do
  [
    {:floki, "~> 0.35"}
  ]
end

Parse HTML:

html = File.read!("index.html")
doc = Floki.parse_document!(html)

Find elements:

Floki.find(doc, "div.content")

Get text:

Floki.text(doc)

Selecting

By CSS selector:

Floki.find(doc, "div.main")

By tag name:

Floki.find(doc, "img")

By id:

Floki.find(doc, "#header")

By attribute:

Floki.find(doc, "[href]")

Traversing

Get parent:

# Tuple nodes hold no reference to their parent, so select the parent from the document
Floki.find(doc, "div.content")

Get children:

Floki.children(element)

Get siblings:

# Following siblings via the CSS sibling combinators (~ for all, + for the next one)
Floki.find(doc, "h2 ~ p")

Manipulation

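Floki has no insert, replace, or append helpers. Trees are plain lists and tuples, so you rebuild them with Floki.traverse_and_update/2 or ordinary list operations.

Insert element (directly after target_el):

Floki.traverse_and_update(doc, fn
  {tag, attrs, kids} when is_binary(tag) ->
    kids = Enum.flat_map(kids, fn kid -> if kid == target_el, do: [kid, new_el], else: [kid] end)
    {tag, attrs, kids}
  other -> other
end)

Replace element:

Floki.traverse_and_update(doc, fn node -> if node == target_el, do: new_el, else: node end)

Remove element:

Floki.filter_out(doc, "div.ad")

Update attribute:

Floki.attr(doc, "img", "src", fn _old -> "new.jpg" end)

Append html:

doc ++ Floki.parse_fragment!("<div>New div</div>")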
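As a minimal end-to-end sketch of tree manipulation, the script below removes a node, rewrites an attribute, and appends a child using Floki.filter_out/2, Floki.attr/4, and Floki.traverse_and_update/2. The sample markup, the "main" id, and the new_el node are made up for illustration, and Mix.install is only there so the snippet runs as a standalone script.

```elixir
# Assumption: run as a script; Mix.install fetches Floki on first use.
Mix.install([{:floki, "~> 0.35"}])

# Hypothetical sample markup, inlined so no file is needed.
html = ~s(<div id="main"><img src="old.jpg"><p class="ad">Buy now</p><p>Keep me</p></div>)
doc = Floki.parse_fragment!(html)

# Remove: drop every node that matches the selector.
without_ads = Floki.filter_out(doc, "p.ad")

# Update attribute: rewrite src on each matching img.
new_src = Floki.attr(without_ads, "img", "src", fn _old -> "new.jpg" end)

# Insert: append a hand-built node as the last child of the div with id "main".
new_el = {"p", [], ["Added"]}

result =
  Floki.traverse_and_update(new_src, fn
    {"div", attrs, children} = node ->
      if {"id", "main"} in attrs do
        {"div", attrs, children ++ [new_el]}
      else
        node
      end

    # Leave every other node (elements, comments, doctype) untouched.
    other ->
      other
  end)

IO.puts(Floki.raw_html(result))
```

Every step returns a new tree rather than mutating the old one, so doc itself is unchanged after the script runs.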

Parsing HTML

From string:

html = "<html>...</html>"
doc = Floki.parse_document!(html)

From file:

doc = Floki.parse_document!(File.read!("index.html"))

From URL:

doc = Floki.parse_document!(HTTPoison.get!(url).body)

Extracting Data

Extract text:

Floki.text(doc)

Find links:

Floki.find(doc, "a[href]") |> Floki.attribute("href")

Extract images:

Floki.find(doc, "img") |> Floki.attribute("src")
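To see the extraction calls working together, here is a small self-contained sketch. The HTML is a made-up sample inlined in the script, and Mix.install is an assumption about how you run it (as a standalone .exs file rather than inside a Mix project).

```elixir
# Assumption: run as a script; Mix.install fetches Floki on first use.
Mix.install([{:floki, "~> 0.35"}])

# Hypothetical sample page, inlined so the example needs no file or network access.
html = """
<html>
  <body>
    <div class="content">
      <a href="/first">First</a>
      <a href="/second">Second</a>
      <a name="anchor-only">No link</a>
      <img src="logo.png">
    </div>
  </body>
</html>
"""

doc = Floki.parse_document!(html)

# Only anchors that carry an href attribute are matched.
links = doc |> Floki.find("a[href]") |> Floki.attribute("href")

# Image sources.
images = doc |> Floki.find("img") |> Floki.attribute("src")

# Link text, one entry per matched anchor.
labels = doc |> Floki.find("a[href]") |> Enum.map(&Floki.text/1)

IO.inspect(links, label: "links")
IO.inspect(images, label: "images")
IO.inspect(labels, label: "labels")
```

The anchor without an href is skipped because the selector a[href] requires the attribute to be present.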

Advanced Usage

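Parse fragments:

{:ok, doc} = Floki.parse_fragment(html_fragment)

Encode special chars:

Floki.raw_html(doc) # serializes the tree to a string; entities are encoded by default

Decode entities:

html |> Floki.parse_fragment!() |> Floki.text() # parsing decodes entities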

Inspect HTML tree:

IO.inspect(doc) # print HTML tree

More Examples

Find by class name:

Floki.find(doc, ".article")

Nest selectors:

Floki.find(doc, "div.content ul li a")

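Traverse tree:

# Tuple nodes have no parent lookup; select the parent from doc by selector instead
children = Floki.children(element)
following = Floki.find(doc, "h2 ~ p") # siblings via the ~ combinator

Manipulate HTML:

# Inserting means rebuilding the parent's children inside traverse_and_update/2
Floki.traverse_and_update(doc, fn node -> if node == old_img_el, do: new_img_el, else: node end)
Floki.filter_out(doc, "div.ad")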

Extract text, links, images:

text = Floki.text(doc)
links = Floki.find(doc, "a[href]") |> Floki.attribute("href")
imgs = Floki.find(doc, "img") |> Floki.attribute("src")

Advanced Usage

Parse fragments:

fragment = "<div>...</div>"
{:ok, doc} = Floki.parse_fragment(fragment)

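Escape HTML (raw_html takes a parsed tree and encodes entities when serializing):

html = "<div>10 > 5</div>"
escaped = html |> Floki.parse_fragment!() |> Floki.raw_html()

Unescape HTML (entities are decoded during parsing):

html = "&lt;div&gt;Hello&lt;/div&gt;"
unescaped = html |> Floki.parse_fragment!() |> Floki.text()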

Inspect tree:

html
|> Floki.parse_document!()
|> IO.inspect()

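Large Documents

Floki parses the whole document eagerly; it has no lazy parsing mode. For large files, parse once and reuse the tree for every query:

html = File.read!("large.html")
doc = Floki.parse_document!(html)

# Both queries reuse the same parsed tree
meta = Floki.find(doc, "meta")
head = Floki.find(doc, "head")

If parsing itself is the bottleneck, Floki can be configured to use the fast_html or html5ever parser instead of the default mochiweb_html.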

Scoping Find

Floki has a single query function, Floki.find. The selector decides how much of the tree is matched:

Floki.find(doc, "meta") # every meta element in the document
Floki.find(doc, "head meta") # only meta elements inside head

Scope the selector when you know where the elements live, so that unrelated matches elsewhere in the document are excluded.

LiveView Integration

Floki can parse and rewrite HTML on the server inside a Phoenix LiveView before the result is rendered for the client:

def handle_info(%{topic: "new_html"}, socket) do
  html = ExternalApi.fetch_html()
  doc = Floki.parse_document!(html)

  # Manipulate doc

  # Serialize the tree back to a string; handle_info must return {:noreply, socket}
  html = Floki.raw_html(doc)
  {:noreply, assign(socket, :html, html)}
end

HTML to CSV/JSON

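Floki extracts the data and a separate library encodes it. The CSV example assumes the csv Hex package; JSON.encode!/1 is built into Elixir 1.18+ (use Jason.encode!/1 on older versions); post_to_map/1 is a helper you define yourself.

html
|> Floki.parse_document!()
|> Floki.find("table tr")
# Turn each row into a list of cell strings, which is what CSV.encode expects
|> Enum.map(fn row -> row |> Floki.find("th, td") |> Enum.map(&Floki.text/1) end)
|> CSV.encode()
|> Enum.each(&IO.write/1)

html
|> Floki.parse_document!()
|> Floki.find("div.post")
|> Enum.map(&post_to_map/1)
|> JSON.encode!()
|> IO.write()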
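The extraction half of these pipelines can be tried without any encoding library. The sketch below turns a table into a list of rows and a post into a map; the sample markup, the field names, and this particular post_to_map are illustrative assumptions, not part of Floki.

```elixir
# Assumption: run as a script; Mix.install fetches Floki on first use.
Mix.install([{:floki, "~> 0.35"}])

# Hypothetical sample page.
html = """
<html>
  <body>
    <table>
      <tr><th>Name</th><th>Qty</th></tr>
      <tr><td>apple</td><td>3</td></tr>
    </table>
    <div class="post"><h2>Hello</h2><a href="/hello">Read</a></div>
  </body>
</html>
"""

doc = Floki.parse_document!(html)

# One list of cell strings per table row, ready for a CSV encoder.
rows =
  doc
  |> Floki.find("table tr")
  |> Enum.map(fn row ->
    row |> Floki.find("th, td") |> Enum.map(&Floki.text/1)
  end)

# Hypothetical helper: pick the fields you want out of one post node.
post_to_map = fn post ->
  %{
    title: post |> Floki.find("h2") |> Floki.text(),
    url: post |> Floki.attribute("a", "href") |> List.first()
  }
end

# One map per post, ready for a JSON encoder.
posts = doc |> Floki.find("div.post") |> Enum.map(post_to_map)

IO.inspect(rows, label: "rows")
IO.inspect(posts, label: "posts")
```

Keeping extraction separate from encoding makes it easy to swap the output format later without touching the selectors.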

Invalid HTML

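Floki's default parser (mochiweb_html) tolerates a good deal of malformed markup, but it does not follow the HTML5 error-recovery rules. For markup that must be parsed the way a browser would parse it, configure the html5ever or fast_html parser, for example config :floki, :html_parser, Floki.HTMLParser.Html5ever.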

Idempotent HTML

Sort each element's attributes so that equivalent documents serialize to the same string:

# find_and_update passes {tag, attrs} for each match and returns the whole tree
doc
|> Floki.find_and_update("div", fn {tag, attrs} ->
  {tag, Enum.sort(attrs)}
end)
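A quick way to check the normalization is to compare two fragments that differ only in attribute order. This is a sketch under the assumption that Floki.find_and_update/3 and Floki.raw_html/1 are available in your Floki version; the fragments and the normalize helper are made up for the example.

```elixir
# Assumption: run as a script; Mix.install fetches Floki on first use.
Mix.install([{:floki, "~> 0.35"}])

# Two hypothetical fragments that differ only in attribute order.
a = Floki.parse_fragment!(~s(<div id="x" class="y">Hi</div>))
b = Floki.parse_fragment!(~s(<div class="y" id="x">Hi</div>))

# Sort the attributes of every div, then serialize back to a string.
normalize = fn tree ->
  tree
  |> Floki.find_and_update("div", fn {tag, attrs} -> {tag, Enum.sort(attrs)} end)
  |> Floki.raw_html()
end

IO.inspect(Floki.raw_html(a) == Floki.raw_html(b), label: "equal before sorting")
IO.inspect(normalize.(a) == normalize.(b), label: "equal after sorting")
```

The selector here only covers div elements; to normalize a whole document you would apply the same sort to every element you care about.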
