The Ultimate Floki Cheatsheet for Elixir

Oct 31, 2023 · 4 min read

Floki makes it easy to parse and query HTML documents in Elixir. It uses CSS selectors and tree traversal for HTML manipulation.

Getting Started

Add dependency:

def deps do
  [
    {:floki, "~> 0.35"}
  ]
end

Parse HTML:

html = File.read!("index.html")
doc = Floki.parse_document!(html)

Find elements:

Floki.find(doc, "div.content")

Get text:

Floki.text(doc)
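
Putting the steps above together, here is a minimal end-to-end sketch. It assumes a standalone script that pulls Floki in through Mix.install; the version requirement and the sample markup are illustrative and not part of the original snippets.

```elixir
# Assumption: run as a standalone script; the version requirement is illustrative.
Mix.install([{:floki, "~> 0.35"}])

html = ~s(<div class="content"><p>Hello <b>Floki</b></p></div>)

# parse_fragment!/1 returns a list of {tag, attributes, children} tuples
doc = Floki.parse_fragment!(html)

# find/2 returns every node that matches the CSS selector
matches = Floki.find(doc, "div.content")

# text/1 concatenates the text of all descendant nodes
text = Floki.text(matches)

IO.inspect(doc, label: "tree")
IO.inspect(text, label: "text")
```

Seeing the raw tuples once makes the later traversal and update examples easier to follow, since every Floki function works on this same structure.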

Selecting

By CSS selector:

Floki.find(doc, "div.main")

By tag name:

Floki.find(doc, "img")

By id:

Floki.find(doc, "#header")

By attribute:

Floki.find(doc, "[href]")

Traversing

Get parent:

# Node tuples hold no parent reference, so select the parent from the document instead
[parent | _] = Floki.find(doc, "div.content")

Get children:

Floki.children(element)

Get siblings:

# Use the sibling combinators: "+" for the next sibling, "~" for all following siblings
Floki.find(doc, "h1 ~ p")
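
Because a Floki node is a plain {tag, attributes, children} tuple with no link back to its parent, child and sibling relationships are usually expressed through selector combinators. The sketch below is one way to do that; the markup, the variable names and the version requirement are assumptions made for illustration.

```elixir
# Assumption: run as a standalone script; the version requirement is illustrative.
Mix.install([{:floki, "~> 0.35"}])

html = """
<ul id="menu">
  <li class="first">One</li>
  <li>Two</li>
  <li>Three</li>
</ul>
"""

doc = Floki.parse_fragment!(html)
[menu] = Floki.find(doc, "#menu")

# Element children only; whitespace text nodes are skipped
children = Floki.children(menu, include_text: false)

# ">" selects direct children, "~" selects all following siblings
items = Floki.find(doc, "#menu > li")
after_first = Floki.find(doc, "li.first ~ li")

IO.inspect(Enum.map(items, &Floki.text/1), label: "items")
IO.inspect(Enum.map(after_first, &Floki.text/1), label: "after first")
```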

Manipulation

Insert element:

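# Returning a list from the callback splices those nodes in place of the original
Floki.traverse_and_update(doc, fn node -> if node == target_el, do: [node, new_el], else: node end)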

Replace element:

Floki.traverse_and_update(doc, fn node -> if node == target_el, do: new_el, else: node end)

Remove element:

Floki.filter_out(doc, "div.ad")

Update attribute:

Floki.attr(doc, "img", "src", fn _old -> "new.jpg" end)

Append html:

doc ++ Floki.parse_fragment!("<div>New div</div>") # appends at the top level of the tree
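
Every Floki update returns a new tree rather than changing the old one in place. The sketch below chains three documented functions, filter_out/2, attr/4 and traverse_and_update/2, on a small fragment. The markup, selectors and version requirement are assumptions chosen for illustration.

```elixir
# Assumption: run as a standalone script; the version requirement is illustrative.
Mix.install([{:floki, "~> 0.35"}])

html = ~s(<div><img src="old.jpg"><p class="ad">Buy now</p><p>Keep me</p></div>)
doc = Floki.parse_fragment!(html)

# Remove every node that matches the selector
without_ads = Floki.filter_out(doc, "p.ad")

# Rewrite one attribute on every matching node
updated = Floki.attr(without_ads, "img", "src", fn _old -> "new.jpg" end)

# Replace whole nodes: each element tuple is passed to the callback
replaced =
  Floki.traverse_and_update(updated, fn
    {"p", _attrs, _children} -> {"span", [], ["replaced"]}
    other -> other
  end)

IO.puts(Floki.raw_html(replaced))
```

The original doc is left unchanged by all three calls, so each intermediate tree can be inspected or reused on its own.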

Parsing HTML

From string:

html = "<html>...</html>"
doc = Floki.parse_document!(html)

From file:

doc = Floki.parse_document!(File.read!("index.html"))

From URL:

doc = Floki.parse_document!(HTTPoison.get!(url).body)

Extracting Data

Extract text:

Floki.text(doc)

Find links:

Floki.find(doc, "a[href]") |> Floki.attribute("href")

Extract images:

Floki.find(doc, "img") |> Floki.attribute("src")
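
When both the link text and the target are needed, the two calls can be combined per element. This is a small sketch with made-up markup; the map keys and the version requirement are assumptions.

```elixir
# Assumption: run as a standalone script; the version requirement is illustrative.
Mix.install([{:floki, "~> 0.35"}])

html = ~s(<a href="/a">A</a><a>no href</a><a href="/b">B</a>)
doc = Floki.parse_fragment!(html)

# "a[href]" skips anchors without an href, so the match below is safe
links =
  for a <- Floki.find(doc, "a[href]") do
    [href] = Floki.attribute(a, "href")
    %{text: Floki.text(a), href: href}
  end

IO.inspect(links)
```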

Advanced Usage

Parse fragments:

{:ok, doc} = Floki.parse_fragment(html_fragment)

Encode special chars:

Floki.raw_html(doc) # serializes the tree to a string, entity-encoding text by default

Decode entities:

html |> Floki.parse_fragment!() |> Floki.text() # entities are decoded during parsing

Inspect HTML tree:

IO.inspect(doc) # print HTML tree

More Examples

Find by class name:

Floki.find(doc, ".article")

Nest selectors:

Floki.find(doc, "div.content ul li a")

Traverse tree:

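# Node tuples hold no parent reference, so select the parent from the document instead
[parent | _] = Floki.find(doc, "div.content")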
children = Floki.children(element)

Manipulate HTML:

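# Returning a list from the callback splices those nodes in place of the original
Floki.traverse_and_update(doc, fn node -> if node == content_div, do: [node, new_div], else: node end)
Floki.traverse_and_update(doc, fn node -> if node == old_img_el, do: new_img_el, else: node end)
Floki.filter_out(doc, "div.ad")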

Extract text, links, images:

text = Floki.text(doc)
links = Floki.find(doc, "a[href]") |> Floki.attribute("href")
imgs = Floki.find(doc, "img") |> Floki.attribute("src")

Advanced Usage

Parse fragments:

fragment = "<div>...</div>"
{:ok, doc} = Floki.parse_fragment(fragment)

Escape HTML:

html = "<div>10 > 5</div>"
escaped = html |> Floki.parse_fragment!() |> Floki.raw_html()

Unescape HTML:

html = "&lt;div&gt;Hello&lt;/div&gt;"
unescaped = html |> Floki.parse_fragment!() |> Floki.text()

Inspect tree:

html
|> Floki.parse_document!()
|> IO.inspect()
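
Escaping and unescaping both go through the parsed tree: entities are decoded when the HTML is parsed and encoded again when the tree is serialized. The sketch below shows that round trip, including the encode: false option of raw_html/2. The sample markup and version requirement are assumptions.

```elixir
# Assumption: run as a standalone script; the version requirement is illustrative.
Mix.install([{:floki, "~> 0.35"}])

tree = Floki.parse_fragment!("<p>10 &gt; 5 &amp; 3 &lt; 4</p>")

# Text nodes hold the decoded characters
decoded = Floki.text(tree)

# raw_html/1 entity-encodes text by default
encoded = Floki.raw_html(tree)

# encode: false writes the text nodes back out as they are
unencoded = Floki.raw_html(tree, encode: false)

IO.inspect(decoded, label: "decoded")
IO.inspect(encoded, label: "encoded")
IO.inspect(unencoded, label: "unencoded")
```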

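Large Documents

Floki does not load HTML lazily: parse_document! builds the whole tree in memory. For large documents, parse once and reuse the resulting tree for every query instead of re-parsing the string:

html = File.read!("large.html")
tree = Floki.parse_document!(html)

# Both queries reuse the already-parsed tree
meta = Floki.find(tree, "meta")
head = Floki.find(tree, "head")

If parsing itself is the bottleneck, configuring an alternative parser backend such as fast_html may help.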

Scoping Searches

Floki has no separate search function. Floki.find searches the whole tree it is given, so narrow a query with a more specific selector:

Floki.find(tree, "meta") # every meta element in the document
Floki.find(tree, "head meta") # only meta elements inside head

Scope the selector whenever you can so that fewer unrelated nodes match.
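
Scoping can also be done in two steps, by calling Floki.find on a subtree that was selected earlier. A minimal sketch, assuming a small sample document; the markup, names and version requirement are illustrative.

```elixir
# Assumption: run as a standalone script; the version requirement is illustrative.
Mix.install([{:floki, "~> 0.35"}])

html = """
<html>
  <head><meta name="in-head"></head>
  <body><meta name="in-body"><p>Text</p></body>
</html>
"""

tree = Floki.parse_document!(html)

# Whole document: both meta elements match
all_meta = Floki.find(tree, "meta")

# Select the head first, then query only inside that subtree
head = Floki.find(tree, "head")
head_meta = Floki.find(head, "meta")

IO.inspect(Floki.attribute(all_meta, "name"), label: "all")
IO.inspect(Floki.attribute(head_meta, "name"), label: "head only")
```

Querying a saved subtree is handy when several lookups share the same container, since the container selector is written only once.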

LiveView Integration

Floki can parse and transform HTML on the server in a Phoenix LiveView before the result is rendered for the client:

def handle_info(%{topic: "new_html"}, socket) do
  html = ExternalApi.fetch_html()
  doc = Floki.parse_document!(html)

  # Manipulate doc

  # Serialize the tree back to a string and store it in the socket assigns
  html = Floki.raw_html(doc)
  {:noreply, assign(socket, :html, html)}
end

HTML to CSV/JSON

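Use Floki to extract data from HTML into other formats such as CSV or JSON:

# Assuming the csv package: CSV.encode/1 expects a list of rows, each a list of cell values
html
|> Floki.parse_document!()
|> Floki.find("table tr")
|> Enum.map(fn row -> row |> Floki.find("th, td") |> Enum.map(&Floki.text/1) end)
|> CSV.encode()
|> Enum.each(&IO.write/1)

# post_to_map/1 is a helper you define; JSON is built in from Elixir 1.18 (use Jason.encode!/1 before that)
html
|> Floki.parse_document!()
|> Floki.find("div.post")
|> Enum.map(&post_to_map/1)
|> JSON.encode!()
|> IO.write()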
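
The table-to-rows step is the part Floki handles; the encoding step belongs to whichever CSV or JSON library is in use. The sketch below isolates the Floki half and joins the cells naively, without the quoting or escaping a real CSV encoder would apply. The sample table and version requirement are assumptions.

```elixir
# Assumption: run as a standalone script; the version requirement is illustrative.
Mix.install([{:floki, "~> 0.35"}])

html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Ann</td><td>30</td></tr>
  <tr><td>Bob</td><td>41</td></tr>
</table>
"""

# One inner list per table row, one string per cell
rows =
  html
  |> Floki.parse_fragment!()
  |> Floki.find("tr")
  |> Enum.map(fn row ->
    row |> Floki.find("th, td") |> Enum.map(&Floki.text/1)
  end)

# Naive join for illustration only: no quoting or escaping of cell values
csv = Enum.map_join(rows, "\n", &Enum.join(&1, ","))

IO.puts(csv)
```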

Invalid HTML

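Floki's default parser, Mochiweb, tolerates most malformed markup but does not follow the HTML5 error-recovery rules. For browser-like handling of invalid HTML, configure Floki to use the html5ever or fast_html parser instead.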

Idempotent HTML

Sort attributes to normalize HTML so that repeated parse and serialize passes produce the same output:

Floki.find_and_update(doc, "div", fn {tag, attributes} ->
  {tag, Enum.sort(attributes)}
end)
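
Attribute sorting can be sketched with Floki.find_and_update/3, whose callback receives a {tag, attributes} tuple for each matching element. The markup and version requirement below are assumptions made for illustration.

```elixir
# Assumption: run as a standalone script; the version requirement is illustrative.
Mix.install([{:floki, "~> 0.35"}])

doc = Floki.parse_fragment!(~s(<div id="x" class="y" data-a="1">Hi</div>))

# The callback gets {tag, attributes} and returns the updated pair
normalized =
  Floki.find_and_update(doc, "div", fn {tag, attributes} ->
    {tag, Enum.sort(attributes)}
  end)

# Running the same update again changes nothing
again =
  Floki.find_and_update(normalized, "div", fn {tag, attributes} ->
    {tag, Enum.sort(attributes)}
  end)

IO.inspect(normalized)
```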
