The Ultimate Loofah Cheatsheet for Ruby

Nov 4, 2023 · 10 min read


Loofah is a Ruby library for parsing and manipulating HTML/XML documents. It provides a simple API for traversing, manipulating, and extracting data from markup. Some key features:

  • Built on top of Nokogiri, so it inherits Nokogiri's speed and Ruby idioms.
  • HTML/XML parsing and traversal.
  • XSS sanitization via Loofah::Scrubber.
  • Built-in scrubbers for stripping unwanted markup.
  • Integration with Rails ActionView helpers.

    Installation

    Install the loofah gem:

    gem install loofah

    Or in a Gemfile:

    gem 'loofah'

    Require in Ruby:

    require 'loofah'

    Parsing and Traversal

    Parse HTML/XML into a Loofah document:

    html = <<-HTML
      <h1>Hello world!</h1>
      <p>Welcome to my page.</p>
    HTML

    doc = Loofah.document(html)

    Traverse elements with Nokogiri methods:

    doc.css('h1')
    #=> [#<Nokogiri::XML::Element:0x3fc96a44b618 name="h1">]
'h1').text
    #=> "Hello world!"

    Find text nodes:

    doc.text
    #=> "Hello world!Welcome to my page."


    Modify the document:

'h1').content = "Welcome!"
    puts doc.to_html
    # <html>
    #   <body>
    #     <h1>Welcome!</h1>
    #     ...

    Add new nodes:

    new_para ='p', doc)
    new_para.content = "New paragraph"
'body').add_child(new_para)

    XSS Sanitization

    Loofah provides XSS sanitization via the Loofah::Scrubber class.

    Remove unwanted tags/attributes:

    html = "<script>alert('xss')</script><div>Test</div>"
    doc = Loofah.fragment(html)
    doc.scrub!(:prune)
    puts doc.to_html
    #=> "<div>Test</div>"

    Customize scrubbing behavior:

    class CustomScrubber < Loofah::Scrubber
      def scrub(node)
        node.remove if == 'script'
      end
    end

    doc = Loofah.fragment(html)
    puts doc.to_html
    #=> "<div>Test</div>"

    Built-in scrubbers (passed to scrub! as symbols):

  • :strip - Remove unknown/unsafe tags but keep their contents.
  • :prune - Remove unsafe tags together with their contents.
  • :escape - Escape unsafe tags so they render as plain text.
  • :whitewash - Remove all attributes and any non-standard markup.

    Integration with Rails

    Loofah integrates with Rails helpers:

    # In a Rails view
    @content = "<script>alert('xss')</script><b>Test</b>"
    sanitize @content, scrubber:
    #=> "<b>Test</b>" (script tag and its contents pruned)
    sanitize @content, tags: ['b', 'i']
    #=> "<b>Test</b>"

    Configure the Rails sanitizer:

    # In config/initializers/loofah.rb
    Rails.application.config.action_view.sanitized_allowed_tags = ['strong']
    Rails.application.config.action_view.sanitized_allowed_attributes = ['style']

    Rails' sanitize helpers already use Loofah under the hood via the rails-html-sanitizer gem (the default since Rails 4.2); the settings above control its allow-lists, so there is no sanitizer class to assign manually.


    Performance

    Avoid slow XPath expressions:

    # Slow
    doc.at_xpath('//body//div[2]//span')
    # Faster
'body div:nth-child(2) span')

    Benchmark scrubber performance:

    require 'benchmark'

    html = "<script>bad()</script><div>ok</div>" * 100
    n = 1000
 do |x|
      # Re-parse each time, because scrub! mutates the document in place'strip') { n.times { Loofah.fragment(html).scrub!(:strip) } }'prune') { n.times { Loofah.fragment(html).scrub!(:prune) } }
    end

    Web Scraping

    Extract text from HTML:

    doc.text # All text
    doc.css('h1').map(&:text) # Headings
'//p').map(&:text).join(". ") # Paragraphs

    Extract links:

    doc.css('a').map { |a| {href: a['href'], text: a.text} }


    Testing

    Test scrubbers with RSpec:

    RSpec.describe MyScrubber do
      it 'scrubs scripts' do
        html = '<script>alert(1)</script>'
        doc = Loofah.fragment(html)
        doc.scrub!(
        expect(doc.to_html).not_to include('<script>')
      end
    end

    Assert correct scrubbing with Minitest:

    class LoofahTest < Minitest::Test
      def test_scrub_xss
        html = '<script>alert(1)</script><div></div>'
        doc = Loofah.fragment(html)
        doc.scrub!(:prune)
        assert_equal '<div></div>', doc.to_html
      end
    end

    JavaScript Frameworks

    Loofah is a Ruby gem with no JavaScript port, so it cannot be imported into React, Vue, or Angular code. Scrub HTML on the Ruby server and send the cleaned string to the front end; if you must sanitize in the browser, use a JavaScript library such as DOMPurify instead.

    Sanitize on the server before rendering (controller and model names are illustrative):

    # Rails controller
    def show
      raw = Article.find(params[:id]).body_html
      @clean_html = Loofah.fragment(raw).scrub!(:prune).to_html
    end

    Render the pre-scrubbed HTML in a React component:

    // Component -- cleanHtml arrives already scrubbed by the Ruby backend
    function MyComponent({cleanHtml}) {
      return <div dangerouslySetInnerHTML={{__html: cleanHtml}} />
    }

    Vue (v-html) and Angular ([innerHtml]) bindings work the same way: bind only strings that were scrubbed server-side.

    Debugging Issues

    Handle encoding errors:

    doc = Loofah.document(html.force_encoding('UTF-8'))

    Gracefully parse malformed HTML:

    doc = Loofah.document(bad_html)
    doc.errors # Inspect parse errors; Nokogiri repairs the markup automatically as it parses

    Clone before scrubbing to avoid side effects:

    node ='p').dup

    Advanced Nokogiri

    Namespaced XML:

'//x:node', 'x' => 'namespace')

    Optimize XPath with CSS selectors:

'div#content [class="text"]')

    NodeSet manipulation:

    nodes = doc.css('p.note')
    nodes.each { |n| ... }

    Scraping Frameworks

    Loofah only runs under Ruby, so it cannot be imported into Python frameworks such as Scrapy (use a Python sanitizer like bleach there). In a Ruby scraping framework like Kimurai, scrub the response before extracting data (spider attributes abbreviated):

    class MySpider < Kimurai::Base
      @name = "my_spider"
      @engine = :mechanize
      @start_urls = ["https://example.com"]

      def parse(response, url:, data: {})
        clean = Loofah.fragment(response.to_html).scrub!(:prune)
        # ... scrape clean with css/xpath as usual
      end
    end

    Immutable Documents

    Clone before manipulating:

    doc2 = doc.clone
    doc2.css('img').remove

    Loofah has no built-in immutable document type, so keep the original untouched and scrub a duplicate when you need both:

    original = Loofah.document(some_html)
    scrubbed = original.dup.scrub!(:prune)
    original.to_html # unchanged

    Scrubbing Request Parameters

    Scrub user-supplied HTML params before they reach your models (the param names are illustrative):

    # ApplicationController
    before_action :scrub_html_params

    def scrub_html_params
      if params.dig(:comment, :body)
        params[:comment][:body] = Loofah.fragment(params[:comment][:body]).scrub!(:prune).to_s
      end
    end

    Efficient Manipulation

    Modify multiple nodes:

'//img').each do |img|
      img['src'] = '/placeholder.jpg'
    end

    Remove nodesets:

    articles = doc.css('article')
    articles.remove


    Handle encoding issues:

    doc = Loofah.document(html_string.force_encoding('UTF-8'))

    Fix malformed HTML:

    html = <<-HTML
      <p>Unclosed paragraph
    HTML

    doc = Loofah.document(html)
    doc.errors # Nokogiri closes the tag during parsing; inspect what was fixed

    Avoid unintended node changes:

    node ='h1')
    node = node.dup # Scrub a copy

    Advanced Selectors

    Grouped CSS selectors:

    doc.css('div.note, span.alert')

    Pseudo-class selectors:

    doc.css('div:not(.ignore)') # Negation
'//li[contains(text(), "hello")]') # Text contains (XPath; the CSS :contains selector is jQuery-only)

    Namespaced XML:

'//x:node', 'x' => 'namespace')



    Handling Complex HTML Structures

    Loofah is versatile and can handle complex HTML structures. For example, you can easily navigate and manipulate deeply nested elements:

    html = <<-HTML
      <div>
        <section>
          <article>
            <h1>Article Title</h1>
            <p>Content goes here.</p>
          </article>
        </section>
      </div>
    HTML

    doc = Loofah.document(html)

    # Access deeply nested elements
    article_title ='div > section > article > h1').text

    Optimizing Performance

    To optimize performance, avoid using slow XPath expressions and prefer CSS selectors when possible:

    # Slow XPath expression
    slow_node = doc.at_xpath('//body//div[2]//span')
    # Faster CSS selector equivalent
    fast_node ='body div:nth-child(2) span')

    Security Best Practices

    Handling Security Vulnerabilities

    Loofah helps mitigate security vulnerabilities like Cross-Site Scripting (XSS) attacks. Here's an example of using a custom scrubber to remove potentially harmful script tags:

    class CustomScrubber < Loofah::Scrubber
      def scrub(node)
        node.remove if == 'script'
      end
    end

    html_with_xss = "<script>alert('xss')</script><div>Safe content</div>"
    doc = Loofah.fragment(html_with_xss)
    cleaned_html = doc.to_html
    # Result: "<div>Safe content</div>"

    Integration with Other Libraries

    Integrating Loofah with Nokogiri

    Loofah is built on top of Nokogiri, so you can use Nokogiri methods for parsing and manipulation:

    require 'nokogiri'

    html = <<-HTML
      <p>Hello, <strong>world!</strong></p>
    HTML

    nokogiri_doc = Nokogiri::HTML(html)

    # Use Nokogiri methods to traverse and manipulate
    strong_text ='strong').text


    Loofah vs. Sanitize

    Loofah and the Sanitize gem both offer HTML sanitization, but they have different approaches. Loofah allows fine-grained control with custom scrubbers, while Sanitize provides a simpler, rules-based approach:

    # Using Sanitize for XSS sanitization
    require 'sanitize'
    html_with_xss = "<script>alert('xss')</script><div>Safe content</div>"
    sanitized_html = Sanitize.fragment(html_with_xss)
    # Result: "<div>Safe content</div>"


    FAQs

    Q: How does Loofah handle different character encodings?

    A: Loofah can handle various character encodings. Ensure you set the correct encoding using force_encoding('UTF-8') for your HTML string to avoid encoding issues:

    html_string = "HTML content"
    doc = Loofah.document(html_string.force_encoding('UTF-8'))

    Q: Does Loofah support HTML5?

    A: Yes, Loofah supports HTML5. It utilizes Nokogiri, which is capable of parsing and manipulating HTML5 documents.

    Q: How can I sanitize mixed content (safe and unsafe) with Loofah?

    A: Customize Loofah's behavior by creating custom scrubbers. For instance, remove script tags while keeping safe content:

    class CustomScrubber < Loofah::Scrubber
      def scrub(node)
        node.remove if == 'script'
      end
    end

    html = "<script>alert('xss')</script><div>Safe content</div>"
    doc = Loofah.fragment(html)
    cleaned_html = doc.to_html
    #=> "<div>Safe content</div>"

    Q: Can I customize Loofah's sanitization for specific tags or attributes?

    A: Yes, create custom scrubbers to define rules. For example, allow 'href' attributes for links while removing other tags:

    class CustomScrubber < Loofah::Scrubber
      def scrub(node)
        if == 'a'
          node['href'] = node['href'].strip if node['href']
        elsif == 'script'
          node.remove
        end
      end
    end

    html = "<a href='<>' target='_blank'>Visit Example</a><script>alert('xss')</script>"
    doc = Loofah.fragment(html)
    cleaned_html = doc.to_html

    Q: Is Loofah vulnerable to security issues?

    A: Loofah is designed to mitigate security vulnerabilities, including XSS attacks. Keep Loofah and its dependencies up-to-date to stay protected. Monitor the Loofah GitHub repository and RubyGems for updates and advisories.

    Q: What are the benefits of using Nokogiri over Loofah, and vice versa?

    A: Nokogiri

  • Fine-Grained Control: Offers granular control over HTML/XML parsing for custom solutions.
  • Versatility: Supports both HTML and XML parsing.
  • Performance: Known for efficient parsing and speed.
  • Extensive Ecosystem: Large community with ample documentation and plugins.
  • Low-Level Manipulation: Ideal for custom data extraction and manipulation.

    Loofah


  • HTML Sanitization: Specializes in HTML sanitization, particularly for XSS mitigation.
  • Simplicity: Simplifies HTML content sanitization.
  • Rails Integration: Seamlessly integrates with Ruby on Rails.
  • Higher-Level Abstraction: Provides a higher-level abstraction for ease of use.
  • Built-In Rules: Includes ready-made rules for common sanitization tasks.

    The choice between Nokogiri and Loofah depends on your project's specific needs. Nokogiri is versatile and performance-oriented, while Loofah excels at HTML sanitization and simplicity.


    Handling Malformed HTML

    If you have malformed HTML, Loofah (via Nokogiri) repairs it automatically while parsing; there is no separate repair step. Inspect doc.errors to see what was corrected:

    malformed_html = <<-HTML
      <div>
        <p>Unclosed div
    HTML

    doc = Loofah.document(malformed_html)
    doc.errors # What the parser had to fix
    doc.to_html # Tags are closed in the output; the document can be used safely

    Cross-Framework Integration

    Using Loofah with Sinatra

    Integrating Loofah with Sinatra is straightforward, similar to using it with Rails:

    require 'sinatra'
    require 'loofah'

    before do
      @content = "<script>alert('xss')</script>Some content"
      @cleaned_content = Loofah.fragment(@content).scrub!(:prune).to_html
    end

    get '/' do
      erb :index
    end

    References and Further Reading

    Here are some references and further reading materials for in-depth knowledge:

  • Loofah GitHub Repo
  • Loofah Documentation
  • Nokogiri Tutorial