The Ultimate Loofah Cheatsheet for Ruby

Nov 4, 2023 · 10 min read

Overview

Loofah is a Ruby library for manipulating and sanitizing HTML/XML documents and fragments. It wraps Nokogiri, so Nokogiri's API for traversing, modifying, and extracting data from markup is available on every Loofah document. Some key features:

  • Built on top of Nokogiri, so it inherits Nokogiri's speed and Ruby idioms.
  • HTML/XML parsing and traversal.
  • XSS sanitization via Loofah::Scrubber.
  • Built-in scrubbers for stripping unwanted markup.
  • Integration with Rails ActionView helpers.

    Installation

    Install the loofah gem:

    gem install loofah
    

    Or in a Gemfile:

    gem 'loofah'
    

    Require in Ruby:

    require 'loofah'
    

    Parsing and Traversal

    Parse HTML/XML into a Loofah document:

    html = <<-HTML
      <html>
        <body>
          <h1>Hello world!</h1>
          <p>Welcome to my page.</p>
        </body>
      </html>
    HTML
    
    doc = Loofah.document(html)
    

    Traverse elements with Nokogiri methods:

    doc.css('h1')
    #=> [#<Nokogiri::XML::Element:0x3fc96a44b618 name="h1">]
    
    doc.at('h1').text
    #=> "Hello world!"
    

    Extract all text:

    doc.text.split.join(' ') # collapse the whitespace between elements
    #=> "Hello world! Welcome to my page."
    

    Manipulation

    Modify the document:

    doc.at('h1').content = "Welcome!"
    
    puts doc.to_html
    # <html>
    #   <body>
    #     <h1>Welcome!</h1>
    #     ...
    

    Add new nodes:

    new_para = Nokogiri::XML::Node.new("p", doc)
    new_para.content = "New paragraph"
    
    doc.at('body').add_child(new_para)
    

    XSS Sanitization

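    Loofah sanitizes markup by running a scrubber over every node. A scrubber is either a built-in one named by a symbol (such as :prune) or an instance of Loofah::Scrubber. A bare Loofah::Scrubber.new has no scrub logic and raises an error when used.

    Remove unsafe tags/attributes:

    html = "<script>alert('xss')</script><div>Test</div>"
    
    # A fragment serializes without <html>/<body> wrappers
    doc = Loofah.fragment(html)
    doc.scrub!(:prune) # removes unsafe tags along with their contents
    
    puts doc.to_s
    # <div>Test</div>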
    

    Customize scrubbing behavior:

    class CustomScrubber < Loofah::Scrubber
      def scrub(node)
        return CONTINUE unless node.name == 'script'
    
        node.remove
        STOP # nothing left to traverse below a removed node
      end
    end
    
    doc = Loofah.fragment(html)
    doc.scrub!(CustomScrubber.new)
    puts doc.to_s
    # <div>Test</div>
    

    Built-in scrubbers:

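  • :strip (Loofah::Scrubbers::Strip) - Remove unsafe tags but keep their contents.
  • :prune (Loofah::Scrubbers::Prune) - Remove unsafe tags together with their contents.
  • :escape (Loofah::Scrubbers::Escape) - Escape unsafe tags so they show up as text.
  • :whitewash (Loofah::Scrubbers::Whitewash) - Remove unsafe tags and strip all attributes.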

    Integration with Rails

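    The Rails sanitize helper is backed by the rails-html-sanitizer gem, which is built on Loofah, so it accepts Loofah scrubbers:

    # In a Rails view
    
    @content = "<script>alert('xss')</script><b>Test</b>"
    
    # Pass any Loofah::Scrubber instance with the scrubber: option
    sanitize @content, scrubber: Loofah::Scrubbers::Prune.new
    #=> "<b>Test</b>"
    
    # Or restrict the allowed tags
    sanitize @content, tags: ['b', 'i']
    # <b> is kept; the <script> tag is stripped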
    

    Configure the Rails sanitizer defaults:

    # In config/initializers/sanitizer.rb
    
    Rails.application.config.action_view.sanitized_allowed_tags = ['strong']
    Rails.application.config.action_view.sanitized_allowed_attributes = ['style']
    
    # No extra wiring is needed: rails-html-sanitizer already uses Loofah
    

    The sanitize helper now applies these defaults, with Loofah doing the scrubbing underneath.

    Performance

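    Prefer specific queries over broad descendant searches. Nokogiri translates CSS selectors into XPath, so choosing CSS is a readability decision rather than a speedup:

    # Slow: each // scans every descendant
    doc.at_xpath('//body//div[2]//span')
    
    # Usually faster when the structure is known: direct children only
    doc.at_xpath('/html/body/div[2]/span')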
    

    Benchmark scrubber performance:

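    require 'benchmark'
    require 'loofah'
    
    html = '<div><script>alert(1)</script><p onclick="x()">Hi</p></div>' * 100
    n = 1000
    Benchmark.bm(10) do |x|
      # Parse inside the loop so every run scrubs a fresh fragment
      x.report('strip')     { n.times { Loofah.fragment(html).scrub!(:strip) } }
      x.report('whitewash') { n.times { Loofah.fragment(html).scrub!(:whitewash) } }
    end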
    

    Web Scraping

    Extract text from HTML:

    doc.text # All text
    
    doc.css('h1').map(&:text) # Headings
    
    doc.search('//p').map(&:text).join(". ") # Paragraphs
    

    Extract links:

    doc.css('a').map { |a| {href: a['href'], text: a.text} }
    

    Testing

    Test scrubbers with RSpec:

    RSpec.describe MyScrubber do
      it 'scrubs scripts' do
        html = '<script>alert(1)</script>'
        doc = Loofah.document(html)
    
        scrubber = MyScrubber.new
        doc.scrub!(scrubber)
    
        expect(doc.to_html).not_to include('<script>')
      end
    end
    

    Assert correct scrubbing with Minitest:

    class LoofahTest < Minitest::Test
      def test_scrub_xss
        html = '<div><script>alert(1)</script></div>'
        doc = Loofah.fragment(html)
    
        doc.scrub!(:prune)
    
        assert_equal '<div></div>', doc.to_s
      end
    end
    

    JavaScript Frameworks

    Loofah is a Ruby gem with no JavaScript port, so it cannot be imported into React, Vue, or Angular code. Scrub the HTML in the Ruby backend and pass the cleaned markup to the client.

    React

    Render HTML that the server has already scrubbed:

    // Component
    // cleanHtml must already be scrubbed with Loofah on the server
    function MyComponent({cleanHtml}) {
      return <div dangerouslySetInnerHTML={{__html: cleanHtml}} />
    }
    

    Vue

    Bind server-scrubbed HTML with v-html:

    <!-- MyComponent.vue -->
    <template>
      <div v-html="cleanHtml"></div>
    </template>
    

    Angular

    Bind server-scrubbed HTML in the template. Angular also sanitizes values bound to innerHTML:

    <!-- template.html -->
    <div [innerHTML]="cleanHtml"></div>
    

    Debugging Issues

    Handle encoding errors:

    doc = Loofah.document(html.force_encoding('UTF-8'))
    

    Gracefully parse malformed HTML:

    doc = Loofah.document(bad_html)
    doc.errors # Inspect the parser's syntax errors
    
    doc.to_html # Already repaired: the parser closes unbalanced tags itself
    

    Clone before scrubbing to avoid side effects:

    node = doc.at('p').dup
    scrubber.scrub(node)
    

    Advanced Nokogiri

    Namespaced XML:

    doc.search('//x:node', {'x' => 'http://name.space'})
    

    Replace verbose XPath with CSS selectors:

    doc.at('div#content [class="text"]')
    

    NodeSet manipulation:

    nodes = doc.css('p.note')
    nodes.each { |n| puts n.text }
    nodes.remove
    

    Scraping Frameworks

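    Scrapy

    Scrapy is a Python framework and Loofah has no Python port, so scrub Scrapy's output in a separate Ruby step. A sketch, assuming the spider exported JSON Lines (scrapy crawl myspider -o items.jl) and that each item has an "html" field; both names are hypothetical:

    require 'json'
    require 'loofah'
    
    File.foreach('items.jl') do |line|
      item = JSON.parse(line)
      # "html" is an assumed field name; use whatever your spider emits
      item['html'] = Loofah.scrub_fragment(item['html'], :prune).to_s
      puts JSON.generate(item)
    end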
    

    Kimurai

    class MySpider < Kimurai::Base
      def parse(response, url:, data: {})
        # response is a plain Nokogiri document, so re-parse its HTML with Loofah
        doc = Loofah.document(response.to_html).scrub!(:prune)
        # ... scrape doc
      end
    end
    

    Immutable Documents

    Clone before manipulating:

    doc2 = doc.clone
    doc2.at('img').remove
    

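    Loofah does not ship an immutable document class. To leave the input untouched, use the non-destructive helpers, which parse a fresh copy on each call:

    clean = Loofah.scrub_fragment(some_html, :prune).to_s
    
    some_html # Unchanged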
    

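    Controller Integration

    # ApplicationController
    
    before_action :scrub_html
    
    def scrub_html
      # :body is a hypothetical parameter name; scrub each HTML field you accept
      return unless params[:body].is_a?(String)
    
      params[:body] = Loofah.scrub_fragment(params[:body], :prune).to_s
    end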
    

    Efficient Manipulation

    Modify multiple nodes:

    doc.search('//img').each do |img|
      img['src'] = '/placeholder.jpg'
    end
    

    Remove nodesets:

    articles = doc.css('article')
    articles.remove
    

    Debugging

    Handle encoding issues:

    doc = Loofah.document(html_string.force_encoding('UTF-8'))
    

    Fix malformed HTML:

    html = <<-HTML
    <div>
      <span>Hello
    </div>
    HTML
    
    doc = Loofah.document(html)
    puts doc.to_html # the parser closes the <span> itself; there is no separate repair step
    

    Avoid unintended node changes:

    node = doc.at('h1')
    node = node.dup # Scrub a copy
    
    scrubber.scrub(node)
    

    Advanced Selectors

    Grouped CSS selectors:

    doc.css('div.note, span.alert')
    

    jQuery selectors:

    doc.css('div:not(.ignore)') # Negation
    doc.css('li:contains("hello")') # Text contains
    

    Namespaced XML:

    doc.search('//x:node', 'x' => 'namespace')
    

    Handling Complex HTML Structures

    Because every Loofah document is also a Nokogiri document, CSS child combinators reach deeply nested elements directly:

    html = <<-HTML
      <div>
        <section>
          <article>
            <h1>Article Title</h1>
            <p>Content goes here.</p>
          </article>
        </section>
      </div>
    HTML
    
    doc = Loofah.document(html)
    
    # Access deeply nested elements
    article_title = doc.at('div > section > article > h1').text
    

    Optimizing Performance

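    To optimize performance, keep queries specific. Nokogiri compiles CSS selectors to XPath internally, so the gain comes from narrowing the search, not from the selector syntax:

    # Slow: each // scans every descendant
    slow_node = doc.at_xpath('//body//div[2]//span')
    
    # Usually faster when the structure is known: direct children only
    fast_node = doc.at_xpath('/html/body/div[2]/span')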
    

    Security Best Practices

    Handling Security Vulnerabilities

    Loofah helps mitigate security vulnerabilities like Cross-Site Scripting (XSS) attacks. Here's an example of using a custom scrubber to remove potentially harmful script tags:

    class CustomScrubber < Loofah::Scrubber
      def scrub(node)
        return CONTINUE unless node.name == 'script'
    
        node.remove
        STOP # nothing left to traverse below a removed node
      end
    end
    
    html_with_xss = "<script>alert('xss')</script><div>Safe content</div>"
    doc = Loofah.fragment(html_with_xss) # a fragment serializes without <html>/<body> wrappers
    doc.scrub!(CustomScrubber.new)
    
    cleaned_html = doc.to_s
    # Result: "<div>Safe content</div>"
    

    Integration with Other Libraries

    Integrating Loofah with Nokogiri

    Loofah is built on top of Nokogiri, so you can use Nokogiri methods for parsing and manipulation:

    require 'nokogiri'
    
    html = <<-HTML
      <div>
        <p>Hello, <strong>world!</strong></p>
      </div>
    HTML
    
    nokogiri_doc = Nokogiri::HTML(html)
    
    # Use Nokogiri methods to traverse and manipulate
    strong_text = nokogiri_doc.at('strong').text
    

    Comparisons

    Loofah vs. Sanitize

    Loofah and the Sanitize gem both offer HTML sanitization, but they have different approaches. Loofah allows fine-grained control with custom scrubbers, while Sanitize provides a simpler, rules-based approach:

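    # Using Sanitize for XSS sanitization
    require 'sanitize'
    
    html_with_xss = "<script>alert('xss')</script><div>Safe content</div>"
    # The default config strips every tag; RELAXED keeps common ones such as <div>
    sanitized_html = Sanitize.fragment(html_with_xss, Sanitize::Config::RELAXED)
    
    # Result with current Sanitize releases: "<div>Safe content</div>"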
    

    FAQs

    Q: How does Loofah handle different character encodings?

    A: Loofah works on Ruby strings, so a string's encoding label must match its bytes. force_encoding('UTF-8') only relabels the string without converting it, so use it when the bytes are already UTF-8 but carry the wrong label:

    html_string = "HTML content"
    doc = Loofah.document(html_string.force_encoding('UTF-8'))
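
    When the bytes are not UTF-8 to begin with, relabeling is not enough and the string has to be transcoded. The sketch below uses only Ruby's standard String methods and assumes, for illustration, that the input is ISO-8859-1:

```ruby
# Bytes that are really ISO-8859-1 ("café" with a single-byte é)
latin1 = "caf\xE9".b

# Relabeling as UTF-8 converts nothing, so the string is invalid
relabeled = latin1.dup.force_encoding('UTF-8')
puts relabeled.valid_encoding? # false

# Declare the real source encoding, then transcode to UTF-8
converted = latin1.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts converted                 # café
puts converted.valid_encoding? # true

# When the source encoding is unknown, String#scrub replaces invalid bytes
puts relabeled.scrub('?')      # caf?
```

    A string that passes valid_encoding? can then be handed to Loofah.document or Loofah.fragment.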
    

    Q: Does Loofah support HTML5?

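    A: Yes, with a caveat. Loofah.document and Loofah.fragment use Nokogiri's HTML4 parser (libxml2). Recent Loofah releases (2.21 and later) add Loofah.html5_document and Loofah.html5_fragment, which use Nokogiri's HTML5 parser on platforms where it is available.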

    Q: How can I sanitize mixed content (safe and unsafe) with Loofah?

    A: Customize Loofah's behavior by creating custom scrubbers. For instance, remove script tags while keeping safe content:

    class CustomScrubber < Loofah::Scrubber
      def scrub(node)
        node.remove if node.name == 'script'
      end
    end
    
    html = "<script>alert('xss')</script><div>Safe content</div>"
    doc = Loofah.document(html)
    doc.scrub!(CustomScrubber.new)
    
    cleaned_html = doc.to_html
    

    Q: Can I customize Loofah's sanitization for specific tags or attributes?

    A: Yes, create custom scrubbers to define rules. For example, allow 'href' attributes for links while removing other tags:

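    class CustomScrubber < Loofah::Scrubber
      def scrub(node)
        return CONTINUE if node.text? # keep text content
    
        if node.name == 'a'
          # keep only the href attribute on links
          node.keys.each { |name| node.delete(name) unless name == 'href' }
          node['href'] = node['href'].strip if node['href']
          CONTINUE
        else
          node.remove
          STOP
        end
      end
    end
    
    html = "<a href='https://example.com' target='_blank'>Visit Example</a><script>alert('xss')</script>"
    doc = Loofah.fragment(html) # with a full document, the <html> root itself would be removed
    doc.scrub!(CustomScrubber.new)
    
    cleaned_html = doc.to_s
    # <a href="https://example.com">Visit Example</a>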
    

    Q: Is Loofah vulnerable to security issues?

    A: Loofah is designed to mitigate security vulnerabilities, including XSS attacks. Keep Loofah and its dependencies up-to-date to stay protected. Monitor the Loofah GitHub repository and RubyGems for updates and advisories.

    Q: What are the benefits of using Nokogiri over Loofah, and vice versa?

    A: Nokogiri

  • Fine-Grained Control: Offers granular control over HTML/XML parsing for custom solutions.
  • Versatility: Supports both HTML and XML parsing.
  • Performance: Known for efficient parsing and speed.
  • Extensive Ecosystem: Large community with ample documentation and plugins.
  • Low-Level Manipulation: Ideal for custom data extraction and manipulation.

    Loofah

  • HTML Sanitization: Specializes in HTML sanitization, particularly for XSS mitigation.
  • Simplicity: Simplifies HTML content sanitization.
  • Rails Integration: Seamlessly integrates with Ruby on Rails.
  • Higher-Level Abstraction: Provides a higher-level abstraction for ease of use.
  • Built-In Rules: Includes ready-made rules for common sanitization tasks.

    The choice between Nokogiri and Loofah depends on your project's specific needs. Nokogiri is versatile and performance-oriented, while Loofah excels at HTML sanitization and simplicity.

    Troubleshooting

    Handling Malformed HTML

    Loofah has no repair! method and does not need one: Nokogiri's parser balances malformed HTML while parsing. Inspect doc.errors to see what it had to recover from:

    malformed_html = <<-HTML
      <div>
        <p>Unclosed paragraph
      </div>
    HTML
    
    doc = Loofah.document(malformed_html)
    doc.errors # syntax errors the parser recovered from, if any
    
    # The parsed tree is already balanced and can be used safely
    

    Cross-Framework Integration

    Using Loofah with Sinatra

    Integrating Loofah with Sinatra is straightforward, similar to using it with Rails:

    require 'sinatra'
    require 'loofah'
    
    before do
      @content = "<script>alert('xss')</script>Some content"
      @cleaned_content = Loofah.scrub_fragment(@content, :prune).to_s
    end
    
    get '/' do
      erb :index
    end
    
