The Ultimate Cheerio Web Scraping Cheat Sheet

Oct 31, 2023 · 4 min read

Cheerio is a fast, flexible library for parsing and manipulating HTML in Node.js, modeled on jQuery's API and widely used for web scraping. This cheat sheet provides a comprehensive reference to its syntax and capabilities.

Capabilities Covered

  • Installation
  • Loading HTML
  • Selectors
  • DOM Traversal
  • DOM Manipulation
  • Information
  • Looping
  • Output
  • Plugins
  • Debugging
  • Rate Limiting
  • Caching
  • Best Practices
  • Real World Examples

    Installation

    Install via npm:

    npm install cheerio
    

    Or Yarn:

    yarn add cheerio
    

    Loading HTML

    Load markup into Cheerio for parsing:

    From String:

    const cheerio = require('cheerio');
    const $ = cheerio.load('<h2 class="title">Hello</h2>');
    

    From File:

    const fs = require('fs');
    const $ = cheerio.load(fs.readFileSync('index.html'));
    

    From URL:

    const axios = require('axios');
    const resp = await axios.get('https://example.com');
    const $ = cheerio.load(resp.data);
    

    From JSON:

    const data = {foo: 'bar'};
    // Note: Cheerio parses markup, so a JSON string is loaded as plain text
    const $ = cheerio.load(JSON.stringify(data));
    

    Selectors

    Query DOM elements using CSS selector syntax:

    IDs:

    $('#my-id');
    

    Classes:

    $('.my-class');
    

    Tags:

    $('ul'); // <ul>
    $('li'); // <li>
    

    Attributes:

    $('a[target=_blank]');
    

    Multiple Classes:

    $('.class1.class2');
    

    Wildcards:

    $('*'); // All elements
    

    Chained:

    $('.outer').find('.inner');
    

    Pseudo Selectors:

    $('a:first');
    $('div:last');
    $('li:nth-child(3)');
    $('a:contains("text")');
    

    DOM Traversal

    Navigate between nodes:

    Parents:

    $('.child').parent();
    

    Children:

    $('.parent').children();
    

    Siblings:

    $('.first-child').next();
    $('.last-child').prev();
    

    Filtering:

    $('.parent').filter('.special').text();
    

    Traverse Up:

    $('.child').closest('.ancestor');
    $('.child').parentsUntil('.grandparent');
    

    Traverse Down:

    $('.parent').find('.child');
    

    DOM Manipulation

    Modify elements and content:

    Set Text:

    $('h1').text('New Text');
    

    Set HTML:

    $('button').html('<b>Save</b>');
    

    Add Class:

    $('.box').addClass('blue');
    

    Remove Class:

    $('.box').removeClass('blue');
    

    Toggle Class:

    $('.box').toggleClass('highlighted');
    

    Set Attributes:

    $('input[type="text"]').attr('name', 'username');
    

    Append:

    $('ul').append('<li class="new">New</li>');
    

    Prepend:

    $('ul').prepend('<li class="new">New</li>');
    

    Before:

    $('li.third').before('<li class="second">Second</li>');
    

    After:

    $('li.third').after('<li class="fourth">Fourth</li>');
    

    Remove:

    $('.deleted').remove();
    

    Wrap Inner:

    $('.message').wrapInner('<b></b>');
    

    Unwrap:

    $('b').unwrap();
    

    Information

    Extract info from elements:

    Text:

    $('h1').text();
    

    HTML:

    $('div').html();
    

    Value:

    $('input[name=first_name]').val();
    

    Attribute:

    $('a').attr('href');
    

    Data Attribute:

    $('.user').data('id');
    

    Looping

    Iterate through elements:

    Each:

    $('li').each((i, el) => {
      // element logic
    });
    

    Map:

    const urls = $('li a').map((i, el) => $(el).attr('href')).get();
    

    Reduce:

    // Cheerio selections have no .reduce() of their own; convert to an array first
    const total = $('.product').toArray().reduce((sum, el) => {
      const price = $(el).data('price');
      return sum + price;
    }, 0);
    

    Filter:

    const special = $('.product').filter((i, el) => {
      return $(el).data('special');
    }).get();
    

    Output

    Render final output:

    Full HTML:

    $.html();
    

    Outer HTML:

    $.html($('.box')); // .html() alone returns only the inner HTML
    

    Text:

    $('.message').text();
    

    JSON:

    JSON.stringify($('.box').map((i, el) => {
      // map to object
    }).get());
    

    Save File:

    fs.writeFileSync('page.html', $.html());
    

    HTTP Response:

    res.send($.html());
    

    Plugins

    Extend functionality:

    Images:

    const images = require('cheerio-image-loader')
    
    images($, '.product img')
      .then(/* ... */)
    

    Videos:

    const videos = require('cheerio-video')
    
    videos($).attr('src', 'https://example.com/trailer.mp4')
    

    SVG:

    const svg = require('cheerio-svg-parser')
    
    svg.parse($.html()).svg() // SVG DOM
    

    Debugging

    Log and inspect output:

    Elements:

    console.log($('.item'));
    

    HTML:

    console.log($.html());
    

    JSON:

    console.log(JSON.stringify($('.item').map((i, el) => {
      return $(el).text();
    }).get()));
    

    Node REPL:

    const repl = require('repl');
    repl.start('> ').context.$_ = $;
    

    Rate Limiting

    Control request speed:

    Simple Delay:

    await new Promise(resolve => setTimeout(resolve, 1000));
    

    Queue:

    const PQueue = require('p-queue').default; // p-queue v6; newer versions are ESM-only
    const queue = new PQueue({ concurrency: 2 });
    
    queue.add(() => {
      // Request code
    })
    

    Bottleneck:

    const Bottleneck = require('bottleneck');
    const limiter = new Bottleneck({
      minTime: 1000
    });
    
    limiter.schedule(() => {
      // Request code
    });
    

    Caching

    Save responses:

    In-Memory:

    let cache = {};
    
    const url = 'https://example.com';
    if (cache[url]) {
      return cache[url];
    } else {
      const resp = await fetch(url);
      const body = await resp.text(); // cache the body, not the Response object
      cache[url] = body;
      return body;
    }
    

    Redis:

    const redis = require('redis');
    const client = redis.createClient();
    await client.connect(); // required in node-redis v4+
    
    const key = `cache:${url}`;
    const cached = await client.get(key);
    
    if (cached) {
      return cached;
    } else {
      const resp = await fetch(url);
      const body = await resp.text();
      await client.set(key, body, { EX: 3600 }); // expire after 1 hour
      return body;
    }
    

    Best Practices

    Tips for effective web scraping:

  • Use CSS/DOM selectors over regex for parsing HTML
  • Validate schemas for consistency
  • Rotate proxies/headers to prevent blocking
  • Cache duplicate requests
  • Limit request rate to avoid flooding servers
  • Use asynchronous logic to maximize throughput
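    Several of these tips compose naturally. The sketch below combines an in-memory cache with a fixed delay between live requests; `politeGet` and `fetchPage` are hypothetical names, and in real code you would pass `fetch` or an axios wrapper in as the fetcher.

```javascript
// Minimal sketch: cache responses and rate-limit live requests.
// `fetchPage` is a stand-in for a real HTTP call (fetch, axios, ...).
const cache = new Map();
const DELAY_MS = 1000;

async function politeGet(url, fetchPage, delayMs = DELAY_MS) {
  if (cache.has(url)) return cache.get(url);      // cached: no request made
  await new Promise(r => setTimeout(r, delayMs)); // rate limit live requests
  const body = await fetchPage(url);
  cache.set(url, body);
  return body;
}
```

    Because the cache is checked first, repeated URLs cost nothing and the delay applies only to real network requests.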

    Real World Examples

    Common use cases:

  • Scrape pricing data from ecommerce sites
  • Build aggregated feeds from multiple news sources
  • Compile research datasets from public websites
  • Monitor website changes for broken link checking
  • Archive old versions of web pages for historical records
  • Extract structured data from HTML tables
  • Populate headless CMS with imported content
  • Run SEO audits by extracting on-page content
  • Train ML classifiers on HTML data
  • Process files of markup for analysis
    And that covers the full range of Cheerio's syntax and capabilities. With this handy reference, you can scrape the web more effectively!
