Web Scraping New York Times News Headlines with Node.js

Dec 6, 2023 · 6 min read

Web scraping is the process of extracting data from websites automatically. In this article, we'll walk through code that scrapes article titles and links from the New York Times homepage using Node.js modules like request and cheerio.

Why Scrape the New York Times?

The New York Times publishes tons of high-quality content every day on topics like news, opinion, arts, living, and more. Scraping the site allows you to extract and store this content to power other applications. For example, you could:

  • Build a daily digest by scraping the latest headlines
  • Create a local search engine of NYT content
  • Analyze article sentiment over time
  • Archive interesting articles to read later

The possibilities are vast once you have structured data from a site like The Times!

    Step 1: Import Needed Modules

    Let's walk through the code section-by-section. First we import the modules we'll need:

    const request = require('request'); // for sending HTTP requests
    const cheerio = require('cheerio'); // for selecting/parsing HTML
    const fs = require('fs'); // for writing to the filesystem
    

    We use the request module to send the HTTP request to fetch the Times homepage.

    Cheerio allows us to select elements in the HTML of the page, kind of like jQuery.

    The fs module is used at the end for writing the scraped data to a JSON file.

    Step 2: Define the URL and Request Options

    Next we set the URL to scrape and define some options for our HTTP request:

    // NYTimes URL
    const url = 'https://www.nytimes.com/';
    
    // Request settings
    const options = {
      url: url,
      headers: {
        'User-Agent': 'Mozilla/5.0'
      }
    };
    

    Here we are scraping the main nytimes.com homepage URL.

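    We also set a custom User-Agent header to mimic a real web browser, which reduces the chance of the request being blocked as a bot. The full listing at the end uses a complete browser User-Agent string rather than this shortened one.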

    Step 3: Send the Request and Load HTML

    With our URL and options defined, we use request to grab the page HTML:

    // Send request
    request(options, (err, res, html) => {
    
      // Load HTML
      let $;
      try {
        $ = cheerio.load(html);
      } catch(err) {
        console.log('Cheerio error:', err);
        return;
      }
    
    

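    We pass our options to request and provide a callback to handle the response. The callback receives any error, the response object, and the HTML body; the full listing at the end also checks err and res.statusCode before parsing.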

    Inside, we use cheerio's load method to parse the HTML string into a Cheerio object that we can query (stored in $).

    This gives us jQuery-style selectors to extract data.
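
The request package has been deprecated since 2020, so it may help to see how the same step could look with the fetch function built into Node.js 18 and later. This is a minimal sketch, not the method used in this article: fetchHtml and its fetchImpl parameter are hypothetical names, and the parameter exists only so the function can be exercised without touching the network.

```javascript
// Sketch: fetch a page's HTML with the built-in fetch (Node.js 18+).
// fetchHtml and fetchImpl are illustrative names, not part of any library.
async function fetchHtml(url, fetchImpl = fetch) {
  const res = await fetchImpl(url, {
    headers: { 'User-Agent': 'Mozilla/5.0' },
  });

  // Mirror the status check from the full listing: anything but 200 is an error.
  if (res.status !== 200) {
    throw new Error(`Unexpected status: ${res.status}`);
  }

  // Resolves to the response body as a string
  return res.text();
}
```

The string it resolves to can be passed to cheerio.load(html) in the same way as the html argument of the request callback.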

    Step 4: Define Variables and Select Elements

    Now that we have the page loaded, we can start extracting the data we want - article titles and links:

    Inspecting the page

    We now use Inspect Element in Chrome to see how the page is structured.

    You can see that each article sits inside a section tag with the class story-wrapper.

    // Initialize variables
    let titles = [];
    let links = [];
    
    // Select articles
    $('section.story-wrapper').each(function() {
    
      // Get data
      const title = $(this).find('h3').text().trim();
      const link = $(this).find('a').attr('href');
    
      // Validate
      if (!title || !link) {
        return;
      }
    
      // Save data
      titles.push(title);
      links.push(link);
    });
    

    First we define some arrays to store the info we scrape.

    We select all section elements with the class story-wrapper and loop through them. For each one, we take the text of its h3 element as the title and the href attribute of its first a element as the link.

    We skip any wrapper that is missing a title or a link, so only complete pairs are saved to our arrays.
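
Depending on the markup, href values can be relative paths, and the same story can appear in more than one wrapper. The validation step could be extended along these lines; this is a sketch under those assumptions, and normalizeArticles is a hypothetical helper rather than part of the original script.

```javascript
// Hypothetical helper: pair up the parallel arrays, resolve relative links
// against the homepage URL, and drop repeated links.
function normalizeArticles(titles, links, baseUrl = 'https://www.nytimes.com/') {
  const seen = new Set();
  const articles = [];

  titles.forEach((title, i) => {
    let absolute;
    try {
      // new URL(relative, base) resolves relative paths; absolute URLs pass through
      absolute = new URL(links[i], baseUrl).href;
    } catch (err) {
      return; // skip anything that cannot be parsed as a URL
    }

    if (seen.has(absolute)) {
      return; // already recorded this link
    }
    seen.add(absolute);
    articles.push({ title, link: absolute });
  });

  return articles;
}

// Made-up sample input: the third entry repeats the first link
const sample = normalizeArticles(
  ['First headline', 'Second headline', 'First headline again'],
  ['/2023/12/06/example.html', 'https://www.nytimes.com/other.html', '/2023/12/06/example.html']
);
console.log(sample);
```

Storing each title next to its link in one object also avoids having to keep two arrays in step by index.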

    Step 5: Log and Store the Scraped Data

    With our titles and links arrays populated, we wrap up by logging and storing the data:

    // Check if no articles
    if (titles.length === 0) {
      console.log('No articles found');
      return;
    }
    
    // Log articles
    titles.forEach((title, i) => {
      console.log('Title:', title);
      console.log('Link:', links[i]);
      console.log();
    });
    
    // Write to file
    fs.writeFileSync('articles.txt', JSON.stringify({ titles, links }));
    

    First we make sure we actually captured articles by checking the length.

    Then we log each title/link pair in the console so you can validate it worked.

    Finally, we use fs to write the data as a JSON string to articles.txt for later use.
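
As a sketch of how the saved file might be consumed later, the snippet below writes the same { titles, links } shape and reads it back. It uses a temporary directory and made-up sample data so it can run on its own; the two-space indentation passed to JSON.stringify is optional and only makes the file easier to read.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Sample data standing in for the scraped arrays (illustrative values only)
const data = {
  titles: ['First headline', 'Second headline'],
  links: ['https://www.nytimes.com/a.html', 'https://www.nytimes.com/b.html'],
};

// Write to a temp directory so the example does not touch the working directory
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'nyt-'));
const file = path.join(dir, 'articles.txt');
fs.writeFileSync(file, JSON.stringify(data, null, 2));

// Later: read the file back and parse it into the original structure
const loaded = JSON.parse(fs.readFileSync(file, 'utf8'));
console.log(loaded.titles.length, 'articles loaded from', file);
```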

    And that's it! Here is the full code for reference:

    // Import modules
    const request = require('request');
    const cheerio = require('cheerio'); 
    const fs = require('fs');
    
    // NYTimes URL
    const url = 'https://www.nytimes.com/';  
    
    // Request settings
    const options = {
      url: url,
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36' 
      }
    };
    
    // Send request
    request(options, (err, res, html) => {
    
      // Check errors
      if (err) { 
        console.log('Error:', err);
        return;
      }
    
      if(res.statusCode !== 200) {
        console.log('Status:', res.statusCode); 
        return;
      }
    
      // Load HTML
      let $; 
      try {
        $ = cheerio.load(html); 
      } catch(err) {
        console.log('Cheerio error:', err);
        return;
      }
    
      // Initialize variables
      let titles = []; 
      let links = [];
    
      // Select articles
      $('section.story-wrapper').each(function() {
       
        // Get data 
        const title = $(this).find('h3').text().trim();
        const link = $(this).find('a').attr('href');
        
        // Validate
        if (!title || !link) {
          return;
        }
    
        // Save data
        titles.push(title);
        links.push(link);
    
      });
    
      // Check if no articles
      if (titles.length === 0) {
        console.log('No articles found');
        return; 
      }
    
      // Log articles  
      titles.forEach((title, i) => {
        console.log('Title:', title);
        console.log('Link:', links[i]);
        console.log();
      });
    
      // Write to file
      fs.writeFileSync('articles.txt', JSON.stringify({ titles, links }));
    
    });
    

    Next Steps

    With this foundation, you can now:

  • Customize the selectors to scrape other sites
  • Set this script on a schedule to run automatically
  • Expand the data you capture from each article
  • Analyze the sentiment of headlines over time
  • Build a local search engine based on NYT content
  • Play around and make it your own!

    The key is that you now understand how to programmatically grab data from a site using Node.js. The possibilities are endless :)

    In more advanced implementations, you will even need to rotate the User-Agent string so the website can't tell it's the same browser!
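
A minimal sketch of that idea: keep a small pool of User-Agent strings and cycle through them, building a fresh options object for each request. The pool contents and the helper name buildOptions are placeholders chosen for illustration; in practice you would maintain your own list of current browser strings.

```javascript
// Illustrative pool; the first entry is the string used in the full listing above.
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
];

let nextIndex = 0;

// Hypothetical helper: returns request options with the next User-Agent in the pool
function buildOptions(url) {
  const userAgent = userAgents[nextIndex];
  nextIndex = (nextIndex + 1) % userAgents.length; // wrap around at the end
  return { url, headers: { 'User-Agent': userAgent } };
}
```

The object returned by buildOptions has the same shape as the options object defined in Step 2, so it could be passed to request in its place.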

    As you get a little more advanced, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where many web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation,
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions),
  • With our automatic CAPTCHA solving technology,

    hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed through a simple API call like the one below, from any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
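
From Node.js, the same request URL could be assembled with the built-in URL class, which also percent-encodes the target address. This is a sketch based only on the key and url parameters shown in the curl example; API_KEY is a placeholder and buildProxyUrl is a hypothetical helper name.

```javascript
// Hypothetical helper: build the Proxies API request URL for a target page.
// Only the `key` and `url` parameters from the curl example are assumed.
function buildProxyUrl(apiKey, targetUrl) {
  const endpoint = new URL('http://api.proxiesapi.com/');
  endpoint.searchParams.set('key', apiKey);
  endpoint.searchParams.set('url', targetUrl); // percent-encoded automatically
  return endpoint.href;
}

console.log(buildProxyUrl('API_KEY', 'https://example.com'));
```

The resulting string could then be used as the url value in the request options from Step 2, in place of the direct homepage URL.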

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
