Scraping Hacker News in Node.js

Jan 21, 2024 · 6 min read

This Node.js code scrapes article data from the Hacker News homepage using the axios and cheerio modules. Here's a high-level overview of what it does:

  1. Sends a GET request to the Hacker News URL using axios
  2. Loads the returned HTML content into cheerio
  3. Parses the page and extracts article data by targeting DOM elements
  4. Prints out the scraped data to the console

In this beginner tutorial, we'll go through the code step-by-step to understand how it works under the hood.

This is the page we are talking about: the Hacker News homepage at https://news.ycombinator.com/.

Install Required Node Modules

To run this web scraper, you need to have Node.js installed and install the axios and cheerio modules:

npm install axios cheerio

Axios handles making the HTTP requests while cheerio allows querying and manipulating the returned HTML just like jQuery.

Make HTTP Request for Hacker News Page

First, we import the axios and cheerio modules:

const axios = require('axios');
const cheerio = require('cheerio');

Next, we define the URL of the page we want to scrape - the Hacker News homepage:

const url = 'https://news.ycombinator.com/';

We make a GET request using axios to fetch the content of this URL:

axios.get(url)

The axios .get() method returns a promise that resolves with a response object containing the status code and response data.
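
As a minimal sketch of this step, the helper below wraps the request in a function that returns the HTML or throws on a non-200 status. The name fetchHtml, the 10-second timeout, and the User-Agent string are illustrative assumptions and not part of the original script; client stands for the axios module or any object with a compatible get() method.

```javascript
// Hypothetical helper (not in the original script). `client` is expected to
// behave like axios: client.get(url, config) resolves to { status, data }.
async function fetchHtml(client, url) {
  const response = await client.get(url, {
    timeout: 10000, // assumed value: give up after 10 seconds
    headers: { 'User-Agent': 'hn-scraper-tutorial/1.0' }, // assumed value
    // Let every status resolve so the check below decides what to do with it.
    validateStatus: () => true,
  });

  if (response.status !== 200) {
    throw new Error(`Failed to retrieve the page. Status code: ${response.status}`);
  }
  return response.data; // the raw HTML string
}
```

With axios installed, this could be called as fetchHtml(require('axios'), 'https://news.ycombinator.com/') and the resulting promise handled with .then() and .catch() as in the rest of the tutorial.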

Load HTML and Parse with Cheerio

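In the .then() handler, we check whether the status code is 200, meaning the request succeeded. By default, axios rejects the promise for any status outside the 2xx range, so those failures end up in .catch(); the explicit check is an extra safeguard.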

We then load the HTML content from response.data into Cheerio using the cheerio.load() method. This parses the document and allows us to query it with a jQuery-style syntax.

if (response.status === 200) {

  const $ = cheerio.load(response.data);

}

Find Table Rows to Scrape

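Inspecting the page

If you inspect the page with your browser's developer tools, you'll notice that each item is housed inside a <tr> tag with the class athing. The row that follows it holds the item's details, such as points, author, and comments.

We grab all the table row elements on the page, as we'll loop through them to extract article data: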

const rows = $('tr');

Track Current Article and Row

As we loop through the rows, we need to keep track of the current article we're extracting data from and whether we're on an article row or details row:

let currentArticle = null;
let currentRowType = null;

Iterate Through Rows to Scrape Articles

We loop through each of the rows:

rows.each((index, row) => {

  // row parsing code

});

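Inside this, we wrap the row in a Cheerio object and check whether it has the class "athing", which marks an article row. Using .hasClass() keeps the check working even if the row carries additional classes:

const $row = $(row);

if ($row.hasClass('athing')) {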

  currentArticle = $row;
  currentRowType = 'article';

}
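
An exact comparison against the class attribute only matches when "athing" is the row's sole class. Cheerio's .hasClass() performs a token check instead. The plain JavaScript sketch below illustrates the difference; hasClassToken is a hypothetical stand-in written for this example, and the two-class string is an assumed sample value.

```javascript
// Mimics a class-token check on a raw class attribute string.
function hasClassToken(classAttr, name) {
  if (!classAttr) return false; // attribute missing or empty
  return classAttr.trim().split(/\s+/).includes(name);
}

const single = 'athing';
const multiple = 'athing submission'; // assumed example of a row with two classes

console.log(single === 'athing');               // true
console.log(multiple === 'athing');             // false: exact match fails
console.log(hasClassToken(multiple, 'athing')); // true: token match succeeds
```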

If not, we check if we're on a details row following an article row:

} else if (currentRowType === 'article') {

  // Extract article data

}
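
The pairing logic can be tried out without any HTML. The sketch below applies the same idea (remember the article row, then treat the next row as its details row) to plain objects. The kind field, the pairRows name, and the sample data are assumptions made for illustration only.

```javascript
// Pairs each article row with the row that immediately follows it.
function pairRows(rows) {
  const pairs = [];
  let currentArticle = null;

  for (const row of rows) {
    if (row.kind === 'article') {
      currentArticle = row; // remember it until the details row arrives
    } else if (currentArticle) {
      pairs.push({ article: currentArticle, details: row });
      currentArticle = null; // reset so later spacer rows are ignored
    }
    // any other row (for example a spacer) is skipped
  }
  return pairs;
}

// Simplified stand-ins for table rows (made-up data).
const sampleRows = [
  { kind: 'article', text: 'Story A' },
  { kind: 'details', text: '10 points' },
  { kind: 'spacer' },
  { kind: 'article', text: 'Story B' },
  { kind: 'details', text: '5 points' },
];

console.log(pairRows(sampleRows).length); // 2
```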

Extract Article Data

Inside this branch, we extract the data we want (title, URL, points, and so on) by targeting elements and reading their text or attributes:

const titleElem = currentArticle.find('.title');

const articleTitle = titleElem.text().trim();

const articleUrl = titleElem.find('a').attr('href');

// ...

Let's break this down in detail...

Extracting Title

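We first find the element with class "title" inside currentArticle using .find(). Note that more than one cell in a row may carry this class on the live page (the rank cell as well as the headline cell), in which case the text will also include the rank and site name; a narrower selector such as .titleline > a would target just the headline link: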

const titleElem = currentArticle.find('.title');

We get the text content of this element with .text(), trim any whitespace, and save to articleTitle:

const articleTitle = titleElem.text().trim();

The key things to understand are:

  • We use a class selector .title to target the element
  • text() gets the text content as a string
  • trim() cleans surrounding whitespace

Extracting URL

For the article URL, we again find the link inside titleElem, then access its href attribute:

const articleUrl = titleElem.find('a').attr('href');

Here we use:

  • find() to get the descendant anchor element
  • attr() to return its href attribute value

And similarly for points, author, comments, etc!
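
One detail worth handling: the extracted href is not guaranteed to be an absolute URL. Assuming some posts link to a relative path on the site (for example item?id=1), the sketch below resolves every href against the homepage URL with Node's built-in URL class. The resolveUrl name and the sample values are hypothetical.

```javascript
const BASE_URL = 'https://news.ycombinator.com/';

// Hypothetical helper: turn an href into an absolute URL, or null if missing.
function resolveUrl(href, base = BASE_URL) {
  if (!href) return null; // .attr('href') yields undefined when no link is found
  try {
    return new URL(href, base).href;
  } catch {
    return null; // not a parseable URL
  }
}

console.log(resolveUrl('item?id=1'));             // https://news.ycombinator.com/item?id=1
console.log(resolveUrl('https://example.com/a')); // https://example.com/a
console.log(resolveUrl(undefined));               // null
```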

Print Scraped Article Data

Finally, we print out all the extracted data so we can see the scraped article info:

console.log('Title:', articleTitle);
console.log('URL:', articleUrl);
// etc

This prints each article's details to the console, with a line of dashes separating one article from the next.
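
If you want to reuse the data rather than only print it, one option is to collect each article into an object first. The sketch below is an assumption about how that might look, not part of the original script: formatArticle and the sample record are made up for illustration.

```javascript
// Hypothetical helper: format one article object as a printable block.
function formatArticle(article) {
  const lines = [
    `Title: ${article.title}`,
    `URL: ${article.url}`,
    `Points: ${article.points}`,
    `Author: ${article.author}`,
    `Comments: ${article.comments}`,
    '-'.repeat(50), // separator between articles
  ];
  return lines.join('\n');
}

// Made-up sample data, for illustration only.
const articles = [
  {
    title: 'Example story',
    url: 'https://example.com/',
    points: '10 points',
    author: 'someone',
    comments: '3 comments',
  },
];

articles.forEach((article) => console.log(formatArticle(article)));
console.log(JSON.stringify(articles, null, 2)); // the same data as JSON
```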

Full Code

Here is the complete code for reference:

    const axios = require('axios');
    const cheerio = require('cheerio');
    
    // Define the URL of the Hacker News homepage
    const url = 'https://news.ycombinator.com/';
    
    // Send a GET request to the URL
    axios.get(url)
      .then((response) => {
        // Check if the request was successful (status code 200)
        if (response.status === 200) {
          // Load the HTML content of the page using Cheerio
          const $ = cheerio.load(response.data);
    
          // Find all rows in the table
          const rows = $('tr');
    
          // Initialize variables to keep track of the current article and row type
          let currentArticle = null;
          let currentRowType = null;
    
          // Iterate through the rows to scrape articles
          rows.each((index, row) => {
            const $row = $(row);
    
            if ($row.hasClass('athing')) {
              // This is an article row
              currentArticle = $row;
              currentRowType = 'article';
            } else if (currentRowType === 'article') {
              // This is the details row
              if (currentArticle) {
                const titleElem = currentArticle.find('.title');
                if (titleElem.length) {
                  const articleTitle = titleElem.text().trim();
                  const articleUrl = titleElem.find('a').attr('href');
    
                  const subtext = $row.find('.subtext');
                  const points = subtext.find('.score').text();
                  const author = subtext.find('.hnuser').text();
                  const timestamp = subtext.find('.age').attr('title');
                  const commentsElem = subtext.find('a:contains("comment")');
                  const comments = commentsElem.text().trim() || '0';
    
                  // Print the extracted information
                  console.log('Title:', articleTitle);
                  console.log('URL:', articleUrl);
                  console.log('Points:', points);
                  console.log('Author:', author);
                  console.log('Timestamp:', timestamp);
                  console.log('Comments:', comments);
                  console.log('-'.repeat(50)); // Separating articles
                }
              }
    
              // Reset the current article and row type
              currentArticle = null;
              currentRowType = null;
            } else if ($row.attr('style') === 'height:5px') {
              // This is the spacer row, skip it
              return;
            }
          });
        } else {
          console.log('Failed to retrieve the page. Status code:', response.status);
        }
      })
      .catch((error) => {
        console.error('An error occurred:', error);
      });
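
The script above keeps points and comments as raw strings such as "123 points". If you need numbers (for sorting or filtering, say), a small conversion step helps. The sketch below is one possible approach; parseCount is a hypothetical helper, and treating text without a number as 0 is an assumption.

```javascript
// Hypothetical helper: pull the leading integer out of strings such as
// "123 points" or "45 comments"; returns 0 when there is no number.
function parseCount(text) {
  const value = parseInt(String(text).trim(), 10);
  return Number.isNaN(value) ? 0 : value;
}

console.log(parseCount('123 points'));       // 123
console.log(parseCount('45\u00a0comments')); // 45 (a non-breaking space is fine)
console.log(parseCount('discuss'));          // 0
console.log(parseCount(''));                 // 0
```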

This is great as a learning exercise, but a scraper like this sends every request from a single IP address, so it is prone to getting blocked. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation,
  • With our automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions),
  • With our automatic CAPTCHA solving technology,

hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language, such as Node.js or PHP, or with any framework, such as Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
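
From Node.js, the same request URL can be assembled with the built-in URL class. This is a minimal sketch: the endpoint and the parameter names (key, render, url) come from the curl example, while the buildProxyUrl name is hypothetical and percent-encoding the target URL is an assumption based on standard query-string practice.

```javascript
// Builds a request URL for the endpoint shown in the curl example.
function buildProxyUrl(apiKey, targetUrl, render = true) {
  const endpoint = new URL('http://api.proxiesapi.com/');
  endpoint.searchParams.set('key', apiKey);
  endpoint.searchParams.set('render', String(render));
  endpoint.searchParams.set('url', targetUrl); // percent-encoded automatically
  return endpoint.toString();
}

console.log(buildProxyUrl('API_KEY', 'https://news.ycombinator.com/'));
```

The resulting string could then be passed to axios.get() in place of the direct Hacker News URL.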
    
    

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
