Scraping Hacker News in Node.js

This Node.js code scrapes article data from the Hacker News homepage using the axios and cheerio modules. Here's a high-level overview of what it does:

Sends a GET request to the Hacker News URL using axios
Loads the returned HTML content into cheerio
Parses the page and extracts article data by targeting DOM elements
Prints out the scraped data to the console

In this beginner tutorial, we'll go through the code step-by-step to understand how it works under the hood.


Install Required Node Modules

To run this web scraper, you need to have Node.js installed and install the axios and cheerio modules:

npm install axios cheerio

Axios handles making the HTTP requests while cheerio allows querying and manipulating the returned HTML just like jQuery.

Make HTTP Request for Hacker News Page

First, we import the axios and cheerio modules:

const axios = require('axios');
const cheerio = require('cheerio');

Next, we define the URL of the page we want to scrape - the Hacker News homepage:

const url = 'https://news.ycombinator.com/';

We make a GET request using axios to fetch the content of this URL:

axios.get(url)

The axios .get() method returns a promise that resolves with a response object containing the status code and response data.

Load HTML and Parse with Cheerio

In the .then() handler, we check whether the status code is 200, meaning the request succeeded.

We then load the HTML content from response.data into Cheerio using the cheerio.load() method. This parses the document and allows us to query it with a jQuery-style syntax.

if (response.status === 200) {

  const $ = cheerio.load(response.data);

}
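A detail worth keeping in mind here: with axios's default settings, a response whose status falls outside the 2xx range rejects the promise, so a 404 or 503 ends up in .catch() instead of reaching the status check. The sketch below shows one way to tell the failure cases apart inside .catch(). The helper name describeRequestError is made up for illustration, and the error objects passed to it are hand-built stand-ins; the response and request fields are the ones axios attaches to its errors.

```javascript
// Hypothetical helper (not part of the original script): turns an error
// caught from axios.get() into a short, readable message.
function describeRequestError(error) {
  if (error.response) {
    // The server answered, but with a status outside the 2xx range.
    return `Server responded with status ${error.response.status}`;
  }
  if (error.request) {
    // The request went out but no response came back (timeout, DNS, etc.).
    return 'No response received from the server';
  }
  // Something failed before the request was sent.
  return `Request could not be sent: ${error.message}`;
}

// Stand-in error objects shaped like the ones axios produces.
console.log(describeRequestError({ response: { status: 503 } }));
console.log(describeRequestError({ request: {} }));
console.log(describeRequestError(new Error('Invalid URL')));
```

In the scraper, this could be wired in as axios.get(url).catch((error) => console.error(describeRequestError(error))).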

Find Table Rows to Scrape

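If you inspect the page in your browser's developer tools, you will notice that each item is housed inside a <tr> tag with the class athing. The row that follows it holds the details such as points, author, and comments.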

We grab all the table row elements on the page as we'll loop through them to extract article data:

const rows = $('tr');

Track Current Article and Row

As we loop through the rows, we need to keep track of the current article we're extracting data from and whether we're on an article row or details row:

let currentArticle = null;
let currentRowType = null;

Iterate Through Rows to Scrape Articles

We loop through each of the rows:

rows.each((index, row) => {

  // row parsing code

});

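Inside the loop, we wrap the raw row in a Cheerio object and check whether it has the class "athing", which marks an article row. Using hasClass() rather than comparing the whole class attribute keeps the check working if the row carries additional classes:

const $row = $(row);

if ($row.hasClass('athing')) {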

  currentArticle = $row;
  currentRowType = 'article';

}

If not, we check if we're on a details row following an article row:

} else if (currentRowType === 'article') {

  // Extract article data

}

Extract Article Data

Inside this branch, we extract the data we want (title, URL, points, and so on) by targeting elements and reading their text or attributes:

const titleElem = currentArticle.find('.title');

const articleTitle = titleElem.text().trim();

const articleUrl = titleElem.find('a').attr('href');

// ...

Let's break this down in detail...

Extracting Title

We first find the element with class "title" inside currentArticle using .find():

const titleElem = currentArticle.find('.title');

We get the text content of this element with .text(), trim any whitespace, and save to articleTitle:

const articleTitle = titleElem.text().trim();

The key things to understand are:

We use a class selector .title to target the element

text() gets the text content as a string

trim() cleans surrounding whitespace

Extracting URL

For the article URL, we find the link inside titleElem, then read its href attribute:

const articleUrl = titleElem.find('a').attr('href');

Here we use:

find() to get the descendant anchor element

attr() to return its href attribute value
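The href value is returned exactly as written in the page, so links that point back to the site itself (for instance a discussion link such as item?id=123) come out relative. A minimal sketch of resolving them against the page URL with Node's built-in URL class follows. The helper name toAbsoluteUrl and the sample hrefs are made up for illustration.

```javascript
// Base URL of the page being scraped.
const baseUrl = 'https://news.ycombinator.com/';

// Hypothetical helper: resolve an href against the page URL.
// Absolute hrefs keep their own origin; a missing href yields null.
function toAbsoluteUrl(href, base = baseUrl) {
  if (!href) return null;
  return new URL(href, base).href;
}

console.log(toAbsoluteUrl('item?id=123'));
console.log(toAbsoluteUrl('https://example.com/post'));
console.log(toAbsoluteUrl(undefined));
```

In the scraper, this would wrap the existing lookup, for example toAbsoluteUrl(titleElem.find('a').attr('href')).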

The same pattern applies to points, author, timestamp, and comments, which are read from the .subtext cell of the details row.
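The score and comment fields come back as text such as "123 points", so a small conversion step helps if you want numbers. This is a sketch under the assumption that the text starts with the count; the helper name leadingInt is hypothetical.

```javascript
// Hypothetical helper: read the integer at the start of a label such as
// "123 points" and fall back to a default when there is none.
function leadingInt(text, fallback = 0) {
  const value = Number.parseInt(String(text).trim(), 10);
  return Number.isNaN(value) ? fallback : value;
}

console.log(leadingInt('123 points'));
console.log(leadingInt('45\u00a0comments')); // non-breaking space before the word
console.log(leadingInt('discuss'));          // no leading number, uses the fallback
console.log(leadingInt(''));
```

Returning a fallback instead of NaN keeps later arithmetic or sorting from silently breaking on rows that have no score or comment count.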

Print Scraped Article Data

Finally, we print out all the extracted data so we can see the scraped article info:

console.log('Title:', articleTitle);
console.log('URL:', articleUrl);
// etc

In the full code, each article's details are followed by a line of dashes that separates one article from the next.
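If you would rather reuse the data than only print it, one option is to collect each article into a plain object and serialize the list at the end. The sketch below uses hard-coded sample values in place of the scraped variables; the articles array and the field names are illustrative choices, not part of the original script.

```javascript
// Collected results; in the scraper, push one object per article row.
const articles = [];

// Sample values standing in for articleTitle, articleUrl, points, etc.
articles.push({
  title: 'Example post',
  url: 'https://example.com/post',
  points: '10 points',
  author: 'someuser',
  comments: '3 comments',
});

// Pretty-print the whole list as JSON (two-space indentation).
const json = JSON.stringify(articles, null, 2);
console.log(json);
```

The resulting string can be written to a file or piped to another tool, which is harder to do with the line-by-line console output.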

Full Code

Here is the complete code for reference:

const axios = require('axios');
const cheerio = require('cheerio');

// Define the URL of the Hacker News homepage
const url = 'https://news.ycombinator.com/';

// Send a GET request to the URL
axios.get(url)
  .then((response) => {
    // Check if the request was successful (status code 200)
    if (response.status === 200) {
      // Load the HTML content of the page using Cheerio
      const $ = cheerio.load(response.data);

      // Find all rows in the table
      const rows = $('tr');

      // Initialize variables to keep track of the current article and row type
      let currentArticle = null;
      let currentRowType = null;

      // Iterate through the rows to scrape articles
      rows.each((index, row) => {
        const $row = $(row);

        if ($row.hasClass('athing')) {
          // This is an article row
          currentArticle = $row;
          currentRowType = 'article';
        } else if (currentRowType === 'article') {
          // This is the details row
          if (currentArticle) {
            const titleElem = currentArticle.find('.title');
            if (titleElem.length) {
              const articleTitle = titleElem.text().trim();
              const articleUrl = titleElem.find('a').attr('href');

              const subtext = $row.find('.subtext');
              const points = subtext.find('.score').text();
              const author = subtext.find('.hnuser').text();
              const timestamp = subtext.find('.age').attr('title');
              const commentsElem = subtext.find('a:contains("comments")');
              const comments = commentsElem.text().trim() || '0';

              // Print the extracted information
              console.log('Title:', articleTitle);
              console.log('URL:', articleUrl);
              console.log('Points:', points);
              console.log('Author:', author);
              console.log('Timestamp:', timestamp);
              console.log('Comments:', comments);
              console.log('-'.repeat(50)); // Separating articles
            }
          }

          // Reset the current article and row type
          currentArticle = null;
          currentRowType = null;
        } else if ($row.attr('style') === 'height:5px') {
          // This is the spacer row, skip it
          return;
        }
      });
    } else {
      console.log('Failed to retrieve the page. Status code:', response.status);
    }
  })
  .catch((error) => {
    console.error('An error occurred:', error);
  });

This is great as a learning exercise, but a scraper like this is prone to getting blocked because every request comes from a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high-speed rotating proxies located all over the world,

with our automatic IP rotation,

with our automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions),

and with our automatic CAPTCHA solving technology,

hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language such as Node or PHP, or with any framework such as Scrapy or Nutch. In all these cases, you can call the URL with render support like so:

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
