May 2nd, 2020

Web Scraping Weather Data Using Node JS and Puppeteer

In this article, we will learn how to quickly scrape the Weather.com 10-day forecast data using Puppeteer.

Puppeteer uses the Chromium browser behind the scenes to actually render HTML and JavaScript, so it is very useful for getting content that is loaded by JavaScript/AJAX calls.

For this, you will need to install Puppeteer inside a directory where you will write the scripts to scrape the data. For example, make a directory like this...

mkdir puppeteer

cd puppeteer

npm install puppeteer --save

That will take a moment to install Puppeteer and Chromium.

Once done, let's start with a script like this...

const puppeteer = require('puppeteer');

// launch headless Chromium with a desktop Chrome user-agent string
puppeteer.launch({ headless: true, args: ['--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3312.0 Safari/537.36'] }).then(async browser => {

    const page = await browser.newPage();
    await page.goto("https://weather.com/en-IN/weather/tenday/l/aff9460b9160c73ff01769fd83ae82cf37cb27fb7eb73c70b91257d413147b69");
    await page.waitForSelector('body');

    // the scraping logic will go inside this evaluate call
    let rposts = await page.evaluate(() => {

    });

    console.log(rposts);
    await browser.close();

}).catch(function(error) {
    console.error(error);
});

Even though it looks like a lot, it just launches the Puppeteer browser, creates a new page, loads the URL we want and waits for the page's HTML to be loaded.

The evaluate function runs inside the page's context and lets you query the content with the browser's own DOM query functions and CSS selectors.
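Anything you return from the function passed to evaluate is serialized and handed back to your Node script. A quick sanity check, added here only as an illustration, could go right after the waitForSelector call...

// runs in the browser context, so document here is the page's own DOM
const pageTitle = await page.evaluate(() => document.title);
console.log(pageTitle);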

The launch call instructs Puppeteer to run in headless mode, so you don't see the browser, but it is there behind the scenes. The --user-agent string imitates a Chrome browser on a Mac so you don't get blocked.
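If you prefer, Puppeteer can also set the user agent on the page itself with page.setUserAgent instead of passing a Chromium argument. A small sketch of that alternative, using the same UA string...

const page = await browser.newPage();
// alternative to the --user-agent launch argument
await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3312.0 Safari/537.36');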

Save this file as get_weather.js and if you run it, it should not return any errors.

node get_weather.js

Now let's see if we can scrape some data...

Open Chrome and navigate to the weather.com website.

We are going to scrape the dates, the description, temperature, precipitation, wind and humidity data. Let's open the inspect tool to see what we are up against.

You can see with some tinkering around that each day's forecast row is encapsulated in a tag with the class name clickable.

Since every row shares this one class, we are going to loop over them with forEach and pull out the individual pieces of data separately.

So the code will look like this...

const puppeteer = require('puppeteer');

// launch headless Chromium with a desktop Chrome user-agent string
puppeteer.launch({ headless: true, args: ['--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3312.0 Safari/537.36'] }).then(async browser => {

    const page = await browser.newPage();
    await page.goto("https://weather.com/en-IN/weather/tenday/l/892a73a36fbd270bd1d5230d6b8a00a7da793b50b9878f50b49360ba0e072dc5");
    await page.waitForSelector('body');

    let rposts = await page.evaluate(() => {

        // every forecast row on the page carries the class "clickable"
        let posts = document.body.querySelectorAll('.clickable');
        let postItems = [];

        posts.forEach((item) => {

            let dayDetail = '';
            let description = '';
            let temp = '';
            let humidity = '';
            let wind = '';
            let precip = '';

            try {
                // the day/date label; rows without it are not forecast rows
                dayDetail = item.querySelector('.day-detail').innerText;
                if (dayDetail != '') {
                    description = item.querySelector('.description').innerText;
                    temp = item.querySelector('.temp').innerText;
                    precip = item.querySelector('.precip').innerText;
                    humidity = item.querySelector('.humidity').innerText;
                    wind = item.querySelector('.wind').innerText;
                    postItems.push({ dayDetail: dayDetail, description: description, temp: temp, precip: precip, humidity: humidity, wind: wind });
                }
            } catch (e) {
                // some rows are missing one of the pieces; skip them quietly
            }

        });

        return {
            "posts": postItems
        };

    });

    console.log(rposts);
    await browser.close();

}).catch(function(error) {
    console.error(error);
});

You can see that the element with the day-detail class always has the day/date info, so we fetch that. The query...

dayDetail = item.querySelector('.day-detail').innerText;

...gets us the day/date info attached to each of the days. We put all of this in a try... catch... because some rows might not have a piece of info, which would raise an error and break the code.
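If you would rather not rely on try... catch..., a small helper that returns an empty string when a selector is missing works just as well. This safeText function is not part of the original script, only a sketch of the same idea, and it has to be defined inside page.evaluate so it runs in the browser context...

// hypothetical helper: returns '' instead of throwing when the selector is absent
const safeText = (parent, selector) => {
    const el = parent.querySelector(selector);
    return el ? el.innerText : '';
};

// usage inside the forEach, replacing the individual querySelector calls
let dayDetail = safeText(item, '.day-detail');
let temp = safeText(item, '.temp');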

If you run this, it should print all the weather forecast info for the next 15 days, like so...
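The exact values depend on the day and location you run it for, so the placeholders below only show the shape of the object that gets printed...

{
  posts: [
    { dayDetail: '...', description: '...', temp: '...', precip: '...', humidity: '...', wind: '...' },
    ...
  ]
}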

If you want to use this in production and want to scale to thousands of links then you will find that you will get IP blocked easily by Weather.com. In this scenario using a rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage and bot detection algorithms.
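If you do route Puppeteer through a proxy yourself, Chromium accepts a --proxy-server launch argument, and Puppeteer's page.authenticate handles proxies that need credentials. The host, port and credentials below are placeholders, not real values...

// placeholder proxy host/port and credentials, purely illustrative
const browser = await puppeteer.launch({
    headless: true,
    args: ['--proxy-server=http://proxy.example.com:8080']
});
const page = await browser.newPage();
await page.authenticate({ username: 'PROXY_USER', password: 'PROXY_PASS' });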

Our rotating proxy server Proxies API provides a simple API that can solve all IP-blocking problems instantly.

• With millions of high speed rotating proxies located all over the world.
• With our automatic IP rotation.
• With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions).
• With our automatic CAPTCHA solving technology.

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes, and you can just get the data and parse it in any language like Node or PHP, or with any framework like Scrapy or Nutch. In all these cases, you can just call the URL with render support like so...

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
