May 6th, 2020
How to Scrape Weather Data Using Python Scrapy

Scrapy is one of the most accessible tools you can use to scrape, and also spider, a website with ease.

Today, let's see how we can scrape weather data from the internet.

Here is the URL we are going to scrape: https://weather.com/en-IN/weather/tenday/l/6d031a57074ba2aebf48f086cb118df52748edf41d9c624fd95329c6e070754d, which provides a 10-day forecast for San Francisco.

First, we need to install scrapy if you haven't already.

pip install scrapy

Once installed, go ahead and create a project by invoking the startproject command.

scrapy startproject scrapingproject

This will output something like this.

New Scrapy project 'scrapingproject', using template directory '/Library/Python/2.7/site-packages/scrapy/templates/project', created in:
    /Applications/MAMP/htdocs/scrapy_examples/scrapingproject

You can start your first spider with:
    cd scrapingproject
    scrapy genspider example example.com

And create a folder structure like this.

Now cd into scrapingproject. Because startproject creates two nested directories with the same name, you will need to do it twice:

cd scrapingproject
cd scrapingproject

Now we need a spider to crawl through the Weather.com page. So we use genspider to tell Scrapy to create one for us. We call the spider ourfirstbot and pass it the URL of the Weather.com page.

scrapy genspider ourfirstbot https://weather.com/en-IN/weather/tenday/l/6d031a57074ba2aebf48f086cb118df52748edf41d9c624fd95329c6e070754d

This should return successfully like this.

Created spider 'ourfirstbot' using template 'basic' in module:
  scrapingproject.spiders.ourfirstbot

Great. Now open the file ourfirstbot.py in the spiders folder. It should look like this.

# -*- coding: utf-8 -*-
import scrapy


class OurfirstbotSpider(scrapy.Spider):
    name = 'ourfirstbot'
    allowed_domains = ['weather.com']
    start_urls = ['https://weather.com/en-IN/weather/tenday/l/6d031a57074ba2aebf48f086cb118df52748edf41d9c624fd95329c6e070754d']

    def parse(self, response):
        pass

Let's examine this code before we proceed.

The allowed_domains list restricts all further crawling to the domains specified here.

start_urls is the list of URLs to crawl. For us, in this example, we only need one URL.

The parse(self, response) function is called by Scrapy after every successful URL crawl. This is where we write our code to extract the data we want.

We now need to find the CSS selectors of the elements we want to extract data from. Go to the Weather.com URL above, right-click on the date portion of one of the forecast entries, and click Inspect. This will open the Google Chrome Inspector like below.

You can see that the CSS class name of the date element is day-detail, so we are going to ask Scrapy to get us the contents of this class like this.

dates = response.css('.day-detail').extract()

Similarly, we try and find the class names of the temperature element, the precipitation element, and so on (note that the class names might change by the time you run this code).

dates = response.css('.day-detail').extract()
descriptions = response.css('.description').extract()
temps = response.css('.temp').extract()
precipitations = response.css('.precip').extract()
winds = response.css('.wind').extract()
humiditys = response.css('.humidity').extract()

If you are unfamiliar with CSS selectors, you can refer to this page by Scrapy https://docs.scrapy.org/en/latest/topics/selectors.html

We now use the zip function to map corresponding indexes of these lists so that they can be iterated over as a single entity. The full spider now looks like this.

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup


class OurfirstbotSpider(scrapy.Spider):
    name = 'ourfirstbot'
    start_urls = [
        'https://weather.com/en-IN/weather/tenday/l/6d031a57074ba2aebf48f086cb118df52748edf41d9c624fd95329c6e070754d',
    ]

    def parse(self, response):
        dates = response.css('.day-detail').extract()
        descriptions = response.css('.description').extract()
        temps = response.css('.temp').extract()
        precipitations = response.css('.precip').extract()
        winds = response.css('.wind').extract()
        humiditys = response.css('.humidity').extract()

        # Pair up the ith date with the ith description, temp, etc.
        for item in zip(dates, descriptions, temps, precipitations, winds, humiditys):
            all_items = {
                'date': BeautifulSoup(item[0], 'html.parser').text,
                'description': BeautifulSoup(item[1], 'html.parser').text,
                'temp': BeautifulSoup(item[2], 'html.parser').text,
                'precipitation': BeautifulSoup(item[3], 'html.parser').text,
                'wind': BeautifulSoup(item[4], 'html.parser').text,
                'humidity': BeautifulSoup(item[5], 'html.parser').text,
            }

            yield all_items
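To make the zip step concrete, here is a standalone sketch in plain Python, with made-up values standing in for the scraped lists:

```python
# Hypothetical scraped lists -- in the spider these come from response.css(...)
dates = ['Sun 10', 'Mon 11']
temps = ['21°/13°', '19°/12°']
precips = ['10%', '60%']

# zip pairs up the ith element of each list into one tuple per forecast day,
# so one loop iteration sees all the fields for a single day together
for date, temp, precip in zip(dates, temps, precips):
    item = {'date': date, 'temp': temp, 'precipitation': precip}
    print(item)
```

If the lists have unequal lengths, zip stops at the shortest one, which is worth keeping in mind if a selector misses some elements on the page.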

We use BeautifulSoup to strip the HTML tags and get pure text. Now let's run this with the following command (notice we are turning off obeying robots.txt).

scrapy crawl ourfirstbot -s ROBOTSTXT_OBEY=False
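If you would rather not pass the flag on every run, the same setting can live in the project's settings file (scrapingproject/settings.py, generated by startproject):

```python
# scrapingproject/settings.py
ROBOTSTXT_OBEY = False
```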

Bingo. You get the results below.

Now, let's export the extracted data to a CSV file. All you have to do is provide an output file like this.

scrapy crawl ourfirstbot -o data.csv

Or, if you want the data in JSON format:

scrapy crawl ourfirstbot -o data.json

Scaling Scrapy

The example above is fine for small-scale web crawling projects. But if you try to scrape large quantities of data at high speeds from websites like Weather.com, you will find that sooner or later, your access will be restricted. Weather.com can tell you are a bot, so one of the things you can do is run the crawler impersonating a web browser. This is done by passing a user agent string to the Weather.com web server, so it doesn't block you.

Like this

scrapy crawl ourfirstbot -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) \
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36" \
-s ROBOTSTXT_OBEY=False

In more advanced implementations, you will even need to rotate this string, so Weather.com can't tell it's the same browser! Welcome to web scraping.
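A minimal sketch of what rotation means, assuming a small hand-picked pool of user agent strings (the wiring into Scrapy, for example via a downloader middleware, is left out):

```python
import random

# A small, hand-picked pool of user agent strings (illustrative, not exhaustive)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
]

def random_user_agent():
    """Pick a different-looking browser identity for each request."""
    return random.choice(USER_AGENTS)
```

In a real project, this function would be called once per outgoing request so that consecutive requests don't all carry the same identity.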

If we get a little bit more advanced, you will realize that Weather.com can simply block your IP, ignoring all your other tricks. This is a bummer, and this is where most web crawling projects fail.

Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with the 1000 free API calls on offer, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

You can access the whole thing with a simple API call like below, in any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.

Once you have an API_KEY from Proxies API, you just have to change your code to this.

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup


class OurfirstbotSpider(scrapy.Spider):
    name = 'ourfirstbot'
    start_urls = [
        'http://api.proxiesapi.com/?key=API_KEY&url=https://weather.com/en-IN/weather/tenday/l/6d031a57074ba2aebf48f086cb118df52748edf41d9c624fd95329c6e070754d',
    ]

    def parse(self, response):
        dates = response.css('.day-detail').extract()
        descriptions = response.css('.description').extract()
        temps = response.css('.temp').extract()
        precipitations = response.css('.precip').extract()
        winds = response.css('.wind').extract()
        humiditys = response.css('.humidity').extract()

        # Pair up the ith date with the ith description, temp, etc.
        for item in zip(dates, descriptions, temps, precipitations, winds, humiditys):
            all_items = {
                'date': BeautifulSoup(item[0], 'html.parser').text,
                'description': BeautifulSoup(item[1], 'html.parser').text,
                'temp': BeautifulSoup(item[2], 'html.parser').text,
                'precipitation': BeautifulSoup(item[3], 'html.parser').text,
                'wind': BeautifulSoup(item[4], 'html.parser').text,
                'humidity': BeautifulSoup(item[5], 'html.parser').text,
            }

            yield all_items

We have only changed one line in the start_urls array, and that will make sure we never have to worry about IP rotation, user agent string rotation, or even rate limits again.
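One caveat worth noting: if the target URL carries its own query string, it is safer to percent-encode it before embedding it in the url parameter, so its characters don't clash with the outer query string. A sketch using the standard library (API_KEY is a placeholder, as above):

```python
from urllib.parse import quote_plus

target = 'https://weather.com/en-IN/weather/tenday/l/6d031a57074ba2aebf48f086cb118df52748edf41d9c624fd95329c6e070754d'
api_key = 'API_KEY'  # placeholder -- use your real key from Proxies API

# Percent-encode the target URL so it survives as a single query parameter
proxied_url = 'http://api.proxiesapi.com/?key=%s&url=%s' % (api_key, quote_plus(target))
print(proxied_url)
```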
