May 5th, 2020
Recursively Scraping Webpages with Scrapy

One of the most common starting points in web crawling is: how do I download all the pages on a given website? This could be for SEO purposes, for studying competitor websites, or just out of general curiosity about programming crawlers.

So let's put together the code that does exactly that, and look at the little nuances we need to take care of along the way.

Let's see what we need to load up first.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

We need scrapy loaded up, and the CrawlSpider class rather than the plain Spider class.

We also need Rule, along with a LinkExtractor, to easily find and follow links.

So a barebones setup would look like this.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TheFriendlyNeighbourhoodSpider(CrawlSpider):
    name = 'TheFriendlyNeighbourhoodSpider'

    allowed_domains = ['en.wikipedia.org', 'upload.wikimedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Lists_of_animals']

    custom_settings = {
        'LOG_LEVEL': 'INFO'
    }

    # Follow every link found and hand each downloaded page to parse_item
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print('Downloaded... ' + response.url)
        filename = 'storage/' + response.url.split("/")[-1] + '.html'
        print('Saving as: ' + filename)
        with open(filename, 'wb') as f:
            f.write(response.body)

We are setting the start_urls and restricting the allowed domains to Wikipedia. The rules tell the LinkExtractor to simply get all links and follow them. The callback to parse_item helps us save the data downloaded by the spider.

The parse_item function simply derives a filename from the URL and saves the page into the storage folder.

Let's save this file as TheFriendlyNeighbourhoodSpider.py.

Make sure you create a folder called storage to catch all the files downloaded by the spider.
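If you would rather create it from code, a small optional sketch (standard library only, nothing Scrapy-specific, and assuming the spider is run from the same directory) looks like this.

import os

# Create the storage folder if it does not exist yet
os.makedirs('storage', exist_ok=True)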

And run it with.

scrapy runspider TheFriendlyNeighbourhoodSpider.py

This should give you a bunch of HTML files saved in the storage folder.

Great! Now let's optimize the settings of this spider so it can work faster and more reliably.

For that, we will use settings like these.

custom_settings = {
    'CONCURRENT_REQUESTS': 10,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 25,
    'ROBOTSTXT_OBEY': False,
    'CONCURRENT_ITEMS': 100,
    'REACTOR_THREADPOOL_MAXSIZE': 400,
    # Hides printing of item dicts
    'LOG_LEVEL': 'INFO',
    'RETRY_ENABLED': False,
    'REDIRECT_MAX_TIMES': 1,
    # Stops downloading a page after ~5 MB
    'DOWNLOAD_MAXSIZE': 5592405,
    # Don't fail on partial or truncated responses
    'DOWNLOAD_FAIL_ON_DATALOSS': False,

    # Crawl breadth-first instead of depth-first
    'DEPTH_PRIORITY': 1,
    'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
    'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue'
    }

The CONCURRENT_REQUESTS settings let the spider fetch many pages in parallel, which speeds up the crawl considerably, and we ignore robots.txt for the moment with ROBOTSTXT_OBEY set to False.

We also avoid getting stuck on unexpectedly large downloads by setting DOWNLOAD_MAXSIZE to roughly 5 MB.

Scrapy uses the depth-first method of crawling by default. We can change that to breadth-first with these settings.

    'DEPTH_PRIORITY': 1,
    'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
    'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue'

This makes the scheduler work through requests at the same depth level before going deeper, which keeps the queue of pending requests, and therefore memory usage, from bloating.

Now when you run it, the fetching times should be significantly faster. You can make it still faster by increasing the CONCURRENT_REQUESTS limit to whatever your system and your network can handle.

If you want to use this in production and scale to thousands of links, you will find that many websites block your IP quickly. In this scenario, using a rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.
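To see roughly where a proxy would plug into Scrapy, here is a minimal sketch of a downloader middleware that routes every request through a proxy endpoint. The proxy URL, credentials, and middleware path below are placeholder assumptions for illustration; substitute whatever your proxy provider gives you.

class SimpleProxyMiddleware:
    # Sets a proxy on every outgoing request; Scrapy's built-in
    # HttpProxyMiddleware then picks up the 'proxy' key from request.meta
    def process_request(self, request, spider):
        # Placeholder endpoint -- replace with your provider's proxy URL
        request.meta['proxy'] = 'http://user:password@proxy.example.com:8000'
        return None

It can then be enabled from the spider's custom_settings (the module path here assumes the class lives in the same TheFriendlyNeighbourhoodSpider.py file):

    'DOWNLOADER_MIDDLEWARES': {
        'TheFriendlyNeighbourhoodSpider.SimpleProxyMiddleware': 350,
    },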

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

A simple API gives you access to the whole thing, as shown below, from any programming language.

You don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes, and you can just get the data and parse it in any language like Node or PHP, or with any framework like Scrapy or Nutch. In all these cases, you can just call the URL with render support like so.

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
