Jan 6th, 2021


Scrapy in Action

Write sophisticated spiders

It is a breeze to write full-blown spiders quickly with Scrapy. Here is one that can download all the images from a Wikipedia page.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

i = 1


class MySpider(CrawlSpider):
    name = 'Wikipedia'
    allowed_domains = ['en.wikipedia.org', 'upload.wikimedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Lists_of_animals']

    rules = (
        # This rule follows any link ending in .jpg and clears the
        # deny_extensions default so image URLs are not filtered out
        Rule(LinkExtractor(allow=(r'\.jpg',), deny_extensions=set(),
                           tags=('img',), attrs=('src',),
                           canonicalize=True, unique=True),
             follow=False, callback='parse_item'),
    )

    def parse_item(self, response):
        global i
        i = i + 1
        self.logger.info('Found image - %s', response.url)
        flname = 'image' + str(i) + '.jpg'
        with open(flname, 'wb') as image_file:
            image_file.write(response.body)

        self.logger.info('Saved image as - %s', flname)
        return scrapy.Item()
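If you save this spider as a standalone file (the filename wikipedia_images.py below is just a placeholder), you can run it without creating a full Scrapy project by using the runspider command:

scrapy runspider wikipedia_images.py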

Here is another that navigates http://quotes.toscrape.com and extracts quotes by following the pagination links.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Use selectors to extract content

The example above uses both CSS and XPath selectors to extract text easily. Between them, you can extract almost any content on the web. Plus, Scrapy makes it easy to test your selectors interactively in the Scrapy shell.
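As a rough, self-contained sketch of how the two selector types compare (the HTML string here is made up purely for illustration, mirroring the quotes page markup), the same text can be pulled out with either syntax:

from scrapy.selector import Selector

# a tiny, made-up HTML fragment for illustration
html = '<div class="quote"><span class="text">Some quote</span></div>'
sel = Selector(text=html)

print(sel.css('span.text::text').get())                 # CSS selector
print(sel.xpath('//span[@class="text"]/text()').get())  # equivalent XPath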

Do interactive testing in the Scrapy shell

It is one of my favorite things about Scrapy. One of the most time-consuming parts of any scraping project is writing the correct selectors that get you the data you want, and the fastest way to test and iterate on them is the interactive shell. You can invoke it like this:

scrapy shell http://example.com

It loads the contents of example.com into the response object, which you can now query like so.

response.xpath('//title/text()')

This will return a list of selectors matching the title of the page:

[<Selector xpath='//title/text()' data=u'Example Domain'>]

To get the headline of the page, you can use a CSS selector:

response.css('h1::text').get()

This will print:

Out[10]: u'Example Domain'
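The .get() shortcut works on XPath selectors as well, so an equivalent query for the same heading would be:

response.xpath('//h1/text()').get()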

Export data in many ways and store it in different systems

After your spiders have run successfully and extracted the data, you will need to export that data in whatever format fits your needs, and you may also have to store it in various locations such as the local disk or Amazon S3.

Scrapy ships with built-in support for the following export formats:

a. JSON

b. JSONLINES

c. CSV

d. XML

e. Pickle

f. Marshal

and it can store the exported data in any of these locations (see the sketch after this list):

g. Local storage

h. Amazon S3

i. FTP

j. Standard output
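Here is a rough sketch of the two usual ways to set this up: pass -o on the command line, or, on Scrapy 2.1 and later, configure the FEEDS setting in settings.py. The spider name and the S3 bucket below are placeholders, and S3 feeds additionally require botocore to be installed.

# quickest option: export straight from the command line
#   scrapy crawl quotes -o quotes.json

# or configure feed exports in settings.py (FEEDS is available in Scrapy 2.1+)
FEEDS = {
    'quotes.jl': {'format': 'jsonlines'},
    's3://my-bucket/quotes/%(time)s.csv': {'format': 'csv'},
}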

Use the Signals API to get notified when certain events occur

The Signals API is super useful in controlling, monitoring, and reporting the behavior of your spiders.

The following code demonstrates how you can subscribe to the spider_closed signal.

from scrapy import signals
from scrapy import Spider


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass
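The same pattern extends to any other signal. As a rough sketch (reusing the spider above, with the logging messages being my own), you could also get a callback for every scraped item by connecting to signals.item_scraped:

from scrapy import Spider, signals


class DmozSpider(Spider):
    name = "dmoz"
    # allowed_domains, start_urls and parse() as in the spider above

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # one connect call per signal you want to listen to
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
        return spider

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def item_scraped(self, item, response, spider):
        spider.logger.info('Scraped an item from %s', response.url)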

The author is the founder of Proxies API, a rotating proxies service.

