Jan 6th, 2021


Scrapy in Action

Write sophisticated spiders

It is a breeze to write full-blown spiders quickly with Scrapy. Here is one that can download all the images from a Wikipedia page.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

i = 1


class MySpider(CrawlSpider):
    name = 'Wikipedia'
    allowed_domains = ['en.wikipedia.org', 'upload.wikimedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Lists_of_animals']

    rules = (
        # This rule follows any link ending in .jpg and clears the
        # deny_extensions default so image URLs are not filtered out
        Rule(LinkExtractor(allow=(r'\.jpg',), deny_extensions=set(),
                           tags=('img',), attrs=('src',),
                           canonicalize=True, unique=True),
             follow=False, callback='parse_item'),
    )

    def parse_item(self, response):
        global i
        i = i + 1
        self.logger.info('Found image - %s', response.url)
        flname = 'image' + str(i) + '.jpg'
        with open(flname, 'wb') as image_file:
            image_file.write(response.body)

        self.logger.info('Saved image as - %s', flname)
        return scrapy.Item()
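If you save this spider as a standalone file (the filename wikipedia_images.py below is just a placeholder), you can run it without creating a full Scrapy project by using the runspider command:

scrapy runspider wikipedia_images.py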

Here is another that navigates http://quotes.toscrape.com and extracts quotes by following the pagination links.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Use selectors to extract content

The example above uses both CSS and XPath selectors to extract text easily. Between them, you can extract almost any content on the web. Plus, Scrapy makes it easy to test your selectors interactively in the Scrapy shell.
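As a rough, self-contained sketch of how the two selector types compare (the HTML string here is made up purely for illustration, mirroring the quotes page markup), the same text can be pulled out with either syntax:

from scrapy.selector import Selector

# a tiny, made-up HTML fragment for illustration
html = '<div class="quote"><span class="text">Some quote</span></div>'
sel = Selector(text=html)

print(sel.css('span.text::text').get())                 # CSS selector
print(sel.xpath('//span[@class="text"]/text()').get())  # equivalent XPath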

Do interactive testing in the Scrapy shell

It is one of my favorite things about Scrapy. One of the most time-consuming parts of any scraping project is writing the correct selectors that get you the data you want, and the fastest way to test and iterate on them is the interactive shell. You can invoke it like this:

scrapy shell http://example.com

It loads the contents of example.com into the response object, which you can now query like so.

response.xpath('//title/text()')

This will return a list of selectors matching the title of the page:

[<Selector xpath='//title/text()' data=u'Example Domain'>]

To get the headline of the page, you can use a CSS selector:

response.css('h1::text').get()

This will print:

Out[10]: u'Example Domain'
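The .get() shortcut works on XPath selectors as well, so an equivalent query for the same heading would be:

response.xpath('//h1/text()').get()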

Export data in many ways and store it in different systems

After your spiders have run successfully and extracted the data, you will need to export that data in whatever format fits your needs, and you may also have to store it in various locations such as the local disk or Amazon S3.

Scrapy ships with built-in support for the following export formats:

a. JSON

b. JSONLINES

c. CSV

d. XML

e. Pickle

f. Marshal

and it can store the exported data in any of these locations (see the sketch after this list):

g. Local storage

h. Amazon S3

i. FTP

j. Standard output
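Here is a rough sketch of the two usual ways to set this up: pass -o on the command line, or, on Scrapy 2.1 and later, configure the FEEDS setting in settings.py. The spider name and the S3 bucket below are placeholders, and S3 feeds additionally require botocore to be installed.

# quickest option: export straight from the command line
#   scrapy crawl quotes -o quotes.json

# or configure feed exports in settings.py (FEEDS is available in Scrapy 2.1+)
FEEDS = {
    'quotes.jl': {'format': 'jsonlines'},
    's3://my-bucket/quotes/%(time)s.csv': {'format': 'csv'},
}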

Use the Signals API to get notified when certain events occur

The Signals API is super useful in controlling, monitoring, and reporting the behavior of your spiders.

The following code demonstrates how you can subscribe to the spider_closed signal.

from scrapy import signals
from scrapy import Spider


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass
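The same pattern extends to any other signal. As a rough sketch (reusing the spider above, with the logging messages being my own), you could also get a callback for every scraped item by connecting to signals.item_scraped:

from scrapy import Spider, signals


class DmozSpider(Spider):
    name = "dmoz"
    # allowed_domains, start_urls and parse() as in the spider above

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # one connect call per signal you want to listen to
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
        return spider

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def item_scraped(self, item, response, spider):
        spider.logger.info('Scraped an item from %s', response.url)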

The author is the founder of Proxies API, a rotating proxies service.

