sandcrawler.js - The server-side scraping companion.

Spider


sandcrawler's spiders enable you to perform complex scraping tasks.

Their purpose is to visit series of urls and scrape the retrieved pages' contents.


  • Introduction
  • Spider methods
      • Feeding
      • Scraping
      • Lifecycle
      • Configuration
      • Controls
  • Job specification
  • Conclusion


Basics

Here is how a spider works:

  • You must create one:
var sandcrawler = require('sandcrawler');

var spider = sandcrawler.spider('MySpiderName');
  • Then you must feed it with urls:
spider.urls([
  'http://url1.com',
  'http://url2.com'
]);
  • And specify the scraper it will use on those urls:
spider.scraper(function($, done) {
  done(null, $('.yummy-data').scrape());
});
  • So you can do something with the results of the scraper:
spider.result(function(err, req, res) {
  console.log('Yummy data!', res.data);
});
  • Finally you must run the spider so it can start doing its job:
spider.run(function(err, remains) {
  console.log('Finished!');
});
  • Chained, it may look like this:
var spider = sandcrawler.spider('MySpiderName')
  .urls([
    'http://url1.com',
    'http://url2.com'
  ])
  .scraper(function($, done) {
    done(null, $('.yummy-data').scrape());
  })
  .result(function(err, req, res) {
    console.log('Yummy data!', res.data);
  })
  .run(function(err, remains) {
    console.log('Finished!');
  });

Note that if you need to perform your scraping task in a phantom, you just need to change the spider type and it should work the same:

var spider = sandcrawler.phantomSpider();
// instead of
var spider = sandcrawler.spider();

Be sure however to pay a visit to the Phantom Spider page of this documentation to avoid typical pitfalls.


spider.url

This method can be used to add a single job to your spider's queue.

A job, in its simplest definition, is a mere url, but it can also be described by an object when you need finer parameters.

spider.url(feed [, when]);

Arguments

  • feed string|object: either a string representing the url you need to hit, or a descriptive object containing the possible keys listed below.
  • when ?string ['later']: 'later' or 'now' to control whether the job should be added at the end or at the front of the spider's queue (an example follows below).

Job descriptive object:

  • url string|object: the url you need to hit as a string or an object to be formatted by node's url module.
  • auth ?object: an object containing at least a user and optionally a password to authenticate through http.
  • body ?object|string: if bodyType is set to 'form', either a querystring or an object that will be formatted as a querystring. If bodyType is set to 'json', either a JSON string or an object that will be stringified.
  • bodyType ?string ['form']: either 'form' or 'json'.
  • cheerio ?object: options passed to cheerio when parsing the results.
  • cookies ?array: array of cookies to send with the request. Each cookie can be given as a string or as an object that will be passed to tough-cookie.
  • data ?mixed: any arbitrary data, usually an object, you would need to attach to your job and pass along the spider for later use (a database id, for instance).
  • headers ?object: object of custom headers to send with the request.
  • method ?string ['GET']: http method to use.
  • proxy ?string|object: a proxy for the request.
  • timeout ?integer [5000]: time in milliseconds to perform the job before triggering a timeout.

Examples

// String url
spider.url('http://nicesite.com');

// Url object
spider.url({
  port: 8000,
  hostname: 'nicesite.com'
});

// Job object
spider.url({
  url: {
    port: 8000,
    hostname: 'nicesite.com'
  },
  headers: {
    'User-Agent': 'The jawa avenger'
  },
  data: {
    id: 'nice1',
    location: './test/'
  }
});
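
If you need a job to be performed before the ones already waiting, you can pass 'now' as the second argument. A minimal sketch using the when parameter described above, assuming 'now' places the job at the front of the queue:

// Adding a high-priority job at the front of the queue
spider.url('http://nicesite.com', 'now');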

spider.urls

Same as spider.url except you can pass an array of jobs.

spider.urls(feeds [, when]);

Examples

spider.urls([
  'http://nicesite.com',
  'http://prettysite.com'
]);

spider.urls([
  {url: 'http://nicesite.com', method: 'POST'},
  {url: 'http://prettysite.com', method: 'POST'}
]);

Note

Under the hood, spider.url and spider.urls are strictly the same. Distinguishing them is just a matter of convention and style.


spider.addUrl

Alias of spider.url.


spider.addUrls

Alias of spider.urls.


spider.iterate

This method takes a function that, given the result of the last job, returns the next url to hit, or false if the spider should stop.

spider.iterate(fn);

The given function will be passed the following arguments:

  • i integer: index of the current job.
  • req object: the last job's request.
  • res object: last job's response.

Example

// Spider starting on a single url and paginating from it
var spider = sandcrawler.spider()
  .url('http://nicesite.com')
  .iterate(function(i, req, res) {
    return res.data.nextUrl || false;
  })
  .scraper(function($, done) {
    done(null, {nextUrl: $('.next-page').attr('href')});
  });

// This is roughly the same as adding the next url at runtime
var spider = sandcrawler.spider()
  .url('http://nicesite.com')
  .scraper(function($, done) {
    done(null, {nextUrl: $('.next-page').attr('href')});
  })
  .result(function(err, req, res) {
    if (!err && res.data.nextUrl)
      this.addUrl(res.data.nextUrl);
  });

spider.scraper

This method registers the spider's scraping function.

spider.scraper(fn);

This function will be given the following arguments:

  • $: the retrieved html loaded into cheerio and extended with artoo.scrape.
  • done: a callback to call when your scraping is done. This is a typical node.js callback taking a potential error as its first argument and the scraped data as its second.

Example

// Simplistic example to retrieve the page's title
spider.scraper(function($, done) {
  done(null, $('title').text());
});

Note

Any error thrown within this function will fail the current job but won't exit the process.


spider.scraperSync

Synchronous version of spider.scraper.

spider.scraperSync(fn);

This function will be given the following argument:

  • $: the retrieved html loaded into cheerio and extended with artoo.scrape.

Instead of calling a callback, the function must simply return the scraped data.

Example

// Simplistic example to retrieve the page's title
spider.scraperSync(function($) {
  return $('title').text();
});

Note

Any error thrown within this function will fail the current job but won't exit the process.


spider.result

Method accepting a callback dealing with jobs' results.

spider.result(fn);

This function will be given the following arguments:

  • err: a potential error that occurred during the job's scraping process.
  • req: the job's request you passed.
  • res: the resultant response.

Example

spider.result(function(err, req, res) {
  if (err) {
    console.log('Oh, no! An error!', err);
  }
  else {
    saveInDatabase(res.data);
  }
});

Retries

Note that within the result callback, you are given the opportunity to retry failed jobs.

There are three req methods that you can use to do so:

  • req.retry/req.retryLater: put the failed job at the end of the spider's queue so it can be retried later.
  • req.retryNow: put the failed job at the front of the spider's queue so it can be retried immediately.
spider.result(function(err, req, res) {
  if (err) {
    // Our job failed, let's retry now!
    req.retryNow();
  }
});

Note also that you can set the maxRetries option to avoid being trapped in an infinite loop of failures.
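
For instance, a quick sketch combining the autoRetry and maxRetries options described in spider.config below, so the spider retries failed jobs on its own:

// Letting the spider retry failed jobs by itself, at most twice per job
spider.config({
  autoRetry: true,
  maxRetries: 2
});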


spider.before

Register a middleware applying before the spider starts its job queue.

This is useful if you need to perform tasks like logging into a website before being able to perform your scraping tasks.

spider.before(fn);

The function passed will be given the following argument:

  • next: the function to call with an optional error if failed. Note that if such an error is passed when applying before middlewares, then the spider will fail globally.

Example

// Checking whether our database is available before launching the spider
var spider = sandcrawler.spider()
  .url('http://nicesite.com')
  .before(function(next) {
    if (databaseAvailable())
      return next();
    else
      return next(new Error('database-not-available'));
  });

sandcrawler.run(spider, function(err) {
  // database-not-available error here if our middleware failed
});

spider.beforeScraping

Register a middleware applying before the spider attempts to perform a scraping job.

This gives you the opportunity to discard a job before it's even performed.

spider.beforeScraping(fn);

The function passed will be given the following arguments:

  • req: the request about to be passed.
  • next: the function to call with an optional error if failed. Note that if such an error is passed when applying beforeScraping middlewares, then the job will be discarded.

Example

// Checking whether we already scraped the url before
var scrapedUrls = {};

var spider = sandcrawler.spider()
  .url('http://nicesite.com')
  .beforeScraping(function(req, next) {
    if (scrapedUrls[req.url]) {
      return next(new Error('already-scraped'));
    }
    else {
      scrapedUrls[req.url] = true;
      return next();
    }
  });

spider.afterScraping

Register a middleware applying after the spider has performed a scraping job.

spider.afterScraping(fn);

The function passed will be given the following arguments:

  • req: the passed request.
  • res: the resultant response.
  • next: the function to call with an optional error if failed. Note that if such an error is passed when applying afterScraping middlewares, then the job will be failed.

Example

// Validate the retrieved data
var spider = sandcrawler.spider()
  .url('http://nicesite.com')
  .scraperSync(function($) {
    return $('.title').scrape({
      title: 'text',
      href: 'href'
    });
  })
  .afterScraping(function(req, res, next) {
    if (!res.data.title || !res.data.href)
      return next(new Error('invalid-data'));
    else
      return next();
  });

spider.on/etc.

Every spider is a standard node event emitter.

This means you can use any of the event emitter's methods such as on or removeListener.

For more information about the events you can listen to, you should head towards the lifecycle part of this documentation.
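
As an illustration, here is a sketch of listening to lifecycle events. The event names and callback signatures used below are assumptions made for the example; the lifecycle part of the documentation lists the actual ones.

// Hypothetical event names, shown only to illustrate the event emitter interface
spider.on('job:success', function(job) {
  console.log('Success:', job.req.url);
});

spider.on('job:fail', function(err, job) {
  console.log('Failure:', job.req.url, err.message);
});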


spider.config

This method can be used to tune the spider's configuration.

spider.config(object);

Options

  • auth ?object: an object containing at least a user and optionally a password to authenticate through http.
  • autoRetry ?boolean [false]: should the spider attempt to retry failed jobs on its own?
  • body ?object|string: if bodyType is set to 'form', either a querystring or an object that will be formatted as a querystring. If bodyType is set to 'json', either a JSON string or an object that will be stringified.
  • bodyType ?string ['form']: either 'form' or 'json'.
  • concurrency integer [1]: number of jobs to perform at the same time.
  • cheerio ?object: options passed to cheerio when parsing the results.
  • cookies ?array: array of cookies to send with the request. Each cookie can be given as a string or as an object that will be passed to tough-cookie.
  • headers ?object: object of custom headers to send with the request.
  • jar ?boolean|object|string: if true the spider will keep the received cookies to use them in further requests. Can also take a path where cookies will be stored thanks to tough-cookie-filestore so you can re-use them later. Finally, it can take a tough-cookie or request jar object (note that you can also access a spider's jar through spider.jar).
  • limit ?integer: max number of jobs to perform.
  • maxRetries ?integer [3]: max number of times a job can be retried.
  • method ?string ['GET']: http method to use.
  • proxy ?string|object: a proxy to use for the requests.
  • timeout ?integer [5000]: time in milliseconds to perform the jobs before triggering a timeout.

Example

spider.config({
  proxy: 'http://my-proxy.fr',
  method: 'POST',
  timeout: 50 * 1000
});
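
As a sketch of the jar option described above (the './cookies.json' path is only an illustrative value):

// Keeping received cookies in memory for subsequent requests
spider.config({jar: true});

// Or persisting them to disk through tough-cookie-filestore
spider.config({jar: './cookies.json'});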

spider.timeout

spider.timeout(milliseconds);

Shorthand for:

spider.config({timeout: milliseconds});

spider.limit

spider.limit(nb);

Shorthand for:

spider.config({limit: nb});

spider.validate

Gives you an opportunity to validate the scraped data before the result callback.

spider.validate(spec);

Argument

  • spec typologyDefinition|function: either a typology definition or a custom function taking as sole argument the scraped data.

Examples

// The scraped data must be a string or a number
spider.validate('string|number');

// The scraped title must be at least 5 characters
spider.validate(function(data) {
  return data.length >= 5;
});

Under the hood, this method registers a validation afterScraping middleware.


spider.throttle

Delays the scraping process of each job by the given amount of time (this is particularly helpful when you don't want to hit servers too hard and risk being kicked out for being too obvious in your endeavours).

spider.throttle(milliseconds);
spider.throttle(minMilliseconds, maxMilliseconds);

Either the job will be delayed for the given time in milliseconds or you can pass a minimum and a maximum, also in milliseconds, to randomize the throttling.
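
For instance, using the two signatures shown above:

// Waiting 2 seconds before each job
spider.throttle(2000);

// Waiting a random amount of time comprised between 1 and 5 seconds
spider.throttle(1000, 5 * 1000);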


spider.use

Makes the spider use the given plugin.

spider.use(plugin);

For more information about plugins, you should head towards this section of the documentation.
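
As a rough sketch, assuming a plugin is simply a function receiving the spider instance (the plugins section remains the authoritative reference), it could look like this:

// Hypothetical plugin logging every url before it is scraped
function loggerPlugin(opts) {
  return function(spider) {
    spider.beforeScraping(function(req, next) {
      console.log('About to scrape', req.url);
      return next();
    });
  };
}

spider.use(loggerPlugin());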


spider.run

Starts the spider.

spider.run(callback);

Takes a single callback taking the following arguments:

  • err error: a JavaScript error if the spider failed globally.
  • remains array: an array consisting of every failed job along with its associated error.

Example

spider.run(function(err, remains) {
  if (err)
    console.log('The spider failed:', err.message);
  else
    console.log(remains.length + ' jobs failed.');
});

Note also that spider.run is an alias to sandcrawler.run(spider).

spider.run(function(err, remains) {
  //...
});

// is exactly the same as
sandcrawler.run(spider, function(err, remains) {
  //...
});

spider.pause

Pauses the spider execution.


spider.resume

Resumes the spider execution.
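
A minimal sketch combining both methods, pausing the spider from a result callback and resuming it later (the one-minute delay is arbitrary):

spider.result(function(err, req, res) {
  var self = this;

  self.pause();

  // Give the target server some rest before resuming
  setTimeout(function() {
    self.resume();
  }, 60 * 1000);
});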


spider.exit

Exits the spider and fails every job in the queue.

spider.result(function(err, req, res) {
  this.exit();
});

spider.run(function(err, remains) {
  // err.message will be 'exited'
});

job

Spiders materialize their scraping processes as jobs so they can track them and provide their users with useful information.

For instance, every url fed to a spider will be translated, in the spider's lifecycle, as a job having the following keys:

Keys

  • id: the job's unique id.
  • original: the exact feed you passed to the spider and that was used to create the job.
  • req: the job's request.
  • res: the job's response.
  • state: several state-related flags, like whether the job is failing, etc.
  • time: an object containing a start and stop node process hrtime.

job.req

The job's request object.

Keys

  • data: any arbitrary data attached by the user to the request.
  • retries: number of times the request was already retried.
  • url: the requested url (may differ from the url that was eventually hit).

And any other keys set by the user while adding the job through the url method.


job.res

The job's response object.

Keys

  • body: body of the response.
  • data: data as returned by the scraping function supplied to the spider.
  • error: the error that occurred, if the job failed.
  • headers: any response headers.
  • status: http status code.
  • url: the url that was eventually hit (after redirections, for instance).
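
As an illustration, a sketch reading some of the keys listed above from within a result callback:

// Logging some request & response information for each job
spider.result(function(err, req, res) {
  if (err)
    return console.log('Job failed after ' + req.retries + ' retries:', err.message);

  console.log('Hit ' + res.url + ' (status ' + res.status + ')');
  console.log('Scraped data:', res.data);
});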

Bonus

If you do not fancy spiders and believe they are creepy animals that should be shunned, you remain free to use less fearsome names such as droid or jawa:

var spider = sandcrawler.spider();
// is the same as
var droid = sandcrawler.droid();
// and the same as
var jawa = sandcrawler.jawa();