sandcrawler.js The server-side scraping companion.

Lifecycle


While they work, sandcrawler spiders follow a precise lifecycle and emit an event at each step, so that anyone can hook into them to track progress and implement custom logic.

Example

spider.on('spider:start', function() {
  console.log('The spider has started.');
});

spider.on('job:success', function(job) {
  console.log('Success for url:', job.res.url);
});

If you ever wonder what's inside the job objects passed around by most job-level events, be sure to check out this part of the documentation first.


Spider-level events

Job-level events


spider:start

Emitted when the spider starts.


spider:teardown

Emitted when the spider tears down. This is useful for plugins that need to clean up when the hooked spider finishes its work.


spider:success

Emitted when the spider succeeds. Note that the spider can succeed even if some jobs did not. Indeed, the spider will only be considered failed if a global error occurred while running.

Data

  • remains: array of unsuccessful jobs along with their related errors.

spider:fail

Emitted when the spider fails globally.

Data

  • err: the culprit.
  • remains: array of unsuccessful jobs along with their related errors.

spider:end

Emitted when the spider ends, whether it succeeded or failed.

Data

  • status: either success or fail.
  • remains: array of unsuccessful jobs along with their related errors.
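
As an illustration of the spider-level events above, here is a minimal sketch that records a run from start to end. It does not use sandcrawler itself: a plain Node.js EventEmitter stands in for the spider and the events are emitted by hand. It also assumes that the documented data fields reach the handler as positional arguments in the order listed (status, then remains), and the shape of the entries in remains is hypothetical.

```javascript
const { EventEmitter } = require('events');

// Stand-in for a real spider (assumption): a bare EventEmitter.
const spider = new EventEmitter();

const log = [];

spider.on('spider:start', function() {
  log.push('started');
});

// Assumption: `status` and `remains` arrive as positional arguments.
spider.on('spider:end', function(status, remains) {
  log.push('ended with status ' + status + ', ' + remains.length + ' job(s) remaining');
});

// Simulated run; the {error, job} entry shape is hypothetical.
spider.emit('spider:start');
spider.emit('spider:end', 'success', [
  { error: new Error('timeout'), job: { req: { url: 'http://example.com' } } }
]);

console.log(log.join('\n'));
```

With a real spider, the two emit calls would be replaced by actually running it.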

job:add

Emitted when a job is added to the spider's queue while the spider is running.

Data

  • job: the related job.

job:discard

Emitted when a job is discarded from the spider's queue because it was rejected by a beforeScraping middleware.

Data

  • err: the error that led to the job being discarded.
  • job: the related job.
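
To keep track of which jobs were rejected and why, a listener along the following lines could be used. This is only a sketch: a plain EventEmitter stands in for the spider, err and job are assumed to be passed as positional arguments in the documented order, and the job shape (a req.url property) is hypothetical.

```javascript
const { EventEmitter } = require('events');

// Stand-in for a real spider (assumption): a bare EventEmitter.
const spider = new EventEmitter();

const discarded = [];

// Assumption: `err` and `job` arrive as positional arguments, in that order.
spider.on('job:discard', function(err, job) {
  discarded.push({ url: job.req.url, reason: err.message });
});

// Simulated emission; the job shape ({req: {url}}) is hypothetical.
spider.emit('job:discard', new Error('already-visited'), {
  req: { url: 'http://example.com/page' }
});

console.log(discarded);
```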

job:start

Emitted when the spider starts processing a job.

Data

  • job: the related job.

job:scrape

Emitted when the spider starts scraping a job.

Data

  • job: the related job.

job:success

Emitted when a job succeeds.

Data

  • job: the related job.

job:fail

Emitted when a job fails.

Data

  • err: the related error.
  • job: the related job.
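
One possible use of job:fail is to group failures by cause. The sketch below counts failed jobs per error message. As before, the spider is simulated with a plain EventEmitter, the positional (err, job) signature is an assumption, and the error messages and job objects are made up for the example.

```javascript
const { EventEmitter } = require('events');

// Stand-in for a real spider (assumption): a bare EventEmitter.
const spider = new EventEmitter();

const failuresByReason = {};

// Assumption: `err` and `job` arrive as positional arguments, in that order.
spider.on('job:fail', function(err, job) {
  const reason = err.message;
  failuresByReason[reason] = (failuresByReason[reason] || 0) + 1;
});

// Simulated failures; messages and job shapes are hypothetical.
spider.emit('job:fail', new Error('timeout'), { req: { url: 'http://example.com/a' } });
spider.emit('job:fail', new Error('timeout'), { req: { url: 'http://example.com/b' } });
spider.emit('job:fail', new Error('status-404'), { req: { url: 'http://example.com/c' } });

console.log(failuresByReason);
```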

job:retry

Emitted when a job is retried.

Data

  • job: the related job.
  • when: either now or later.

job:end

Emitted when a job ends, whether it succeeded or failed.

Data

  • status: either success or fail.
  • job: the related job.
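
Since job:end fires for every finished job regardless of outcome, it is a convenient place to keep a running tally. The following sketch assumes the handler receives status and job as positional arguments in the documented order, and simulates the spider with a plain EventEmitter rather than a real sandcrawler instance.

```javascript
const { EventEmitter } = require('events');

// Stand-in for a real spider (assumption): a bare EventEmitter.
const spider = new EventEmitter();

const tally = { success: 0, fail: 0 };

// Assumption: `status` and `job` arrive as positional arguments, in that order.
spider.on('job:end', function(status, job) {
  // `status` is documented as either success or fail.
  tally[status] += 1;
});

// Simulated job completions; the job objects are hypothetical placeholders.
spider.emit('job:end', 'success', { req: { url: 'http://example.com/a' } });
spider.emit('job:end', 'fail', { req: { url: 'http://example.com/b' } });
spider.emit('job:end', 'success', { req: { url: 'http://example.com/c' } });

console.log(tally);
```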