While working, sandcrawler spiders abide by a precise lifecycle and emit an event each time they do something, so that anyone can hook onto them to track their progress or implement custom logic.
Example

```js
spider.on('spider:start', function() {
  console.log('The spider has started.');
});

spider.on('job:success', function(job) {
  console.log('Success for url:', job.res.url);
});
```
If you ever wonder what's inside the job objects passed around by most job-level events, be sure to check out this part of the documentation first.
Spider-level events
spider:start
Emitted when the spider starts.

spider:teardown
Emitted when the spider tears down. This is useful for plugins needing to clean up when the hooked spider finishes its work.

spider:success
Emitted when the spider succeeds. Note that the spider can succeed even if some of its jobs did not: the spider will only be considered as failed if a global error occurred while running.

spider:fail
Emitted when the spider fails globally.

spider:end
Emitted when the spider ends, whether it succeeded or failed.
Data: `status`, a string being either `success` or `fail`.
Job-level events

job:add
Emitted when a job is added to the spider's queue while it is running.
Data: the added job.

job:discard
Emitted when a job is discarded from the spider's queue because it was rejected by a `beforeScraping` middleware.
Data: the discarded job.

job:start
Emitted when the spider starts processing a job.
Data: the concerned job.

job:scrape
Emitted when the spider starts scraping a job.
Data: the concerned job.

job:success
Emitted when a job succeeds.
Data: the concerned job.

job:fail
Emitted when a job fails.
Data: the concerned job.

job:retry
Emitted when a job is retried.
Data: the concerned job, along with `when`, a string being either `now` or `later`.

job:end
Emitted when a job ends, whether it succeeded or failed.
Data: the concerned job, along with `status`, a string being either `success` or `fail`.