sandcrawler's spiders enable you to perform complex scraping tasks.
Their purpose is to visit series of urls and to scrape the contents of the retrieved pages.
Introduction
Spider methods
--Feeding
--Scraping
--Lifecycle
--Configuration
--Controls
Job specification
Conclusion
Here is how a spider works:
var sandcrawler = require('sandcrawler');

// Create a named spider
var spider = sandcrawler.spider('MySpiderName');

// Feed it some urls to visit
spider.urls([
  'http://url1.com',
  'http://url2.com'
]);

// Register the scraping function to apply on each retrieved page
spider.scraper(function($, done) {
  done(null, $('.yummy-data').scrape());
});

// Register a callback dealing with each job's results
spider.result(function(err, req, res) {
  console.log('Yummy data!', res.data);
});

// Finally, run the spider
spider.run(function(err, remains) {
  console.log('Finished!');
});
The same spider can also be written in a chained fashion:

var spider = sandcrawler.spider('MySpiderName')
  .urls([
    'http://url1.com',
    'http://url2.com'
  ])
  .scraper(function($, done) {
    done(null, $('.yummy-data').scrape());
  })
  .result(function(err, req, res) {
    console.log('Yummy data!', res.data);
  })
  .run(function(err, remains) {
    console.log('Finished!');
  });
Note that if you need to perform your scraping tasks within a phantom (headless browser), you just need to change the spider type and everything should work the same:

var spider = sandcrawler.phantomSpider();
// instead of
var spider = sandcrawler.spider();
Be sure however to pay a visit to the Phantom Spider page of this documentation to avoid typical pitfalls.
This method can be used to add a single job to your spider's queue.
A job, in its simplest definition, is a mere url, but it can also be described by an object if you need finer parameters.
spider.url(feed [, when]);
Arguments

- feed (string|object): either a url string or a job descriptive object (see below).
- when (string) ['later']: later or now to control where on the stack we should add the job.

Job descriptive object:

- auth (object): an object containing at least a user and optionally a password to authenticate through http.
- body (object|string): if bodyType is set to 'form', either a querystring or an object that will be formatted as a querystring. If bodyType is set to 'json', either a JSON string or an object that will be stringified.
- bodyType (string) ['form']: either 'form' or 'json'.
- data (mixed): any arbitrary data you would like to attach to the job, to retrieve later (in the result callback, for instance).
- headers (object): custom http headers to send along with the request.
- method (string) ['GET']: http method to use.
- timeout (integer, in milliseconds) [5000]: time in milliseconds to perform the job before triggering a timeout.
- url (string|object): the job's url, either as a string or as a node url object.

Examples
// String url
spider.url('http://nicesite.com');

// Url object
spider.url({
  port: 8000,
  hostname: 'nicesite.com'
});

// Job object
spider.url({
  url: {
    port: 8000,
    hostname: 'nicesite.com'
  },
  headers: {
    'User-Agent': 'The jawa avenger'
  },
  data: {
    id: 'nice1',
    location: './test/'
  }
});
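The when argument can likewise be used to decide where the job lands in the queue; for instance:

// Add a job at the front of the queue rather than at the end
spider.url('http://nicesite.com', 'now');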
Same as spider.url, except you can pass an array of jobs.
spider.urls(feeds [, when]);
Examples
spider.urls([
  'http://nicesite.com',
  'http://prettysite.com'
]);

spider.urls([
  {url: 'http://nicesite.com', method: 'POST'},
  {url: 'http://prettysite.com', method: 'POST'}
]);
Note
Under the hood, spider.url and spider.urls are strictly the same. Distinguishing them is merely a matter of convention and style.
spider.addUrl

Alias of spider.url.

spider.addUrls

Alias of spider.urls.
This method takes a function returning the next url to scrape from the result of the last job, or false if you want the spider to stop.

spider.iterate(fn);

The given function will be passed the following arguments:

- i (integer): the index of the current job.
- req (object): the last job's request.
- res (object): the last job's response.
Example
// Spider starting on a single url and paginating from it
var spider = sandcrawler.spider()
  .url('http://nicesite.com')
  .iterate(function(i, req, res) {
    return res.data.nextUrl || false;
  })
  .scraper(function($, done) {
    done(null, {nextUrl: $('.next-page').attr('href')});
  });

// This is roughly the same as adding the next url at runtime
var spider = sandcrawler.spider()
  .url('http://nicesite.com')
  .scraper(function($, done) {
    done(null, {nextUrl: $('.next-page').attr('href')});
  })
  .result(function(err, req, res) {
    if (!err && res.data.nextUrl)
      this.addUrl(res.data.nextUrl);
  });
This method registers the spider's scraping function.
spider.scraper(fn);
This function will be given the following arguments:

- $: the retrieved page's content, loaded into an extended cheerio instance (providing, among other helpers, the .scrape method used in the examples).
- done (function): callback to call when the scraping is done, taking an eventual error as first argument and the scraped data as second.
Example
// Simplistic example to retrieve the page's title
spider.scraper(function($, done) {
  done(null, $('title').text());
});
Note
Any error thrown within this function will fail the current job but won't exit the process.
Synchronous version of spider.scraper.
spider.scraperSync(fn);
This function will be given the following argument:

- $: the retrieved page's content, loaded into an extended cheerio instance, same as above.

Rather than calling a callback, simply return the scraped data.
Example
// Simplistic example to retrieve the page's title
spider.scraperSync(function($) {
  return $('title').text();
});
Note
Any error thrown within this function will fail the current job but won't exit the process.
Method accepting a callback dealing with jobs' results.
spider.result(fn);
This function will be given the following arguments:

- err: an eventual error if the job failed.
- req: the job's request.
- res: the job's response.
Example
spider.result(function(err, req, res) {
  if (err) {
    console.log('Oh, no! An error!', err);
  }
  else {
    saveInDatabase(res.data);
  }
});
Retries
Note that within the result callback, you are given the opportunity to retry failed jobs.
There are three req methods you can use to do so:

- req.retry / req.retryLater: retry the failed job later, i.e. it will be added at the end of the spider's queue.
- req.retryNow: retry the failed job now, i.e. it will be added at the front of the spider's queue.
spider.result(function(err, req, res) {
  if (err) {
    // Our job failed, let's retry now!
    req.retryNow();
  }
});
Note also that you can set the maxRetries option so you don't get trapped in an infinite loop of failures.
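For instance, here is a minimal sketch combining a retry with the maxRetries setting described in the Configuration section below:

// Cap retries so a permanently failing job is eventually dropped
spider.config({maxRetries: 2});

spider.result(function(err, req, res) {
  if (err)
    req.retryNow();
});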
Register a middleware applying before the spider starts its job queue.
This is useful if you need to perform tasks like logging into a website before being able to perform your scraping tasks.
spider.before(fn);
The function passed will be given the following argument:

- next (function): callback to call when the middleware's work is done.

If an error is passed to next in one of the before middlewares, then the spider will fail globally.

Example
// Checking whether our database is available before launching the spider
var spider = sandcrawler.spider()
  .url('http://nicesite.com')
  .before(function(next) {
    if (databaseAvailable())
      return next();
    else
      return next(new Error('database-not-available'));
  });

sandcrawler.run(spider, function(err) {
  // database-not-available error here if our middleware failed
});
Register a middleware applying before the spider attempts to perform a scraping job.
This gives you the opportunity to discard a job before it's even performed.
spider.beforeScraping(fn);

The function passed will be given the following arguments:

- req: the job's request.
- next (function): callback to call when the middleware's work is done.

If an error is passed to next in one of the beforeScraping middlewares, then the job will be discarded.

Example
// Checking whether we already scraped the url before
var scrapedUrls = {};

var spider = sandcrawler.spider()
  .url('http://nicesite.com')
  .beforeScraping(function(req, next) {
    if (scrapedUrls[req.url]) {
      return next(new Error('already-scraped'));
    }
    else {
      scrapedUrls[req.url] = true;
      return next();
    }
  });
Register a middleware applying after the spider has performed a scraping job.
spider.afterScraping(fn);

The function passed will be given the following arguments:

- req: the job's request.
- res: the job's response.
- next (function): callback to call when the middleware's work is done.

If an error is passed to next in one of the afterScraping middlewares, then the job will fail.

Example
// Validate the retrieved data
var spider = sandcrawler.spider()
  .url('http://nicesite.com')
  .scraperSync(function($) {
    return $('.title').scrape({
      title: 'text',
      href: 'href'
    });
  })
  .afterScraping(function(req, res, next) {
    if (!res.data.title || !res.data.href)
      return next(new Error('invalid-data'));
    else
      return next();
  });
Every spider is a standard node event emitter.
This means you can use any of the event emitter's methods, such as on or removeListener.
For more information about the events you can listen to, you should head towards the lifecycle part of this documentation.
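As a quick sketch of the idea (the 'job:fail' event name is an assumption here; refer to the lifecycle page for the authoritative list of events and listener signatures):

// NB: 'job:fail' is an assumed event name - see the lifecycle page
function onFail(err, job) {
  console.log('A job failed:', err.message);
}

spider.on('job:fail', onFail);

// Listeners can be removed like on any node event emitter
spider.removeListener('job:fail', onFail);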
This method can be used to tune the spider's configuration.
spider.config(object);
Options

- auth (object): an object containing at least a user and optionally a password to authenticate through http.
- autoRetry (boolean) [false]: should the spider attempt to retry failed jobs on its own?
- body (object|string): if bodyType is set to 'form', either a querystring or an object that will be formatted as a querystring. If bodyType is set to 'json', either a JSON string or an object that will be stringified.
- bodyType (string) ['form']: either 'form' or 'json'.
- concurrency (integer) [1]: number of jobs to perform at the same time.
- jar (boolean|string|object): if true, the spider will keep the received cookies to use them in further requests. Can also take a path where cookies will be stored, thanks to tough-cookie-filestore, so you can re-use them later. Finally, can take a tough-cookie or request jar object (note that you can also access a spider's jar through spider.jar).
- maxRetries (integer) [3]: max number of times one can retry a job.
- method (string) ['GET']: http method to use.
- timeout (integer, in milliseconds) [5000]: time in milliseconds to perform the jobs before triggering a timeout.

Example
spider.config({
  proxy: 'http://my-proxy.fr',
  method: 'POST',
  timeout: 50 * 1000
});
spider.timeout(milliseconds);
Shorthand for:
spider.config({timeout: milliseconds});
spider.limit(nb);
Shorthand for:
spider.config({limit: nb});
Gives you an opportunity to validate the scraped data before the result callback.
spider.validate(spec);
Argument

- spec (string|function): either a type definition string the scraped data must comply with, or a custom validating function returning a boolean.
Examples
// The scraped data must be a string or a number
spider.validate('string|number');
// The scraped title must be at least 5 characters
spider.validate(function(data) {
  return data.length >= 5;
});
Under the hood, this method registers a validation afterScraping middleware.
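For instance, the function-based example above is roughly equivalent to this handwritten middleware (the 'invalid-data' error name is only illustrative; the name used internally may differ):

// Roughly what spider.validate(fn) registers under the hood
spider.afterScraping(function(req, res, next) {
  if (res.data.length >= 5)
    return next();
  else
    return next(new Error('invalid-data'));
});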
Delay the scraping process of each job by the given amount of time (this is particularly helpful when you don't want to hit servers too hard and want to avoid being kicked out for being too obvious in your endeavours).

spider.throttle(milliseconds);
spider.throttle(minMilliseconds, maxMilliseconds);

Either the jobs will be delayed by the given time in milliseconds, or you can pass a minimum and a maximum, also in milliseconds, to randomize the throttling.
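For instance:

// Wait 2 seconds before each job
spider.throttle(2000);

// Or wait between 1 and 5 seconds, at random, before each job
spider.throttle(1000, 5000);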
Makes the spider use the given plugin.
spider.use(plugin);
For more information about plugins, you should head towards this section of the documentation.
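As a minimal sketch, assuming a plugin is simply a function receiving the spider instance (see the plugins section for the authoritative pattern), a hand-rolled plugin could look like this:

// Hypothetical plugin logging when the spider is about to start
function announcerPlugin(spider) {
  spider.before(function(next) {
    console.log('Spider is about to start!');
    return next();
  });
}

spider.use(announcerPlugin);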
Starts the spider.
spider.run(callback);
Takes a single callback with the following arguments:

- err: an eventual global error.
- remains: an array of the failed jobs, along with the related errors.
Example
spider.run(function(err, remains) {
  if (err)
    console.log('The spider failed:', err.message);
  else
    console.log(remains.length + ' jobs failed.');
});
Note also that spider.run is an alias of sandcrawler.run(spider).
spider.run(function(err, remains) {
  //...
});

// is exactly the same as
sandcrawler.run(spider, function(err, remains) {
  //...
});
spider.pause

Pauses the spider's execution.

spider.resume

Resumes the spider's execution.
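A possible sketch, pausing the queue while some asynchronous work completes (saveInDatabase is a hypothetical helper taking a completion callback):

spider.result(function(err, req, res) {
  var self = this;

  // Hold the queue while we persist the data
  self.pause();

  saveInDatabase(res.data, function() {
    // Done persisting: let the spider move on
    self.resume();
  });
});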
spider.exit

Exits the spider and fails every job remaining in the queue.
spider.result(function(err, req, res) {
  this.exit();
});

sandcrawler.run(spider, function(err, remains) {
  // err.message will be 'exited'
});
Spiders materialize their scraping processes as jobs so they can track them and provide their users with useful information.
For instance, every url fed to a spider will be translated, in the spider's lifecycle, as a job having the following keys:
Keys

- req: the job's request object (see below).
- res: the job's response object (see below).
- time: an object containing the start and stop node process hrtimes of the job.

The job's request object holds everything describing the request, along with any other keys set by the user while adding the job through the url method.

The job's response object holds everything describing the response, notably the scraped data under res.data.
If you do not fancy spiders and believe they are creepy animals that should be shunned, you remain free to use less fearsome names such as droid or jawa:
var spider = sandcrawler.spider();
// is the same as
var droid = sandcrawler.droid();
// and the same as
var jawa = sandcrawler.jawa();