sandcrawler.js is a scraping library aimed at performing complex scraping tasks, notably those where client-side JavaScript execution is needed.
But before tackling those, let's see how one could perform a simpler task: scraping the famous Hacker News.
First of all, we need to create a spider that will fetch Hacker News' front page so we can scrape its data.
For the sake of demonstration we'll gather only post titles and urls, though in a real use case you would probably want more than that.
// Let's start by requiring the library
var sandcrawler = require('sandcrawler');
// Now let's define a new spider and start chaining
var spider = sandcrawler.spider()
// What we need is to hit the following url
.url('https://news.ycombinator.com')
// With the following scraper
.scraper(function($, done) {
var data = $('td.title:nth-child(3)').scrape({
title: {sel: 'a'},
url: {sel: 'a', attr: 'href'}
});
done(null, data);
})
// So that we can handle its result
.result(function(err, req, res) {
console.log('Scraped data:', res.data);
});
To scrape our post titles and urls, we need to use a scraper function.
This function takes two arguments:
$: the retrieved page, wrapped in a jQuery-like selection engine augmented with artoo.js scraping helpers.
done: a callback to call once you are finished, taking an eventual error and your scraped data.
Once your scraping is done, or if an error occurred in the process, you'll be able to analyse the results of your actions within a dedicated callback taking the following arguments:
err: an eventual error.
req: the request that was performed.
res: the obtained response.
res.data, for instance, holds the results of your scraper function.
Once your spider is correctly defined, you can finally run it:
spider.run(function(err, remains) {
console.log('Finished!');
});
The run callback accepts two important arguments: err, an eventual error that prevented the spider from running, and remains, the jobs that failed along the way.
But one might notice that deploying such shenanigans just to scrape a single page is quite silly.
Needless to say, sandcrawler enables you to scrape multiple pages. Just pass an array of urls to the spider's urls
method and there you go:
spider.urls([
'https://news.ycombinator.com',
'https://news.ycombinator.com?p=2',
'https://news.ycombinator.com?p=3',
'https://news.ycombinator.com?p=4'
]);
Note also that if you prefer, or need, to deduce the next urls from the last scraped page, you can iterate or even add urls to the spider at runtime without further ado.
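The "deduce the next urls" part can be sketched with a plain helper. Note that nextHackerNewsUrl is a hypothetical function written for this example, not part of sandcrawler's API: it merely shows how one could compute the next page's url from the current one before feeding it back to the spider.

```javascript
// Hypothetical helper, not part of sandcrawler's API: given the current
// Hacker News url, compute the url of the next page by incrementing the
// `p` query parameter (the front page counts as p=1).
function nextHackerNewsUrl(currentUrl) {
  var match = currentUrl.match(/[?&]p=(\d+)/);
  var page = match ? parseInt(match[1], 10) : 1;

  return 'https://news.ycombinator.com?p=' + (page + 1);
}

console.log(nextHackerNewsUrl('https://news.ycombinator.com'));
// → https://news.ycombinator.com?p=2
```

The result of such a helper could then be handed back to the spider at runtime, as described above.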
But sometimes, static scraping is clearly not enough: you might need to perform trickier operations such as scrolling through a long infinite feed or triggering complex XHR requests requiring a great amount of authentication.
For this, you can use the library's phantom spiders that will use phantomjs for you:
// Just change
sandcrawler.spider();
// into
sandcrawler.phantomSpider();
If you ever need more information about the differences between regular spiders and phantom ones, you can read this page.
Prototyping scrapers server-side can be tiresome at times.
Fortunately, sandcrawler has been designed as artoo.js' big brother. The latter makes client-side scraping more comfortable and enables you to prototype, in your browser, the scrapers you will later run with sandcrawler.
Indeed, any sandcrawler scraper function can use artoo.js and jQuery seamlessly, so you can use your scripts both in the browser and on the server.
As a conclusion, know that, under the hood, sandcrawler's spiders are event emitters. This makes the creation of plugins a very easy task.
For the sake of the example, let's say you want to log "Yeah!" to the console each time a scraping job succeeds:
spider.on('job:success', function(job) {
console.log('Yeah!');
});
One can then easily create a plugin function by writing the following:
function myPlugin(opts) {
return function(spider) {
spider.on('job:success', function(job) {
console.log('Yeah!');
});
};
}
And plug it into your spiders likewise:
spider.use(myPlugin());
For more information about plugins or if you want to know if a plugin already exists to tackle your needs, you can check this page.
Now that you know the basics of sandcrawler, feel free to roam (or even scrape...) the present documentation, whose summary can be found on your left.