sandcrawler.js

The server-side scraping companion


Disclaimer: this library is an UNRELEASED work in progress.


sandcrawler.js is a nodejs/iojs library that aims to give developers concise but exhaustive tools to scrape the web.


// Scraping the famous Hacker News
var sandcrawler = require('sandcrawler');

var spider = sandcrawler.spider()
  .url('https://news.ycombinator.com/')
  .scraper(function($, done) {

    var data = $('td.title:nth-child(3)').scrape({
      title: {sel: 'a'},
      url: {sel: 'a', attr: 'href'}
    });

    done(null, data);
  })
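  .result(function(err, req, res) {
    // Report failed jobs instead of silently ignoring the error
    if (err) return console.error('Scraping failed:', err.message);

    console.log('Scraped data:', res.data);
  })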
  .run(function(err, remains) {
    console.log('And we are done!');
  });

Features

  • Spider abstraction: define your scraping jobs easily with the library's chainable methods and run them without further ado.
  • Fully customizable: You want really precise headers, a custom user-agent and complex logic? You'll have them all.
  • Phantomjs: Let the library handle phantomjs for you if you need browser emulation.
  • Complex dynamic scraping: Need to log into Facebook and auto-scroll/expand a full page? sandcrawler is made for you.
  • Reliable: Never lose data during your scraping process again thanks to the library's paranoid strategies.
  • Scalable: sandcrawler has been battle-hardened across the web's dirtiest places.
  • Easy prototyping: Design your scraping scripts within your browser thanks to artoo.js and use the same script with sandcrawler to perform the job.
  • Reusable logic: Creating plugins for sandcrawler is really easy. Use already existing ones or build yours to fit your needs.

Installation

You can install the latest version of sandcrawler.js with npm (note that this also installs phantomjs for you, through its npm package):

npm install sandcrawler@beta

You can also install the latest development version like so:

npm install git+https://github.com/medialab/sandcrawler.git

Usage

To start using the library, head to the Quick Start section for a fast tutorial, or browse the documentation using the navigation on your left.

var sandcrawler = require('sandcrawler');

Philosophy

sandcrawler.js is being developed by scraper developers for scraper developers with the following concepts in mind:

  • Not a framework: sandcrawler is a library and not a framework so that people can remain free to develop things their own way.
  • Exhaustivity over minimalistic API: every detail can be customized. This comes at the cost of a bigger code footprint for one-shot projects but brings more reliability for big ones.
  • Asynchronicity: sandcrawler is not trying to fight the asynchronous nature of client-side JavaScript. If you want to be able to perform complex scraping tasks on modern dynamic websites, you won't be able to avoid asynchronicity anyway.
  • Better workflow: sandcrawler aims at enabling developers to design their scraping scripts within the cosy environment of their browsers using artoo.js so they can automate them easily afterwards.
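
To make the asynchronicity point concrete, here is a minimal sketch in plain JavaScript of the error-first `done(err, data)` callback style that the Hacker News example relies on. It does not use sandcrawler at all: `fakeScraper` and `runAll` are hypothetical names invented for this illustration, and the "scraping" is simulated.

```javascript
// Hypothetical stand-in for a scraping function (not part of sandcrawler):
// it receives a `done` callback and reports its outcome asynchronously,
// with the error as first argument.
function fakeScraper(page, done) {
  setImmediate(function() {
    if (!page.html) return done(new Error('empty-page'));
    done(null, {length: page.html.length});
  });
}

// Runs every page through the scraper and collects successes and
// failures separately, in completion order.
function runAll(pages, callback) {
  var results = [],
      failures = [],
      pending = pages.length;

  if (!pending) return callback(results, failures);

  pages.forEach(function(page) {
    fakeScraper(page, function(err, data) {
      if (err) failures.push({url: page.url, error: err.message});
      else results.push({url: page.url, data: data});

      // Call back only once every job has reported
      if (--pending === 0) callback(results, failures);
    });
  });
}

// Usage: one page scrapes fine, the other fails
runAll(
  [{url: 'a', html: '<p>hi</p>'}, {url: 'b', html: ''}],
  function(results, failures) {
    console.log('Results:', results);
    console.log('Failures:', failures);
  }
);
```

Nothing here blocks while a job is pending, and a failed job is recorded rather than thrown, which is the general shape the library's own callbacks follow in the example at the top of this page.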

Plugins

  • sandcrawler-logger: Simple logger to plug into one of your spiders for immediate feedback.
  • sandcrawler-dashboard: A handy terminal dashboard displaying advanced information about one of your spiders.

Contribution

Contributions are more than welcome. Feel free to submit a pull request, provided you have added unit tests where relevant and all tests pass.

To install the development environment, clone your fork and use the following commands:

# Install dependencies
npm install

# Testing
npm test

Authors

sandcrawler.js is being developed by Guillaume Plique @ SciencesPo - médialab.

Logo by Daniele Guido.


Under an MIT License.