The client-side scraping companion
artoo.js is a piece of JavaScript code meant to be run in your browser’s console to provide you with some scraping utilities.
This nice droid is loaded into the JavaScript context of any webpage through a handy bookmarklet you can instantly install by dropping the above icon onto your bookmark bar.
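For the curious: a bookmarklet is nothing more than a javascript: URL saved as a bookmark, and this one simply injects a script tag into the current page. A minimal sketch of such a loader might look like the following (the URL below is a placeholder, not the official one, and a real bookmarklet would be minified onto a single line):

javascript: (function() {
  var script = document.createElement('script');
  script.src = 'https://example.com/artoo.min.js'; // placeholder URL
  document.body.appendChild(script);
})();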
Web security has improved considerably in recent years, and most websites now prevent JavaScript code injection by relying on Content Security Policy headers.
But fear not: it is always possible to bypass them using browser extensions and/or proper browser configuration.
Now that you have installed artoo.js, let’s scrape the famous Hacker News in a single painless step:
artoo.scrape('td.title:nth-child(3)', {
  title: {sel: 'a'},            // the link's text
  url: {sel: 'a', attr: 'href'} // the link's href attribute
}, artoo.savePrettyJson);       // download the result as pretty-printed JSON
That’s it. You’ve just scraped Hacker News’ front page and downloaded the data as a pretty-printed JSON file*.
* If you need a more thorough scraper, check this out.
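Note that artoo.scrape also returns the scraped data, so you can inspect it in the console before deciding what to do with it. Under the same model as above, the result is a plain array of objects (the values in the comment below are, of course, made up):

var entries = artoo.scrape('td.title:nth-child(3)', {
  title: {sel: 'a'},
  url: {sel: 'a', attr: 'href'}
});
// entries ~ [{title: 'Some story', url: 'https://example.com/story'}, ...]
artoo.savePrettyJson(entries);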
« Why on earth should I scrape on my browser? Isn’t this insane? »
Well, before quitting the present documentation and running back to your beloved spiders, you should pause for a minute or two and read the reasons why artoo.js has made the choice of client-side scraping.
Usually, the scraping process goes as follows: we find sites from which we need to retrieve data and we build a program whose goal is to fetch those sites’ HTML and parse it to get what we need.
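As a minimal sketch of that classical flow, in plain JavaScript and assuming the target happily serves static HTML (run it from a console or a module, since it uses top-level await):

// Fetch the raw HTML, parse it, then query the resulting document.
const response = await fetch('https://example.com/');
const html = await response.text();
const doc = new DOMParser().parseFromString(html, 'text/html');
const titles = [...doc.querySelectorAll('a')].map(a => a.textContent);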
The only problem with this process is that, nowadays, websites are not just plain HTML. We need cookies, we need authentication, we need JavaScript execution and a million other things to get proper data.
So, over the years, to cope with this harsh reality, our scraping programs became complex monsters able to execute JavaScript, authenticate on websites and mimic human behaviour.
But, if you sit back and try to find other programs able to perform all those things, you’ll quickly come to this observation:
Aren’t we trying to rebuild web browsers?
So why shouldn’t we take advantage of this and start scraping within the cosy environment of web browsers? It has become really easy today to execute JavaScript in a browser’s console, and this is exactly what artoo.js is doing.
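To get a flavour of how little ceremony this requires, here is what ad-hoc scraping looks like from any modern browser’s console, with no library at all:

// Grab every link on the current page:
[...document.querySelectorAll('a')].map(function(a) {
  return {text: a.textContent.trim(), href: a.href};
});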
Using browsers as scraping platforms comes with a lot of advantages.
The intention here is not at all to say that classical scraping is obsolete but rather that client-side scraping is a possibility today and, what’s more, a useful one.
You’ll never find yourself crawling pages massively on a browser, but for most of your scraping tasks, client-side should enhance your productivity dramatically.
Contributions are more than welcome. Feel free to submit any pull request, as long as you add unit tests where relevant and they all pass.
To install the development environment, clone your fork and use the following commands:
# Install dependencies
npm install
# Testing
npm test
# Compiling dev & prod bookmarklets
gulp bookmarklets
# Running a test server hosting the concatenated file
npm start
# Running a https server hosting the concatenated file
# Note that you'll need some SSL keys (instructions to come...)
npm run https
artoo.js is being developed by Guillaume Plique @ SciencesPo - médialab.
Logo by Daniele Guido.
R2D2 ASCII logo by Joan Stark, aka jgs.
Under an MIT License.