artoo.js’ main goal is to provide you with some useful scraping helpers and this is precisely what the following methods do.
It is advisable, however, to check the quick start section of this documentation for a less exhaustive but more didactic presentation of the artoo.scrape method.
Note also that every method presented below comes with its jQuery plugin alias:
artoo.scrape('.class', params);
// equals
$('.class').scrape(params);
This helper is the heart of the library’s scraping techniques. It takes a selector as its root iterator, then the data model you intend to extract at each step of the iteration.
// Basic signature
artoo.scrape(iterator, model, [params, callback]);
// Alternative signatures
artoo.scrape(configObject, [callback]);
artoo.scrape(iterator, model, [callback]);
For instance, you might need to iterate over a list while extracting the id and the text of each of its elements.
artoo.scrape('li', {id: 'id', content: 'text'});
Alternatively, you can pass a single object as the argument to the scrape method, with iterator, data and params as its properties.
Choosing a data model when using artoo.scrape is just a matter of deciding whether you want the function to return an array of values or an array of items with properties you chose beforehand.
// Passing a single element to the method will
// return an array of the wanted data
artoo.scrape('ul > li', 'text');
>>> ['text of first li', 'text of second li']
// Passing an object to the method will
// return an array of items
artoo.scrape('ul > li', {text: 'text', id: 'id'});
>>> [
{text: 'text of first li', id: 'first-li'},
{text: 'text of second li', id: 'second-li'}
]
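To make the two return shapes concrete, here is a minimal sketch in plain JavaScript that mimics the behavior described above without artoo or a DOM. The fakeItems array and the shapeResults function are hypothetical stand-ins, not part of artoo's API: each plain object plays the role of one matched element, and a string retriever is simply looked up as a key.

```javascript
// Hypothetical stand-ins for the <li> elements matched by 'ul > li'.
const fakeItems = [
  {text: 'text of first li', id: 'first-li'},
  {text: 'text of second li', id: 'second-li'}
];

// Illustration only, not artoo's implementation: a string model yields
// an array of values, an object model yields an array of items carrying
// the chosen properties.
function shapeResults(items, model) {
  return items.map(function(item) {
    if (typeof model === 'string') return item[model];

    const result = {};
    Object.keys(model).forEach(function(key) {
      result[key] = item[model[key]];
    });
    return result;
  });
}

const values = shapeResults(fakeItems, 'text');
const records = shapeResults(fakeItems, {content: 'text', id: 'id'});
```

Here values is a flat array of strings, while records is an array of objects whose keys (content and id) come from the model rather than from the source data.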
Now that you know what kind of array you want to be returned, you need to specify how to retrieve the wanted data.
There are three ways to get what you want with artoo.scrape: a string, an expressive object, or a function.
Basically, if you pass a string as your retriever, artoo will try to apply the given jQuery method, text or html for instance, to the current item in the iteration; otherwise it will try to find a relevant attribute.
// We want to retrieve the html of the elements
artoo.scrape('ul > li', 'html');
// We want to retrieve the href of the elements
artoo.scrape('ul a', 'href');
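The "method first, attribute otherwise" rule for string retrievers can be sketched as follows. This is an assumption-laden illustration in plain JavaScript, not artoo's actual code: fakeLink is a hypothetical object standing in for a jQuery-wrapped element, and resolveStringRetriever is a made-up helper name.

```javascript
// Hypothetical element: methods such as text() and html() live on the
// object itself, while attributes are kept in a separate map.
const fakeLink = {
  text: function() { return 'Home'; },
  html: function() { return '<b>Home</b>'; },
  attrs: {href: '/home', id: 'home-link'}
};

// Illustration only: call the named method when one exists,
// otherwise fall back to the attribute of the same name.
function resolveStringRetriever(element, name) {
  if (typeof element[name] === 'function') return element[name]();
  return element.attrs[name];
}

const linkHtml = resolveStringRetriever(fakeLink, 'html');
const linkHref = resolveStringRetriever(fakeLink, 'href');
```

In this sketch 'html' resolves through the method branch and 'href' through the attribute fallback.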
If you need something a little more complex like subselection but want to stay concise, you can also use an expressive object.
// The sel attribute in the object passed as retriever enables you
// to perform a subselection
artoo.scrape('ul > li', {
text: {sel: 'span', method: 'text'},
url: {sel: 'a', attr: 'href'}
});
Possible properties for a retriever object include the following:
sel: a subselector (artoo applies .find(sel) to the current element in iteration). If a method property is given as a function, $(this) will correspond to this subselection.
method: the name of a jQuery method such as text or html, or a custom function.
Finally, if none of the above methods work for you, you remain free to pass a function that will perform the data retrieval.
Note that a function passed to artoo.scrape follows jQuery’s paradigm: this is a reference to the current DOM element in the iteration. You can therefore use $(this) as you would in any jQuery callback.
artoo.scrape('ul > li', function() {
return +$(this).attr('data-nb') * 4;
});
Note also that functions passed as retrievers can take an argument which is artoo’s internal jQuery reference. You can read the reason why here.
artoo.scrape('ul > li', {
text: function($) {
return $(this).text();
},
nb: function($) {
return +$(this).attr('data-nb');
}
})
Additionally, a reference to the current DOM element in the iteration is passed as the second parameter to the function, as in function($, el). This enables the use of arrow functions.
artoo.scrape('ul > li', {
text: ($, el) => $(el).text(),
nb: ($, el) => +$(el).attr('data-nb')
})
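The reason a second parameter matters for arrow functions is that an arrow function does not receive its own this, so a bound element never reaches it. The sketch below shows the difference in plain JavaScript; fakeElement, fake$ and callRetriever are hypothetical names used only for illustration and say nothing about how artoo invokes retrievers internally.

```javascript
// Hypothetical stand-ins for a DOM element and for a jQuery-like reference.
const fakeElement = {label: 'first item'};
const fake$ = function(el) {
  return {text: function() { return el.label; }};
};

// Illustration only: invoke a retriever with `this` bound to the element,
// passing the jQuery-like reference and the element as arguments.
function callRetriever(retriever, element) {
  return retriever.call(element, fake$, element);
}

// A classic function can rely on the bound `this`...
const viaThis = callRetriever(function($) {
  return $(this).text();
}, fakeElement);

// ...whereas an arrow function ignores the bound `this`
// and has to read the element from the second argument instead.
const viaArgument = callRetriever(($, el) => $(el).text(), fakeElement);
```

Both calls produce the same text, which is why the ($, el) form is a drop-in replacement for the function($) form.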
If you need recursion within the artoo.scrape method, rather than calling the method itself in a function retriever, you can also pass an object with the scrape property, as in the example below.
// This
artoo.scrape('ul.list > li', {
scrape: {
iterator: 'ul.sublist > li',
data: 'text'
}
});
// is the same as writing
artoo.scrape('ul.list > li', function($) {
return artoo.scrape($(this).find('ul.sublist > li'), 'text');
});
// And will return
>>> [['item1-1', 'item1-2'], ['item2-1', 'item2-2']]
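To see where the nested array shape comes from, the following plain JavaScript sketch reproduces it on hypothetical data. fakeList and scrapeNested are illustrative names only: each outer object stands in for a ul.list > li element and its sublist array for the matching ul.sublist > li children.

```javascript
// Hypothetical nested data standing in for the outer and inner list items.
const fakeList = [
  {sublist: [{text: 'item1-1'}, {text: 'item1-2'}]},
  {sublist: [{text: 'item2-1'}, {text: 'item2-2'}]}
];

// Illustration only: each outer item is scraped by scraping its children,
// so the result holds one inner array per outer item.
function scrapeNested(items) {
  return items.map(function(item) {
    return item.sublist.map(function(child) {
      return child.text;
    });
  });
}

const nested = scrapeNested(fakeList);
```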
scrapeOne works the same way as scrape but will only return the first element. It is strictly the same as passing 1 as the limit parameter.
artoo.scrapeOne(iterator, model, [params, callback]);
Example
artoo.scrapeOne('ul a', {url: 'href', title: 'text'});
>>> {
url: 'http://firstelement.com',
title: 'First Element'
}
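The relationship between a limit of 1 and a single returned item can be sketched in plain JavaScript. Everything here is hypothetical: fakeLinks, scrapeWithLimit and scrapeFirst are illustrative names, the second URL is invented, and the code does not claim to reflect artoo's internals.

```javascript
// Hypothetical data standing in for the elements matched by 'ul a'.
const fakeLinks = [
  {url: 'http://firstelement.com', title: 'First Element'},
  {url: 'http://secondelement.com', title: 'Second Element'}
];

// Illustration only: a scrape-like function honoring a `limit` parameter.
function scrapeWithLimit(items, params) {
  const limit = params && params.limit ? params.limit : items.length;
  return items.slice(0, limit);
}

// Illustration only: the "one" variant stops after the first element
// and hands that element back directly.
function scrapeFirst(items) {
  return scrapeWithLimit(items, {limit: 1})[0];
}

const firstLink = scrapeFirst(fakeLinks);
```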
scrapeTable is a handy helper when you need to scrape data from an HTML table.
artoo.scrapeTable(selector, [params, callback]);
Arguments
scrape method if needed.
callback
Headers
It is possible to specify headers for the scraped table. If you do so, instead of returning an array of arrays, the scrapeTable method will return an array of objects.
You can specify headers in the following ways:
'first' to declare the first row as headers, or 'th' if the table has regular HTML headers.
Example
First Name | Last Name | Points |
Jill | Smith | 50 |
Eve | Jackson | 94 |
John | Doe | 80 |
Adam | Johnson | 67 |
To scrape the above table and download it as a CSV file, just execute the following command:
artoo.scrapeTable('table.reference', {
headers: 'first',
done: artoo.saveCsv
});
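The effect of treating the first row as headers, turning an array of arrays into an array of objects, can be illustrated without a DOM. The sketch below is plain JavaScript under stated assumptions: tableRows is a hand-written copy of part of the table above, and withFirstRowAsHeaders is a hypothetical helper, not a function provided by artoo.

```javascript
// Hand-written rows standing in for the cells of the example table.
const tableRows = [
  ['First Name', 'Last Name', 'Points'],
  ['Jill', 'Smith', '50'],
  ['Eve', 'Jackson', '94']
];

// Illustration only: use the first row as property names
// and convert every remaining row into an object.
function withFirstRowAsHeaders(table) {
  const headers = table[0];
  return table.slice(1).map(function(row) {
    const item = {};
    headers.forEach(function(header, i) {
      item[header] = row[i];
    });
    return item;
  });
}

const people = withFirstRowAsHeaders(tableRows);
```

Each entry of people is keyed by the header cells, so a value can be read as people[0]['First Name'].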