Empowering social scientists
with web mining tools

FOSDEM 2020

Open Research Tools and Technologies Devroom

Guillaume Plique, SciencesPo médialab

Why and how to enable researchers to perform complex web mining tasks?

Guillaume Plique, a.k.a. Yomguithereal

logo-medialab w:350px

idefi w:150px

What is web mining?

Scraping

echojs

echojs-html
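
To make the idea concrete, here is a minimal scraping sketch using only Python's standard library. The HTML snippet is made up (it merely imitates a news aggregator's front page) and the `TitleScraper` and `scrape` names are hypothetical; real pages are far messier.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely imitating a news aggregator's front page.
HTML = """
<article><h2><a href="https://example.com/a">First story</a></h2></article>
<article><h2><a href="https://example.com/b">Second story</a></h2></article>
"""


class TitleScraper(HTMLParser):
    """Collects (title, url) pairs for every link found inside an <h2>."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.href = None
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
        elif tag == "a":
            self.href = None

    def handle_data(self, data):
        # Only keep text that sits inside a link that is itself inside an <h2>
        if self.href is not None and data.strip():
            self.items.append((data.strip(), self.href))


def scrape(html):
    parser = TitleScraper()
    parser.feed(html)
    return parser.items
```

Calling `scrape(HTML)` turns the page into rows that are ready to be written to a CSV file.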

Crawling

hyphe-network
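
A crawler, at its core, is a graph traversal. The sketch below runs a breadth-first crawl over a tiny in-memory "web" so it can run offline; `FAKE_WEB`, `crawl` and `get_links` are hypothetical names, and a real crawler would fetch pages and extract their links instead.

```python
from collections import deque

# Hypothetical miniature web: each page maps to the links it contains.
FAKE_WEB = {
    "a.org": ["b.org", "c.org"],
    "b.org": ["c.org", "d.org"],
    "c.org": ["a.org"],
    "d.org": ["e.org"],
    "e.org": [],
}


def crawl(start, max_depth, get_links=lambda page: FAKE_WEB.get(page, [])):
    """Breadth-first crawl returning the (source, target) links discovered."""
    seen = {start}
    queue = deque([(start, 0)])
    edges = []

    while queue:
        page, depth = queue.popleft()

        if depth >= max_depth:
            continue  # Do not expand pages sitting at the depth limit

        for target in get_links(page):
            edges.append((page, target))

            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))

    return edges
```

The list of edges returned by `crawl("a.org", 2)` is the raw material for a network of websites.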

Collecting data from APIs

twitter-api
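
Collecting data from an API mostly means following pagination until nothing is left. The sketch below assumes a cursor-based API and fakes it in memory; `FAKE_PAGES`, `fake_api_call` and the `next_cursor` field are illustrative assumptions, not any real platform's API.

```python
# Hypothetical in-memory "API": pages of results chained by a cursor.
FAKE_PAGES = {
    None: {"data": ["tweet-1", "tweet-2"], "next_cursor": "p2"},
    "p2": {"data": ["tweet-3"], "next_cursor": "p3"},
    "p3": {"data": ["tweet-4", "tweet-5"], "next_cursor": None},
}


def fake_api_call(cursor=None):
    """Stands in for an HTTP request to a paginated endpoint."""
    return FAKE_PAGES[cursor]


def collect(call=fake_api_call):
    """Yields every item, following cursors until the API reports no next page."""
    cursor = None

    while True:
        page = call(cursor)
        yield from page["data"]

        cursor = page["next_cursor"]
        if cursor is None:
            break
```

In real life you would also have to handle authentication, rate limits and errors, which is where things get tedious.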

But why is this useful to [social] sciences?

Bad take

  1. Every data collection in the social sciences is biased (e.g. the observer's paradox)
  2. On the Internet, people express themselves without being asked to
  3. What's more, they are not being observed (lol, I know...)
  4. Web mining is therefore a superior source of data for social sciences!

Good take

  1. Internet data comes with its own biases that you should be aware of
  2. Apply media studies and STS without moderation
  3. It still is another very interesting and large data source!

Web mining is hard

You need to know The Web™:

DNS HTTP HTML CSS JS DOM AJAX SSR CSR XPATH ...

How do you teach researchers web technologies?

  1. The same as anyone else really (CSS as sushi plates anyone?)
  2. What most consider as an easy layer of technologies really ISN'T
  3. We really are standing on the shoulders of giants

Teaching researchers how to scrape

  1. Fighting the platforms and their APIs
  2. Legal issues in some countries
  3. Sometimes forbidden to teach it (~lock picking)
  4. Publication wiggles (the monkey army)

Jupyterizing researchers is not a solution

  1. Some researchers have neither the time nor the will to learn Python and web stuff.
  2. We should be OK with that!

Web mining is HARD

It really is a craft.

The Internet is a dirty, dirty place

Browsers truly are heuristic wonders!

Multithreading, parallelization, throttling etc.
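
As an illustration of why this gets tricky, here is a sketch of per-domain throttling on top of a thread pool: requests to different domains run in parallel, while hits on the same domain are spaced out. `DomainThrottle` and `polite_fetch` are hypothetical names, and the default `fetch` is a stand-in that does not touch the network.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


class DomainThrottle:
    """Ensures requests to one domain are scheduled at least `delay` seconds apart."""

    def __init__(self, delay):
        self.delay = delay
        self.lock = threading.Lock()
        self.next_slot = {}  # domain -> earliest time the next request may start

    def wait(self, url):
        domain = urlsplit(url).netloc

        # Reserve a time slot under the lock, then sleep outside of it
        with self.lock:
            now = time.monotonic()
            slot = max(now, self.next_slot.get(domain, now))
            self.next_slot[domain] = slot + self.delay

        time.sleep(max(0.0, slot - time.monotonic()))
        return domain


def polite_fetch(urls, delay=0.05, threads=4, fetch=lambda url: len(url)):
    """Runs `fetch` over the urls concurrently; `fetch` stands in for a real HTTP call."""
    throttle = DomainThrottle(delay)
    timestamps = []  # (domain, time the request was allowed to start)

    def job(url):
        domain = throttle.wait(url)
        timestamps.append((domain, time.monotonic()))
        return fetch(url)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(job, urls))

    return results, timestamps
```

Forget the throttle and you hammer a single server with every thread you have, which is a good way to get blocked.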

We once got our whole university's access to Google cut off!

Complex spidering, scalability, storage, indexing, recombobulation, steam engines, fancy boats, unionization, agility, upper management, the Peter principle, eXtreme programming

Most of it is irrelevant and made up but you get the point...

How do we empower researchers then?

By designing tools suited to their research questions

SciencesPo's médialab

  1. Social Science Researchers
  2. Designers
  3. Engineers

A brief guided tour of tools we designed

  1. artoo.js
  2. minet
  3. Hyphe
  4. (Gazouilloire)

Parasitizing web browsers instead of emulating them!

artoo h:300px

Demo Time!

Leveraging bookmarklets to empower researchers

artoo h:450px

But can we scale up?

img

Non-contractual logo - Jules Farjas ©

Handling the pesky details for you

  1. Multithreaded, memory-efficient fetching from the web.
  2. Multithreaded, scalable crawling using a comfy DSL.
  3. Multiprocessed raw text content extraction from HTML pages.
  4. Multiprocessed scraping from HTML pages using a comfy DSL.
  5. URL-related heuristics utilities such as normalization and matching.
  6. Data collection from various APIs such as CrowdTangle.
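
To give a flavour of what "URL-related heuristics" can mean, here is a sketch of URL normalization with the standard library. This is not minet's actual implementation: `normalize_url` and the `TRACKING_PARAMS` list are assumptions made for the example, and real heuristics need many more rules.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption: these query parameters only serve tracking purposes.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}


def normalize_url(url):
    """Reduces a URL to a canonical form so that equivalent URLs compare equal."""
    parts = urlsplit(url.strip())

    # Lowercase the host and drop a leading "www."
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    # Keep the port only when it is not a default one
    if parts.port and parts.port not in (80, 443):
        host += ":%d" % parts.port

    # Drop tracking parameters and sort the remaining ones
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    )

    # Scheme, fragment and trailing slash are discarded
    normalized = host + parts.path.rstrip("/")
    if query:
        normalized += "?" + urlencode(query)

    return normalized
```

Matching then boils down to comparing normalized forms, e.g. `normalize_url(a) == normalize_url(b)`.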

The Unix philosophy

Do one thing well

xsv search -s url urls.csv | minet fetch url -d html > result.txt
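
The same philosophy applies to your own scripts: read CSV on one side, write CSV on the other, and let the shell do the plumbing. A minimal sketch of such a filter follows; `add_domain_column` is a hypothetical helper and not part of minet or xsv.

```python
import csv
from urllib.parse import urlsplit


def add_domain_column(infile, outfile, column="url"):
    """Copies a CSV stream, appending a `domain` column derived from `column`."""
    reader = csv.DictReader(infile)
    fieldnames = list(reader.fieldnames or []) + ["domain"]

    writer = csv.DictWriter(outfile, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()

    for row in reader:
        row["domain"] = urlsplit(row[column]).netloc
        writer.writerow(row)


# In a script meant to sit in a pipeline, one would call:
#   add_domain_column(sys.stdin, sys.stdout)
```

Because it only talks to streams, the function slots between any two CSV-speaking commands.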

Demo time!

The low-fi approach

Relocalizing data collection

  1. Sometimes you don't need a server
  2. We are rarely doing BigData™
  3. Let's put the researcher at the center so they can control their data

A programmatic API

Jupyter's back y'all!

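from minet import multithreaded_fetch

# Any iterable of URLs will do; this list is only a placeholder
urls_iterator = ['https://fosdem.org', 'https://medialab.sciencespo.fr']

for result in multithreaded_fetch(urls_iterator):
  print(result.status)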

How to enable researchers to crawl the Web?

hyphe

A dedicated interface

hyphe-network

Serving a robust methodology

hyphe-methodology h:450px

Non-trivial technical challenges

its-a-traph

Trade-off between scalability & usability

We need to be able to design user paths.

The future!

What about a GUI for minet?

Thank you for listening!

bernard-minet

Note: Google Trends example

Note: anecdote about the Selenium researchers

Note: used for several hit jobs

Note: used for polarisation