Privacy Web Corpus

About this project

This datascape is part of a larger research project managed by the medialab of SciencesPo and funded by the Axa Research Fund, in collaboration with the Axa Data Innovation Lab.

This project aims to explore and analyze the different modes of data regulation: law, social norms, code, and market (Lessig, 2000). It seeks to understand the role that actors such as insurance companies may play in managing risks and acting as a third party, in a context of growing personal data transactions and rising data breaches. The main question is how insurance can build trust and enable Big Data. The full project combines several approaches: use cases, text analysis, ethnographic study, and web analysis.

The Privacy Datascape

We define this website as a datascape (Latour et al., 2012). A datascape is a tool for exploring a dataset at different levels of aggregation and from different points of view, based on the attributes of each element of the corpus.

The philosophy of this datascape is to always be able to qualify the actors (web entities) and the terms of potential controversies (topics and text content of pages). To do this, we designed a tool for following the links between web entities, their pages, and the associated topics. We also included two visualization tools: a graph to locate web entities and a matrix to explore links between topics.
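As a rough illustration of the entity, page, and topic links described above, the following Python sketch builds a topic co-occurrence count of the kind a topic matrix could display. It is not the project's actual code or data model: the record layout, the example URLs, the topic labels, and the function names `topic_cooccurrence` and `topics_by_entity` are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: each page belongs to one web entity and carries topic tags.
# URLs, entities, and topic labels are invented for illustration only.
pages = [
    {"url": "https://example.org/a", "entity": "example.org",
     "topics": ["privacy", "insurance"]},
    {"url": "https://example.org/b", "entity": "example.org",
     "topics": ["privacy", "data breach"]},
    {"url": "https://example.com/c", "entity": "example.com",
     "topics": ["insurance", "data breach", "privacy"]},
]


def topic_cooccurrence(pages):
    """Count, for each unordered pair of topics, the pages tagged with both."""
    counts = Counter()
    for page in pages:
        # Sort so each pair is always stored in the same (alphabetical) order.
        for a, b in combinations(sorted(set(page["topics"])), 2):
            counts[(a, b)] += 1
    return counts


def topics_by_entity(pages):
    """Aggregate page-level topics up to the web entity that hosts the pages."""
    result = {}
    for page in pages:
        result.setdefault(page["entity"], set()).update(page["topics"])
    return result


matrix = topic_cooccurrence(pages)
entity_topics = topics_by_entity(pages)
```

Here `matrix` maps each topic pair to the number of pages sharing both topics, and `entity_topics` shows the same data aggregated at the web-entity level, which is one way to move between levels of aggregation.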

What is the Privacy datascape?

  • A structured dataset of web pages tagged with topics and topological attributes;
  • An interactive interface for users to search, navigate, and filter the dataset;
  • A tool to explore and extract qualitative data for further investigation.

What is the Privacy datascape not?

  • A real-time dataset: the crawl ended in September 2016;
  • The final result of a research project: it is a starting point for further research based on this corpus.



This website is a datascape intended for exploring a corpus built from web content crawled with the Hyphe software and categorized using LDA (Latent Dirichlet Allocation) topic modeling. This website is not public, and the content was collected for research purposes only.

All content is linked back to its original website. If you are the author of content reproduced here and would like it removed from the corpus, or if you have any questions or complaints, please contact us.

Medialab Team

Dominique Boullier, Maxime Crépel, Mathieu Jacomy, Benjamin Ooghe-Tabanou, Diego Antolinos-Basso, Paul Monsallier, Audrey Baneyx, Paul Girard