Enrichment tools
The enrichment workflow uses scraping and minet's API clients to collect data about URLs and their associated shared content, which are stored in the SQLite database's links table and shared_content table, respectively. The collection method that minall deploys depends on the URL's domain name.
The URLs in the SQLite database's links table are parsed with the ural Python library and grouped into the following subsets:
Subsets of URLs
subset | dataclass | module |
---|---|---|
URLs from Facebook | Normalized Facebook Post Data | minall.enrichment.crowdtangle.get_data |
URLs from YouTube | Normalized YouTube Video Data, Normalized YouTube Channel Data | minall.enrichment.youtube.get_data |
URLs from other social media platforms (which cannot be scraped) | NA | minall.enrichment.other_social_media.add_data |
URLs not from social media platforms (which can be scraped) | Normalized Scraped Web Page Data | minall.enrichment.article_text.get_data |
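As a sketch of this grouping step, the routing by domain name can be approximated with Python's standard library. Note that minall actually relies on ural's domain parsing; the platform set and subset names below are illustrative, mirroring the table above.

```python
from urllib.parse import urlparse

# Illustrative set of platforms treated as unscrapable social media content;
# the real routing is done with ural's domain parsing.
SOCIAL_MEDIA = {"facebook.com", "youtube.com", "twitter.com", "tiktok.com", "instagram.com"}

def get_domain(url: str) -> str:
    """Approximate ural's domain parsing with the standard library."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

def group_urls(urls: list) -> dict:
    """Sort URLs into the four enrichment subsets."""
    groups = {"facebook": [], "youtube": [], "other_social_media": [], "to_scrape": []}
    for url in urls:
        domain = get_domain(url)
        if domain == "facebook.com":
            groups["facebook"].append(url)
        elif domain == "youtube.com":
            groups["youtube"].append(url)
        elif domain in SOCIAL_MEDIA:
            groups["other_social_media"].append(url)
        else:
            groups["to_scrape"].append(url)
    return groups
```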
If the user has a Buzzsumo API token, all URLs, regardless of grouping by domain name, are searched in the Buzzsumo database.
All URLs
subset | dataclass | module |
---|---|---|
all URLs | Normalized Buzzsumo Exact URL Data | minall.enrichment.buzzsumo.get_data |
Due to the diversity of data available for different types of URLs and provided by different data sources, an important step in every enrichment procedure is normalizing the data. Each target URL's metadata, regardless of its domain name, must conform to the SQLite database's links table. This harmonization of data fields is an important feature of the minall workflow and is not (yet) replicated in minet.
The following table illustrates which of each data source's fields are matched to which column in the database's links table.
links SQL table | Normalized Scraped Web Page Data | Normalized Buzzsumo Exact URL Data | Normalized Facebook Post Data | Normalized YouTube Video Data | Normalized YouTube Channel Data | Normalized Tweet |
---|---|---|---|---|---|---|
url (TEXT) | X | X | X | X | X | X |
domain (TEXT) | X | X | X ("facebook.com") | X ("youtube.com") | X ("youtube.com") | X ("twitter.com") |
work_type (TEXT) | X ("WebPage") | X ("WebPage", "Article", "VideoObject") | X ("SocialMediaPosting", "ImageObject", "VideoObject") | X ("VideoObject") | X ("WebPage") | X ("SocialMediaPosting") |
duration (TEXT) | X | X | X | |||
identifier (TEXT) | X | X | X | X | ||
date_published (TEXT) | X | X | X | X | X | X |
date_modified (TEXT) | X | |||||
country_of_origin (TEXT) | X | |||||
abstract (TEXT) | X | X | X | |||
keywords (TEXT) | X | X | ||||
title (TEXT) | X | X | X | X | X | |
text (TEXT) | X | X | X | |||
hashtags (TEXT) | X | |||||
creator_type (TEXT) | X ("defacto:SocialMediaAccount") | X ("WebPage") | X ("defacto:SocialMediaAccount") |||
creator_date_created (TEXT) | X | X | ||||
creator_location_created (TEXT) | X | X | ||||
creator_identifier (TEXT) | X | X | X | X | ||
creator_facebook_follow (INTEGER) | ||||||
creator_facebook_subscribe (INTEGER) | X | |||||
creator_twitter_follow (INTEGER) | X | |||||
creator_youtube_subscribe (INTEGER) | X | |||||
creator_create_video (INTEGER) | X | |||||
creator_name (TEXT) | X | X | X | X | ||
creator_url (TEXT) | X | |||||
facebook_comment (INTEGER) | X | X | ||||
facebook_like (INTEGER) | X | |||||
facebook_share (INTEGER) | X | X | ||||
pinterest_share (INTEGER) | X | |||||
twitter_share (INTEGER) | X | X | ||||
tiktok_share (INTEGER) | X | |||||
tiktok_comment (INTEGER) | X | |||||
reddit_engagement (INTEGER) | X | |||||
youtube_watch (INTEGER) | X | X | ||||
youtube_comment (INTEGER) | X | |||||
youtube_like (INTEGER) | X | X | ||||
youtube_favorite (INTEGER) | ||||||
youtube_subscribe (INTEGER) | X | |||||
create_video (INTEGER) | X |
Note: no data field currently feeds creator_facebook_follow. The use of creator_facebook_follow still needs to be confirmed (Facebook accounts' FollowAction may have been made redundant by the SubscribeAction).
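The harmonization described above can be pictured as mapping each source's payload onto one shared record shape. The sketch below is hypothetical: the payload field names are invented stand-ins, and only a handful of the links columns are shown.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class NormalizedRecord:
    # A few of the shared 'links' columns; every data source must map into these.
    url: str
    domain: Optional[str] = None
    work_type: Optional[str] = None
    title: Optional[str] = None

def normalize_youtube_video(payload: dict) -> NormalizedRecord:
    # "title" is a hypothetical stand-in for the YouTube API's field name
    return NormalizedRecord(
        url=payload["url"],
        domain="youtube.com",
        work_type="VideoObject",
        title=payload.get("title"),
    )

def normalize_scraped_page(payload: dict) -> NormalizedRecord:
    return NormalizedRecord(
        url=payload["url"],
        domain=payload.get("domain"),
        work_type="WebPage",
        title=payload.get("title"),
    )

row = asdict(normalize_youtube_video({"url": "https://www.youtube.com/watch?v=x", "title": "Demo"}))
```

Whatever the source, the resulting dict has the same keys, so every record can be written to the same links table.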
minall.enrichment.enrichment
Class for data collection and coalescing.
With the class Enrichment
, this module manages the data collection process.
Enrichment
Source code in minall/enrichment/enrichment.py
__init__(links_table, shared_content_table, keys)
From given API keys and URL data set, filter URLs by domain and initialize data enrichment class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
links_table | BaseTable | BaseTable class instance of the SQL table for the URL dataset. | required |
shared_content_table | BaseTable | BaseTable class instance of the SQL table for shared content related to URLs in the dataset. | required |
keys | APIKeys | APIKeys class instance of minet API client configurations. | required |
Source code in minall/enrichment/enrichment.py
buzzsumo()
For all URLs, collect data from Buzzsumo and coalesce in the database's 'links' table.
Source code in minall/enrichment/enrichment.py
facebook()
For Facebook URLs, collect data from CrowdTangle and coalesce in the database's 'links' and 'shared_content' tables.
Source code in minall/enrichment/enrichment.py
other_social_media()
For select URLs, update the 'work_type' column in the database's 'links' table with the value 'SocialMediaPosting'.
Source code in minall/enrichment/enrichment.py
scraper()
For select URLs, collect data via scraping and coalesce in the database's 'links' table.
Source code in minall/enrichment/enrichment.py
twitter()
For Twitter URLs, scrape data from the site and coalesce in the database's 'links' and 'shared_content' tables.
Source code in minall/enrichment/enrichment.py
youtube()
For YouTube URLs, collect data from YouTube API and coalesce in the database's 'links' table.
Source code in minall/enrichment/enrichment.py
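The "coalesce" step that these methods mention can be illustrated with plain sqlite3: newly collected values fill empty columns without overwriting data already present in the links table. This is a sketch of the idea, not minall's actual SQL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE links (url TEXT PRIMARY KEY, title TEXT, domain TEXT)")
con.execute("INSERT INTO links VALUES ('https://example.com', 'Old title', NULL)")

# Upsert: existing non-NULL values win, NULL columns take the new value
con.execute(
    """
    INSERT INTO links (url, title, domain) VALUES (?, ?, ?)
    ON CONFLICT (url) DO UPDATE SET
        title = COALESCE(links.title, excluded.title),
        domain = COALESCE(links.domain, excluded.domain)
    """,
    ("https://example.com", "New title", "example.com"),
)
row = con.execute("SELECT title, domain FROM links").fetchone()
# 'Old title' is preserved; the empty domain column is filled in
```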
minall.enrichment.utils
Functions for data collection.
This module provides the following class and functions:
- get_domain(url) - Parse domain from URL string.
- apply_domain(url) - Generate SQL query to insert domain into table.
- FilteredLinks(table) - From SQL table, select subsets of URLs based on domain name.
FilteredLinks
Selects all URLs from SQL table and returns subsets.
Source code in minall/enrichment/utils.py
facebook: List[str]
property
List of URLs from Facebook.
Returns:
Type | Description |
---|---|
List[str] | List of URL strings. |
other_social: List[str]
property
List of URLs from other social media platforms.
Returns:
Type | Description |
---|---|
List[str] | List of URL strings. |
to_scrape: List[str]
property
List of URLs not from social media platforms.
Returns:
Type | Description |
---|---|
List[str] | List of URL strings. |
twitter: List[str]
property
List of URLs from Twitter.
Returns:
Type | Description |
---|---|
List[str] | List of URL strings. |
youtube: List[str]
property
List of URLs from YouTube.
Returns:
Type | Description |
---|---|
List[str] | List of URL strings. |
__init__(table)
Select and store all URLs from a target SQL table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table | BaseTable | Target SQL table. | required |
Source code in minall/enrichment/utils.py
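A minimal stand-in for this class, working on a plain list of URL strings instead of a BaseTable instance, behaves like this (the standard-library domain parsing approximates what ural does):

```python
from urllib.parse import urlparse

class FilteredLinksSketch:
    """Illustrative stand-in for FilteredLinks, backed by a list of URLs
    rather than an SQL table."""

    def __init__(self, urls):
        self.all_links = urls

    @staticmethod
    def _domain(url):
        netloc = urlparse(url).netloc.lower()
        return netloc[4:] if netloc.startswith("www.") else netloc

    @property
    def youtube(self):
        return [u for u in self.all_links if self._domain(u) == "youtube.com"]

    @property
    def twitter(self):
        return [u for u in self.all_links if self._domain(u) == "twitter.com"]
```

Each property re-filters the stored URL set, so every enrichment routine can request just the subset it knows how to handle.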
apply_domain(url)
Compose SQL query to update the domain column of a URL's row in the 'links' SQLite table.
Examples:
>>> apply_domain(url="https://www.youtube.com/channel/MkDocs")
("UPDATE links SET domain = 'youtube.com' WHERE url = 'https://www.youtube.com/channel/MkDocs'", 'youtube.com')
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL string. | required |
Returns:
Type | Description |
---|---|
Tuple[str \| None, str \| None] | If the domain was parsed, a tuple containing the SQL query and the domain name. |
Source code in minall/enrichment/utils.py
get_domain(url)
Parse the domain name of a given URL string.
Examples:
>>> get_domain(url="https://www.youtube.com/channel/MkDocs")
'youtube.com'
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL string. | required |
Returns:
Type | Description |
---|---|
str \| None | If successfully parsed, the domain name. |
Source code in minall/enrichment/utils.py
minall.enrichment.buzzsumo
Enrichment workflow's Buzzsumo data collection.
Modules exported by this package:
- normalizer: Dataclass to normalize minet's Buzzsumo result object.
- contexts: Context manager for the client's CSV writers, multi-threader, and progress bar.
- get_data: Function that runs the full Buzzsumo enrichment process.
- client: Wrapper for minet's Buzzsumo API client that normalizes minet's result.
minall.enrichment.buzzsumo.normalizer
Module contains constants for minet's Buzzsumo API client and a dataclass to normalize minet's Buzzsumo result.
NormalizedBuzzsumoResult
dataclass
Bases: TabularRecord
Dataclass to normalize minet's Buzzsumo API result.
Attributes:
Name | Type | Description |
---|---|---|
url | str | Target URL searched in the Buzzsumo database. |
work_type | str | Target URL's ontological subtype, i.e. "WebPage", "Article", "VideoObject". |
twitter_share | int | Number of times the target URL appeared on Twitter. |
facebook_share | int | Number of times the target URL appeared on Facebook. |
title | str | Title of the target URL's web content. |
date_published | datetime | Date the target URL's web content was published. |
pinterest_share | int | Number of times the target URL appeared on Pinterest. |
creator_name | str | Entity intellectually responsible for (author of) the target URL's web content. |
creator_identifier | str | If the target URL is a social media post, the platform's identifier for the author. |
duration | int | If the target URL is of a video, the video's duration. |
facebook_comment | int | Number of times Facebook users commented on a post containing the target URL. |
youtube_watch | int | If the target URL is of a YouTube video, number of times YouTube users watched the video. |
youtube_like | int | If the target URL is of a YouTube video, number of times YouTube users liked the video. |
tiktok_share | int | If the target URL is of TikTok content, number of shares on TikTok. |
tiktok_comment | int | If the target URL is of TikTok content, number of times TikTok users commented on the content. |
reddit_engagement | int | Number of times the target URL appeared on Reddit. |
Source code in minall/enrichment/buzzsumo/normalizer.py
from_payload(url, data)
classmethod
Parses minet's Buzzsumo result and creates normalized dataclass.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Target URL searched in Buzzsumo's database. | required |
data | BuzzsumoArticle \| None | If the target URL was found in the Buzzsumo database, minet's result; otherwise, None. | required |
Returns:
Name | Type | Description |
---|---|---|
NormalizedBuzzsumoResult | NormalizedBuzzsumoResult | Dataclass that normalizes minet's Buzzsumo data. |
Source code in minall/enrichment/buzzsumo/normalizer.py
parse_buzzsumo_type(data)
classmethod
Helper function for transforming Buzzsumo's content classification into Schema.org subtype.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | BuzzsumoArticle \| None | If the target URL was found in the Buzzsumo database, minet's result; otherwise, None. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Schema.org subtype for the web content. |
Source code in minall/enrichment/buzzsumo/normalizer.py
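The transformation can be pictured as follows. The `video` and `article` flags are hypothetical stand-ins for whatever signals the real BuzzsumoArticle object carries; only the fallback-to-"WebPage" shape mirrors the documented behavior.

```python
from typing import Optional

def parse_buzzsumo_type_sketch(data: Optional[dict]) -> str:
    """Map (assumed) Buzzsumo content flags to a Schema.org subtype."""
    # Fall back to the generic "WebPage" subtype when nothing more specific applies
    if data is None:
        return "WebPage"
    if data.get("video"):
        return "VideoObject"
    if data.get("article"):
        return "Article"
    return "WebPage"
```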
minall.enrichment.buzzsumo.client
Module containing a wrapper for minet's Buzzsumo API client.
BuzzsumoClient
Wrapper for minet's Buzzsumo API client.
Examples:
>>> import os
>>> wrapper = BuzzsumoClient(token=os.environ["BUZZSUMO_TOKEN"])
>>> url="https://archive.fosdem.org/2020/schedule/event/open_research_web_mining/"
>>> wrapper(url)
NormalizedBuzzsumoResult(url='https://archive.fosdem.org/2020/schedule/event/open_research_web_mining/', work_type='Article', domain='fosdem.org', twitter_share=0, facebook_share=0, title='FOSDEM 2020 - Empowering social scientists with web mining tools', date_published=datetime.datetime(2024, 1, 4, 15, 48, 1), pinterest_share=0, creator_name=None, creator_identifier=None, duration=None, facebook_comment=0, youtube_watch=None, youtube_like=None, youtube_comment=None, tiktok_share=None, tiktok_comment=None, reddit_engagement=0)
Source code in minall/enrichment/buzzsumo/client.py
__call__(url)
Executes minet's Buzzsumo API client on a URL and returns normalized data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Target URL. | required |
Returns:
Name | Type | Description |
---|---|---|
NormalizedBuzzsumoResult | NormalizedBuzzsumoResult | Dataclass that normalizes minet's Buzzsumo result. |
Source code in minall/enrichment/buzzsumo/client.py
__init__(token)
Creates an instance of minet's BuzzsumoAPIClient and sets values for the Buzzsumo API's required begin-date and end-date parameters.
Examples:
>>> wrapper = BuzzsumoClient(token="<TOKEN>")
>>> type(wrapper)
<class 'minall.enrichment.buzzsumo.client.BuzzsumoClient'>
>>> type(wrapper.client)
<class 'minet.buzzsumo.client.BuzzSumoAPIClient'>
Parameters:
Name | Type | Description | Default |
---|---|---|---|
token | str | Buzzsumo API token. | required |
Source code in minall/enrichment/buzzsumo/client.py
minall.enrichment.buzzsumo.get_data
Module containing a function that runs all of the Buzzsumo enrichment process.
get_buzzsumo_data(data, token, outfile)
Main function for writing Buzzsumo API results to a CSV file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | List[str] | List of URLs. | required |
token | str | Token for the Buzzsumo API. | required |
outfile | Path | Path to the CSV file in which to write results. | required |
Source code in minall/enrichment/buzzsumo/get_data.py
minall.enrichment.buzzsumo.contexts
Module containing contexts for Buzzsumo data collection's CSV writer, progress bar, and multi-threader.
GeneratorContext
Source code in minall/enrichment/buzzsumo/contexts.py
__enter__()
Start the wrapper's context variables.
Returns:
Type | Description |
---|---|
Tuple[Progress, ThreadPoolExecutor] | Context variables. |
Source code in minall/enrichment/buzzsumo/contexts.py
__exit__(exc_type, exc_val, exc_tb)
Stop the Buzzsumo client wrapper's context variables.
Source code in minall/enrichment/buzzsumo/contexts.py
__init__()
Set up class for Buzzsumo client wrapper's contexts.
Source code in minall/enrichment/buzzsumo/contexts.py
WriterContext
Source code in minall/enrichment/buzzsumo/contexts.py
__enter__()
Start the CSV writer's context.
Returns:
Type | Description |
---|---|
csv.DictWriter | Context variable for writing CSV rows. |
Source code in minall/enrichment/buzzsumo/contexts.py
__exit__(exc_type, exc_val, exc_tb)
Stops the writer's context variable.
Source code in minall/enrichment/buzzsumo/contexts.py
__init__(links_file)
Set up class for iteratively writing normalized Buzzsumo results to CSV.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
links_file | Path | Path to the links table CSV file. | required |
Source code in minall/enrichment/buzzsumo/contexts.py
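The pattern these two context managers support, a thread pool streaming results into a csv.DictWriter, can be sketched without Buzzsumo at all. `fetch` is a hypothetical stand-in for the client call, and an in-memory buffer stands in for the links CSV file.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> dict:
    # Stand-in for the API client call; a real run would hit Buzzsumo here
    return {"url": url, "twitter_share": 0}

buffer = io.StringIO()  # a real run would open the links CSV file instead
writer = csv.DictWriter(buffer, fieldnames=["url", "twitter_share"])
writer.writeheader()

urls = ["https://a.example/1", "https://b.example/2"]
with ThreadPoolExecutor(max_workers=2) as executor:
    # executor.map preserves input order, so rows stay aligned with urls
    for row in executor.map(fetch, urls):
        writer.writerow(row)

lines = buffer.getvalue().splitlines()
```

Keeping a single writer outside the pool means the worker threads only produce dicts; all CSV output happens in one place.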
minall.enrichment.twitter
minall.enrichment.twitter.normalizer
Module contains constants for minet's Twitter Guest API Scraper and a dataclass to normalize minet's result.
NormalizedSharedLink
dataclass
Bases: TabularRecord
Dataclass to normalize media embedded in a Tweet.
Attributes:
Name | Type | Description |
---|---|---|
post_url | str | URL of the Tweet in which the content is embedded. |
content_url | str | URL of the embedded content. |
media_type | str | Content URL's ontological subtype, i.e. "VideoObject". |
height | int \| None | Height of the embedded visual media content. Default = None. |
width | int \| None | Width of the embedded visual media content. Default = None. |
Source code in minall/enrichment/twitter/normalizer.py
NormalizedTweet
dataclass
Bases: TabularRecord
Dataclass to normalize minet's Twitter API result.
Attributes:
Name | Type | Description |
---|---|---|
url | str | Target URL of the Tweet searched on Twitter. |
identifier | Optional[str] | Tweet ID. Default = None. |
date_published | Optional[datetime] | UTC timestamp of the Tweet's publication, parsed as a DateTime object. Default = None. |
text | Optional[str] | Text of the Tweet. Default = None. |
creator_date_created | Optional[datetime] | UTC timestamp of the user account's creation, parsed as a DateTime object. Default = None. |
creator_identifier | Optional[str] | User account ID. Default = None. |
creator_twitter_follow | Optional[int] | Number of accounts that follow the user account that published the Tweet. Default = None. |
creator_name | Optional[str] | Name of the user account. Default = None. |
twitter_share | Optional[int] | Number of times the Tweet was retweeted. Default = None. |
twitter_like | Optional[int] | Number of times the Tweet was liked. Default = None. |
domain | Optional[str] | Domain name of the target URL. Default = "twitter.com". |
work_type | str | Target URL's ontological subtype. Default = "SocialMediaPosting". |
hashtags | List | Hashtags embedded in the Tweet. Default = []. |
creator_type | str | Ontological subtype of the user account. Default = "defacto:SocialMediaAccount". |
Source code in minall/enrichment/twitter/normalizer.py
parse_shared_content(url, tweet)
Parse the potentially multiple media embedded in a Tweet and yield each one.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL of the Tweet with embedded media. | required |
tweet | Dict \| None | Full metadata of the Tweet. | required |
Yields:
Type | Description |
---|---|
NormalizedSharedLink \| None | If the media could be parsed, its normalized metadata. |
Source code in minall/enrichment/twitter/normalizer.py
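Assuming, hypothetically, that the tweet payload exposes its media as a list of dicts (the "media", "url", and "type" keys below are invented, not minet's actual schema), the generator's behavior looks like:

```python
from typing import Dict, Generator, Optional

def parse_shared_content_sketch(
    url: str, tweet: Optional[Dict]
) -> Generator[Dict, None, None]:
    """Yield one normalized record per media item embedded in a Tweet."""
    # Yield nothing when the Tweet could not be scraped
    if not tweet:
        return
    for media in tweet.get("media", []):
        yield {
            "post_url": url,
            "content_url": media.get("url"),
            "media_type": "VideoObject" if media.get("type") == "video" else "ImageObject",
        }
```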
minall.enrichment.twitter.scraper
Module contains the TweetScraper wrapper for setting up and calling minet's TweetGuestAPIScraper.
TweetScraper
Wrapper for running minet's Tweet Guest API Scraper.
Examples:
>>> scraper = TweetScraper()
>>>
>>> tweet_url = "https://twitter.com/Paris2024/status/1551605445156012038"
>>>
>>> url, tweet = scraper(tweet_url)
>>>
>>> url
'https://twitter.com/Paris2024/status/1551605445156012038'
>>>
>>> tweet["local_time"]
'2022-07-25T16:29:00'
Source code in minall/enrichment/twitter/scraper.py
__call__(url)
If the URL is of a Tweet and the ID can be parsed, scrape and return data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL of the Tweet. | required |
Returns:
Type | Description |
---|---|
Tuple[str, Dict \| None] | If data could be scraped, the target URL and the data; otherwise, the unsuccessful URL and None. |
Source code in minall/enrichment/twitter/scraper.py
__init__()
Set up minet's Twitter Guest API Scraper
Source code in minall/enrichment/twitter/scraper.py
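The ID parsing that the wrapper depends on can be sketched with a regular expression. This is illustrative only; minet has its own Twitter URL parsing.

```python
import re
from typing import Optional

def parse_tweet_id(url: str) -> Optional[str]:
    """Extract the numeric status ID from a Tweet URL, if present."""
    # A Tweet URL carries its numeric status ID after "/status/"
    match = re.search(r"/status/(\d+)", url)
    return match.group(1) if match else None
```

When no ID can be parsed, the wrapper can skip the scrape and return the URL paired with None, matching the `__call__` contract above.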
minall.enrichment.twitter.get_data
Module contains functions for collecting, normalizing, and writing data from Twitter.
get_twitter_data(data, links_outfile, shared_content_outfile)
Transforms a set of Twitter URLs into collected Tweet metadata, written to CSV files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | List[str] | Set of Twitter URLs. | required |
links_outfile | Path | Path to the CSV file for Tweet metadata. | required |
shared_content_outfile | Path | Path to the CSV file for metadata about links in Tweets. | required |
Source code in minall/enrichment/twitter/get_data.py
minall.enrichment.crowdtangle
Enrichment workflow's CrowdTangle data collection.
Modules exported by this package:
- normalizer: Dataclass to normalize minet's CrowdTangle result object.
- contexts: Context manager for the client's CSV writers, multi-threader, and progress bar.
- get_data: Function that runs the full CrowdTangle enrichment process.
- client: Wrapper for minet's CrowdTangle API client that normalizes minet's result.
- exceptions:
minall.enrichment.crowdtangle.normalizer
Module contains functions and dataclasses to normalize minet's CrowdTangle API client result.
NormalizedFacebookPost
dataclass
Bases: TabularRecord
Dataclass to normalize minet's CrowdTangle API result for Facebook posts.
Attributes:
Name | Type | Description |
---|---|---|
url | str | Target Facebook URL. |
work_type | str | Target Facebook content's ontological subtype, i.e. "SocialMediaPosting", "ImageObject", "VideoObject". |
duration | str | If the Facebook content is a video, the video's duration. |
identifier | str | Facebook's identifier for the post. |
date_published | str | Date of the Facebook post's publication. |
date_modified | str | Date when the Facebook post was last modified. |
title | str | If applicable, title of the Facebook post content. |
abstract | str | If applicable, description of the Facebook post content. |
text | str | If applicable, text of the Facebook post content. |
creator_identifier | str | Facebook's identifier for the post's creator. |
creator_name | str | Name of the entity responsible for the Facebook post's publication. |
creator_location_created | str | If available, principal country in which the entity responsible for the post's publication is located. |
creator_url | str | URL for the entity responsible for the Facebook post's publication. |
creator_facebook_subscribe | int | Number of Facebook accounts subscribed to the account of the entity responsible for the Facebook post's publication. |
facebook_comment | int | Number of comments on the Facebook post. |
facebook_like | int | Number of Facebook accounts that have liked the Facebook post. |
facebook_share | int | Number of times the Facebook post has been shared on Facebook. |
domain | str | Domain for the Facebook post's URL. Default = "facebook.com". |
creator_type | str | Ontological subtype for the Facebook post's creator. Default = "defacto:SocialMediaAccount". |
Source code in minall/enrichment/crowdtangle/normalizer.py
from_payload(url, result)
classmethod
Parses minet's CrowdTangle result and creates normalized dataclass.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Target Facebook URL. | required |
result | CrowdTanglePost | Result object returned from minet's CrowdTangle API client. | required |
Returns:
Name | Type | Description |
---|---|---|
NormalizedFacebookPost | NormalizedFacebookPost | Dataclass that normalizes minet's CrowdTangle data. |
Source code in minall/enrichment/crowdtangle/normalizer.py
NormalizedSharedContent
dataclass
Bases: TabularRecord
Dataclass for normalizing data about media content shared in a Facebook post.
Attributes:
Name | Type | Description |
---|---|---|
post_url | str | Target Facebook URL, which shared the media content. |
content_url | str | CrowdTangle's URI for the shared media. |
media_type | str | Ontological subtype for the shared media, i.e. "ImageObject". |
height | int \| None | If available, the height in pixels of the shared media. |
width | int \| None | If available, the width in pixels of the shared media. |
Source code in minall/enrichment/crowdtangle/normalizer.py
from_payload(url, media)
classmethod
Parses JSON data in CrowdTanglePost's "media" attribute.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL of the Facebook post that contains the shared media. | required |
media | dict | JSON in the CrowdTanglePost's "media" attribute. | required |
Returns:
Name | Type | Description |
---|---|---|
NormalizedSharedContent | NormalizedSharedContent | Dataclass that normalizes information about the Facebook post's shared media content. |
Source code in minall/enrichment/crowdtangle/normalizer.py
parse_media_type(type)
classmethod
Helper function to transform CrowdTangle's media classification into Schema.org's CreativeWork subtype.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
type | str \| None | If available, CrowdTangle's classification of the media object. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Schema.org CreativeWork subtype. |
Source code in minall/enrichment/crowdtangle/normalizer.py
parse_facebook_post(url, result)
Transform minet's CrowdTanglePost object into normalized data as CSV dict row.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Target Facebook URL. | required |
result | CrowdTanglePost \| None | If the CrowdTangle API returned a match, minet's CrowdTangle API result object. | required |
Returns:
Name | Type | Description |
---|---|---|
Dict | Dict | Normalized data for the Facebook post. |
Source code in minall/enrichment/crowdtangle/normalizer.py
parse_shared_content(url, result)
Generator that streams the "media" attribute from minet's CrowdTanglePost object and returns normalized data as CSV dict row.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Target Facebook URL. | required |
result | CrowdTanglePost | minet's CrowdTangle API client result object. | required |
Yields:
Type | Description |
---|---|
Dict | Formatted CSV dict row of normalized shared content data. |
Source code in minall/enrichment/crowdtangle/normalizer.py
minall.enrichment.crowdtangle.client
Module contains a client and helper functions for collecting data from CrowdTangle.
CTClient
Wrapper for minet's CrowdTangle API client with helper function for parsing Facebook post ID.
Source code in minall/enrichment/crowdtangle/client.py
__call__(url)
Execute collection of CrowdTangle data from parsed Facebook post ID.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Target Facebook URL. | required |
Returns:
Type | Description |
---|---|
Tuple[str, CrowdTanglePost \| None] | Target URL and, if successful, minet's CrowdTanglePost result object. |
Source code in minall/enrichment/crowdtangle/client.py
__init__(token, rate_limit)
Create instance of minet's CrowdTangle API client.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
token | str | CrowdTangle API token. | required |
rate_limit | int | CrowdTangle API rate limit. | required |
Source code in minall/enrichment/crowdtangle/client.py
adhoc_post_id_parser(url)
Helper function to catch and fix problems parsing Facebook post ID.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Target Facebook URL. | required |
Returns:
Type | Description |
---|---|
str \| None | If successful, the post ID for the target Facebook URL. |
Source code in minall/enrichment/crowdtangle/client.py
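As an illustration of the kind of fallback parsing involved, the sketch below pulls a trailing numeric ID out of common Facebook URL shapes. The URL patterns and the returned ID format are assumptions, not CrowdTangle's actual requirements.

```python
import re
from typing import Optional

def adhoc_post_id_sketch(url: str) -> Optional[str]:
    """Pull the trailing numeric ID from /posts/ and /videos/ URL shapes."""
    match = re.search(r"/(?:posts|videos)/(?:[^/]+/)?(\d+)", url)
    return match.group(1) if match else None
```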
parse_rate_limit(rate_limit)
Set default or convert rate limit string to integer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
rate_limit | int \| str \| None | Value of the rate limit for the CrowdTangle API. | required |
Returns:
Name | Type | Description |
---|---|---|
int | int | Converted rate-limit integer for the CrowdTangle API. |
Source code in minall/enrichment/crowdtangle/client.py
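Its behavior presumably resembles the following sketch; the default value here is arbitrary, not minall's actual default.

```python
from typing import Union

DEFAULT_RATE_LIMIT = 6  # placeholder; minall defines its own default

def parse_rate_limit_sketch(rate_limit: Union[int, str, None]) -> int:
    """Fall back to a default when unset; coerce strings like '10' to int."""
    if rate_limit is None:
        return DEFAULT_RATE_LIMIT
    return int(rate_limit)
```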
minall.enrichment.crowdtangle.get_data
Module contains functions for collecting, normalizing, and writing data from CrowdTangle.
get_facebook_post_data(data, token, rate_limit, links_outfile, shared_content_outfile)
Function to collect, normalize, and write data from CrowdTangle.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | List[str] | Set of target Facebook URLs. | required |
token | str | CrowdTangle API token. | required |
rate_limit | int \| str \| None | CrowdTangle API rate limit. | required |
links_outfile | Path | Path to the CSV file for Facebook post metadata. | required |
shared_content_outfile | Path | Path to the CSV file for shared content metadata. | required |
Source code in minall/enrichment/crowdtangle/get_data.py
yield_facebook_data(data, token, rate_limit)
Streams target Facebook URLs to multi-threading context and yields minet's results.
Parameters:

Name | Type | Description | Default
---|---|---|---
`data` | `List[str]` | Set of target Facebook URLs. | *required*
`token` | `str` | CrowdTangle API token. | *required*
`rate_limit` | `int` | CrowdTangle API rate limit. | *required*
Yields:

Type | Description
---|---
`Generator[Tuple[str, CrowdTanglePost \| None], None, None]` | Target Facebook URL and, if available, result of minet's CrowdTangle API client.
Source code in minall/enrichment/crowdtangle/get_data.py
minall.enrichment.crowdtangle.contexts
Context manager for CrowdTangle's CSV writers, multi-threader, and progress bar.
ContextManager
Source code in minall/enrichment/crowdtangle/contexts.py
__enter__()
Start the module's context variables.
Returns:

Type | Description
---|---
`Tuple[DictWriter, DictWriter, Progress]` | CSV writer for post metadata, CSV writer for shared content metadata, rich progress bar.
Source code in minall/enrichment/crowdtangle/contexts.py
__exit__(exc_type, exc_val, exc_tb)
Stop the module's context variables.
Source code in minall/enrichment/crowdtangle/contexts.py
__init__(links_file, shared_content_file)
Set up class for the module's contexts.
Parameters:

Name | Type | Description | Default
---|---|---|---
`links_file` | `Path` | Path to CSV file for post metadata. | *required*
`shared_content_file` | `Path` | Path to CSV file for posts' shared content metadata. | *required*
Source code in minall/enrichment/crowdtangle/contexts.py
minall.enrichment.crowdtangle.exceptions
Exceptions raised during data collection from CrowdTangle API.
This module contains exceptions raised during data collection from CrowdTangle API. The module contains the following exceptions:
- `NoPostID` - Neither minet nor minall's adhoc parser successfully recovered the Facebook post's ID.
- `PostNotfound` - CrowdTangle did not return a post matching the given post ID.
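A sketch of how these exceptions might be defined and handled. Only the exception names come from the documentation above; the class bodies and the `lookup_post` caller are hypothetical.

```python
class NoPostID(Exception):
    """Neither minet nor minall's adhoc parser recovered the post's ID."""


class PostNotfound(Exception):
    """CrowdTangle did not return a post matching the given post ID."""


def lookup_post(post_id):
    """Hypothetical caller illustrating where each exception would be raised."""
    if post_id is None:
        # Parsing failed upstream: no ID to query with.
        raise NoPostID
    # ...query CrowdTangle here; raise PostNotfound if nothing comes back...
    return {"id": post_id}
```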
minall.enrichment.youtube
Enrichment workflow's YouTube data collection.
Modules exported by this package:
- `normalizer`: Dataclass to normalize minet's YouTube result objects.
- `context`: Context manager for client's CSV writer and progress bar.
- `get_data`: Function that runs all of the YouTube enrichment process.
minall.enrichment.youtube.normalizer
Module contains dataclasses to normalize minet's YouTube Video and Channel result objects.
NormalizedYouTubeChannel
dataclass
Bases: TabularRecord
Dataclass to normalize minet's YoutubeChannel result object.
Attributes:

Name | Type | Description
---|---|---
`url` | `str` | Target YouTube channel URL.
`identifier` | `str` | YouTube's unique identifier for the channel.
`date_published` | `str` | Date when the channel was created.
`country_of_origin` | `str` | Primary country in which the channel publishes content.
`title` | `str` | Name of the channel.
`abstract` | `str` | Description of the channel.
`keywords` | `List[str]` | List of keywords attributed to the channel.
`youtube_subscribe` | `int` | Number of YouTube users who subscribe to the channel.
`create_video` | `int` | Number of videos the channel has published.
`domain` | `str` | Domain of target URL. Default = "youtube.com".
`work_type` | `str` | Ontological subtype of target web content. Default = "WebPage".
Source code in minall/enrichment/youtube/normalizer.py
from_payload(url, channel_result)
classmethod
Parses minet's channel result and creates a normalized dataclass.
Parameters:

Name | Type | Description | Default
---|---|---|---
`url` | `str` | Target YouTube channel URL. | *required*
`channel_result` | `YouTubeChannel` | minet's channel results for the target channel URL. | *required*
Returns:

Name | Type | Description
---|---|---
`NormalizedYouTubeChannel` | `NormalizedYouTubeChannel` | Dataclass that normalizes minet's channel results.
Source code in minall/enrichment/youtube/normalizer.py
NormalizedYouTubeVideo
dataclass
Bases: TabularRecord
Dataclass to normalize minet's YoutubeVideo result object.
Attributes:

Name | Type | Description
---|---|---
`url` | `str` | Target YouTube video URL.
`identifier` | `str` | YouTube's unique identifier for the video.
`date_published` | `str` | Date the video was published on YouTube.
`duration` | `str` | Duration of the video.
`title` | `str` | Title of the video.
`abstract` | `str` | Video's description.
`keywords` | `List` | List of keywords applied to video.
`youtube_watch` | `str` | Number of users who have watched the YouTube video.
`youtube_comment` | `str` | Number of users who have commented on the YouTube video.
`youtube_like` | `str` | Number of users who have liked the YouTube video.
`creator_type` | `str` | Ontological subtype of the video's channel.
`creator_name` | `str` | Name of the video's channel.
`creator_date_created` | `str` | Date when the video's channel was created.
`creator_location_created` | `str` | Primary country in which the video's channel publishes content.
`creator_identifier` | `str` | YouTube's unique identifier for the video's channel.
`creator_youtube_subscribe` | `str` | Number of YouTube accounts that subscribe to the video's channel.
`creator_create_video` | `str` | Number of videos the video's channel has published.
`domain` | `str` | Domain of target URL. Default = "youtube.com".
`work_type` | `str` | Ontological subtype of target web content. Default = "VideoObject".
Source code in minall/enrichment/youtube/normalizer.py
from_payload(url, channel_result, video_result)
classmethod
Parses minet's data for both a video and channel and creates a normalized dataclass.
Parameters:

Name | Type | Description | Default
---|---|---|---
`url` | `str` | Target YouTube video URL. | *required*
`channel_result` | `YouTubeChannel \| None` | minet's channel results containing metadata about the target video's channel. | *required*
`video_result` | `YouTubeVideo` | minet's video results for the target video. | *required*
Returns:

Name | Type | Description
---|---|---
`NormalizedYouTubeVideo` | `NormalizedYouTubeVideo` | Dataclass that normalizes and merges video and channel results.
Source code in minall/enrichment/youtube/normalizer.py
ParsedLink
Class to store up-to-date metadata about target YouTube URL.
This class's instance variables will be updated during the data collection process to reflect the target URL's video and/or channel metadata. If the target URL is of a video, its `ParsedLink` class instance should eventually be mutated to have a value in both the `video_result` and `channel_result` attributes, because 2 API calls will be made. If the target URL is of a channel, its `ParsedLink` class instance should eventually be mutated to have a value in the `channel_result` attribute, after 1 call to the YouTube API's channels endpoint.
Source code in minall/enrichment/youtube/normalizer.py
__init__(url)
Determine type of YouTube web content.
Parameters:

Name | Type | Description | Default
---|---|---|---
`url` | `str` | Target YouTube URL. | *required*
Attributes:

Name | Type | Description
---|---|---
`link_id` | `str` | Target YouTube URL.
`type` | `YoutubeChannel \| YoutubeVideo \| Any` | Result of ural's
`video_id` | `str \| None` | If a parsed YouTube type is a video, the result's
`channel_id` | `str \| None` | If a parsed YouTube type is a channel, the result's
`video_result` | `None` | Empty class instance variable for later storing minet's video result object.
`channel_result` | `None` | Empty class instance variable for later storing minet's channel result object.
Source code in minall/enrichment/youtube/normalizer.py
normalize(parsed_link)
Normalize minet result objects stored in instance variables of the `ParsedLink` class.
Parameters:

Name | Type | Description | Default
---|---|---|---
`parsed_link` | `ParsedLink` | Class instance with minet's YouTube API results. | *required*
Returns:

Name | Type | Description
---|---|---
`dict` | `dict` | Dictionary to be added to CSV row for 'links' SQL table.
Source code in minall/enrichment/youtube/normalizer.py
minall.enrichment.youtube.get_data
Module contains function to manage process of collecting and normalizing data about YouTube web content.
get_youtube_data(data, keys, outfile)
Collects and writes metadata about target YouTube videos and channels to a CSV file that will be inserted into 'links' SQL table.
Parameters:

Name | Type | Description | Default
---|---|---|---
`data` | `list[str]` | Set of target YouTube URLs. | *required*
`keys` | `list[str]` | Set of keys for YouTube API. | *required*
`outfile` | `Path` | Path to CSV file for 'links' SQL table. | *required*
Source code in minall/enrichment/youtube/get_data.py
minall.enrichment.youtube.context
Module containing contexts for YouTube data collection's CSV writer and progress bar.
ProgressBar
Context for rich progress bar.
Source code in minall/enrichment/youtube/context.py
__enter__()
Start the rich progress bar.
Returns:

Name | Type | Description
---|---|---
`Progress` | `Progress` | Context variable for rich progress bar.
Source code in minall/enrichment/youtube/context.py
__exit__(exc_type, exc_val, exc_tb)
Stops the progress bar's context variable.
Source code in minall/enrichment/youtube/context.py
Writer
Context for writing YouTube links metadata to CSV of 'links' SQL table.
Source code in minall/enrichment/youtube/context.py
__enter__()
Start the CSV writer's context.
Returns:

Type | Description
---|---
`DictWriter` | csv.DictWriter: Context variable for writing CSV rows.
Source code in minall/enrichment/youtube/context.py
__exit__(exc_type, exc_val, exc_tb)
Stops the writer's context variable.
Source code in minall/enrichment/youtube/context.py
__init__(links_file)
Set up class for iteratively writing normalized YouTube results to CSV.
Parameters:

Name | Type | Description | Default
---|---|---|---
`links_file` | `Path` | Path to the links table CSV file. | *required*
Source code in minall/enrichment/youtube/context.py
minall.enrichment.other_social_media
Module to update ontological subtype for social media posts whose data is not accessible.
minall.enrichment.other_social_media.add_data
Module contains function to write web content ontological subtype information to CSV.
The module contains a function that writes the ontological subtype "SocialMediaPosting" and the related target URL to a CSV, which will be inserted into the 'links' SQL table.
add_data(data, outfile)
For the set of target URLs, write the URL and the category "SocialMediaPosting" to a CSV row for insert into the 'links' SQL table.
Parameters:

Name | Type | Description | Default
---|---|---|---
`data` | `List[str]` | Target URLs. | *required*
`outfile` | `Path` | Path to CSV file for links. | *required*
Source code in minall/enrichment/other_social_media/add_data.py
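The documented behavior is simple enough to sketch directly with the standard library's `csv` module; the column names `url` and `work_type` are assumptions for illustration, not necessarily minall's actual header row.

```python
import csv
from pathlib import Path
from typing import List


def add_data(data: List[str], outfile: Path) -> None:
    """For each target URL, write a CSV row pairing the URL with the
    ontological subtype 'SocialMediaPosting', as documented above.
    (Column names are an assumption for illustration.)"""
    with open(outfile, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "work_type"])
        writer.writeheader()
        for url in data:
            writer.writerow({"url": url, "work_type": "SocialMediaPosting"})
```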
minall.enrichment.article_text
Enrichment workflow's HTML scraping features.
Modules exported by this package:
- `normalizer`: Dataclass to normalize minet's Trafilatura result object.
- `contexts`: Context manager for scraper's CSV writers, multi-threader, and progress bar.
- `get_data`: Function that runs all of the scraping process.
- `scraper`: Class and helper function for scraping HTML.
minall.enrichment.article_text.normalizer
Dataclass to normalize minet's Trafilatura result object.
NormalizedScrapedWebPage
dataclass
Bases: TabularRecord
Dataclass to normalize minet's Trafilatura result object.
Attributes:

Name | Type | Description
---|---|---
`url` | `str` | URL targeted for scraping.
`title` | `str \| None` | Title scraped from HTML.
`text` | `str \| None` | Main text scraped from HTML.
`date_published` | `str \| None` | Date scraped from HTML.
`work_type` | `str` | Target URL's ontological subtype. Default = "WebPage".
Source code in minall/enrichment/article_text/normalizer.py
minall.enrichment.article_text.scraper
Class and helper function for scraping HTML.
This module's `Scraper` class enhances minet's `request()` and `extract()` methods by providing additional support for unexpected HTML encodings.

1. Uses minet's `request()` method on a target URL to get a `Response` object.
2. Verifies that the `Response` object is encoded in some form of UTF-8.
3. Extracts the HTML body from the `Response`. [`text = response.text()`]
4. Uses bs4's fool-proof `UnicodeDammit` to parse the exact encoding. [`UnicodeDammit(text, "html.parser").declared_html_encoding`]
5. Gives the encoding to bs4's `BeautifulSoup` to parse the HTML.
6. Gives the `BeautifulSoup` result to minet's `extract()` method in order to return minet's `TrafilaturaResult` object.
Scraper
Class to manage HTML scraping.
Examples:
>>> scraper = Scraper()
>>> url, result = scraper(url='https://zenodo.org/records/7974793')
>>> url == result.canonical_url
True
>>> result.title
'Minet, a webmining CLI tool & library for python.'
Source code in minall/enrichment/article_text/scraper.py
__call__(url)
Requests and scrapes HTML, returning minet's Trafilatura Result object.
Parameters:

Name | Type | Description | Default
---|---|---|---
`url` | `str` | Target URL. | *required*
Returns:

Type | Description
---|---
`Tuple[str, TrafilaturaResult \| None]` | The target URL and, if scraping was successful, minet's Trafilatura Result object.
Source code in minall/enrichment/article_text/scraper.py
__init__(progress=None, total=None)
If provided the context of a rich progress bar, save it to the class instance and add the task 'Scraping webpage'.
Parameters:

Name | Type | Description | Default
---|---|---|---
`progress` | `Progress \| None` | Context of a rich progress bar instance. Defaults to None. | `None`
`total` | `int \| None` | Total number of items treated during progress context. Defaults to None. | `None`
Source code in minall/enrichment/article_text/scraper.py
good_response(response)
Verifies that the response that minet's request method returned is valid for scraping.
Parameters:

Name | Type | Description | Default
---|---|---|---
`response` | `Response` | Response object returned from minet's request method. | *required*
Returns:

Type | Description
---|---
`Response \| None` | If valid, the Response; otherwise None.
Source code in minall/enrichment/article_text/scraper.py
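A sketch of this validity check under stated assumptions: minet's `Response` is replaced by a minimal stand-in, and the exact criteria (HTTP status 200 plus a UTF-8 family encoding, per step 2 of the scraping pipeline above) are illustrative rather than minall's precise rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    """Minimal stand-in for minet's Response object."""
    status: int
    encoding: str


def good_response(response: Response) -> Optional[Response]:
    """Return the response if it looks scrapable: a successful status
    and an encoding in the UTF-8 family; otherwise None. (Illustrative
    criteria, not necessarily minall's exact checks.)"""
    if response.status == 200 and response.encoding.lower().replace("-", "").startswith("utf8"):
        return response
    return None
```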
minall.enrichment.article_text.get_data
Module contains a function that runs the scraping feature.
get_data(data, outfile)
Iterating through the target URLs, scrape data and write to out-file.
Parameters:

Name | Type | Description | Default
---|---|---|---
`data` | `list[str]` | Set of target URLs for scraping. | *required*
`outfile` | `Path` | Path to CSV file for writing normalized results. | *required*
Source code in minall/enrichment/article_text/get_data.py
minall.enrichment.article_text.contexts
Context manager for scraper's CSV writers, multi-threader, and progress bar.
ContextManager
Source code in minall/enrichment/article_text/contexts.py
__enter__()
Start the scraper's context variables.
Returns:

Type | Description
---|---
`Tuple[DictWriter, ThreadPoolExecutor, Progress]` | Context variables: CSV writer, multi-threader, and rich progress bar.
Source code in minall/enrichment/article_text/contexts.py
__exit__(exc_type, exc_val, exc_tb)
Stop the scraper's context variables.
Source code in minall/enrichment/article_text/contexts.py
__init__(links_file)
Set up class for scraper's contexts.
Parameters:

Name | Type | Description | Default
---|---|---|---
`links_file` | `Path` | Path to out-file for CSV writer. | *required*
Source code in minall/enrichment/article_text/contexts.py