Enrichment tools
The enrichment workflow uses scraping and minet's API clients to collect data about URLs and their associated shared content, which are stored in the SQLite database's links table and shared_content table, respectively. The collection method that minall deploys depends on the URL's domain name.
The URLs in the SQLite database's links table are parsed with the ural Python library and grouped into the following subsets:
Subsets of URLs
subset | dataclass | module |
---|---|---|
URLs from Facebook | Normalized Facebook Post Data | minall.enrichment.crowdtangle.get_data |
URLs from YouTube | Normalized YouTube Video Data, Normalized YouTube Channel Data | minall.enrichment.youtube.get_data |
URLs from other social media platforms (which cannot be scraped) | NA | minall.enrichment.other_social_media.add_data |
URLs not from social media platforms (which can be scraped) | Normalized Scraped Web Page Data | minall.enrichment.article_text.get_data |
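As a sketch of this grouping step, the routing by domain name can be approximated with Python's standard library. Note that minall actually relies on ural's domain parsing; the platform set and subset names below are illustrative, mirroring the table above.

```python
from urllib.parse import urlparse

# Illustrative set of platforms treated as unscrapable social media content;
# the real routing is done with ural's domain parsing.
SOCIAL_MEDIA = {"facebook.com", "youtube.com", "twitter.com", "tiktok.com", "instagram.com"}

def get_domain(url: str) -> str:
    """Approximate ural's domain parsing with the standard library."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

def group_urls(urls: list) -> dict:
    """Sort URLs into the four enrichment subsets."""
    groups = {"facebook": [], "youtube": [], "other_social_media": [], "to_scrape": []}
    for url in urls:
        domain = get_domain(url)
        if domain == "facebook.com":
            groups["facebook"].append(url)
        elif domain == "youtube.com":
            groups["youtube"].append(url)
        elif domain in SOCIAL_MEDIA:
            groups["other_social_media"].append(url)
        else:
            groups["to_scrape"].append(url)
    return groups
```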
If the user has a Buzzsumo API token, all URLs, regardless of grouping by domain name, are searched in the Buzzsumo database.
All URLs
subset | dataclass | module |
---|---|---|
all URLs | Normalized Buzzsumo Exact URL Data | minall.enrichment.buzzsumo.get_data |
Due to the diversity of data available for different types of URLs and provided by different data sources, an important step in every enrichment procedure is normalizing the data. Each target URL's metadata, regardless of its domain name, must conform to the SQLite database's links table. This harmonization of data fields is an important feature of the minall workflow and is not (yet) replicated in minet.
The following table illustrates which of each data source's fields are matched to which column in the database's links table.
links SQL table | Normalized Scraped Web Page Data | Normalized Buzzsumo Exact URL Data | Normalized Facebook Post Data | Normalized YouTube Video Data | Normalized YouTube Channel Data | Normalized Tweet |
---|---|---|---|---|---|---|
url (TEXT) | X | X | X | X | X | X |
domain (TEXT) | X | X | X ("facebook.com") | X ("youtube.com") | X ("youtube.com") | X ("twitter.com") |
work_type (TEXT) | X ("WebPage") | X ("WebPage", "Article", "VideoObject") | X ("SocialMediaPosting", "ImageObject", "VideoObject") | X ("VideoObject") | X ("WebPage") | X ("SocialMediaPosting") |
duration (TEXT) | X | X | X | |||
identifier (TEXT) | X | X | X | X | ||
date_published (TEXT) | X | X | X | X | X | X |
date_modified (TEXT) | X | |||||
country_of_origin (TEXT) | X | |||||
abstract (TEXT) | X | X | X | |||
keywords (TEXT) | X | X | ||||
title (TEXT) | X | X | X | X | X | |
text (TEXT) | X | X | X | |||
hashtags (TEXT) | X | |||||
creator_type (TEXT) | X ("defacto:SocialMediaAccount") | X ("WebPage") | X ("defacto:SocialMediaAccount") |||
creator_date_created (TEXT) | X | X | ||||
creator_location_created (TEXT) | X | X | ||||
creator_identifier (TEXT) | X | X | X | X | ||
creator_facebook_follow (INTEGER) | ||||||
creator_facebook_subscribe (INTEGER) | X | |||||
creator_twitter_follow (INTEGER) | X | |||||
creator_youtube_subscribe (INTEGER) | X | |||||
creator_create_video (INTEGER) | X | |||||
creator_name (TEXT) | X | X | X | X | ||
creator_url (TEXT) | X | |||||
facebook_comment (INTEGER) | X | X | ||||
facebook_like (INTEGER) | X | |||||
facebook_share (INTEGER) | X | X | ||||
pinterest_share (INTEGER) | X | |||||
twitter_share (INTEGER) | X | X | ||||
tiktok_share (INTEGER) | X | |||||
tiktok_comment (INTEGER) | X | |||||
reddit_engagement (INTEGER) | X | |||||
youtube_watch (INTEGER) | X | X | ||||
youtube_comment (INTEGER) | X | |||||
youtube_like (INTEGER) | X | X | ||||
youtube_favorite (INTEGER) | ||||||
youtube_subscribe (INTEGER) | X | |||||
create_video (INTEGER) | X |
Note: no data field currently feeds creator_facebook_follow. The use of creator_facebook_follow still needs to be confirmed (Facebook accounts' FollowAction may have been made redundant by the SubscribeAction).
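The harmonization described above can be pictured as mapping each source's payload onto one shared record shape. The sketch below is hypothetical: the payload field names are invented stand-ins, and only a handful of the links columns are shown.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class NormalizedRecord:
    # A few of the shared 'links' columns; every data source must map into these.
    url: str
    domain: Optional[str] = None
    work_type: Optional[str] = None
    title: Optional[str] = None

def normalize_youtube_video(payload: dict) -> NormalizedRecord:
    # "title" is a hypothetical stand-in for the YouTube API's field name
    return NormalizedRecord(
        url=payload["url"],
        domain="youtube.com",
        work_type="VideoObject",
        title=payload.get("title"),
    )

def normalize_scraped_page(payload: dict) -> NormalizedRecord:
    return NormalizedRecord(
        url=payload["url"],
        domain=payload.get("domain"),
        work_type="WebPage",
        title=payload.get("title"),
    )

row = asdict(normalize_youtube_video({"url": "https://www.youtube.com/watch?v=x", "title": "Demo"}))
```

Whatever the source, the resulting dict has the same keys, so every record can be written to the same links table.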
minall.enrichment.enrichment
Class for data collection and coalescing.
With the class Enrichment
, this module manages the data collection process.
Enrichment
Source code in minall/enrichment/enrichment.py
__init__(links_table, shared_content_table, keys)
From given API keys and URL data set, filter URLs by domain and initialize data enrichment class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
links_table | BaseTable | BaseTable class instance of the SQL table for the URL dataset. | required |
shared_content_table | BaseTable | BaseTable class instance of the SQL table for shared content related to URLs in the dataset. | required |
keys | APIKeys | APIKeys class instance of minet API client configurations. | required |
Source code in minall/enrichment/enrichment.py
buzzsumo()
For all URLs, collect data from Buzzsumo and coalesce in the database's 'links' table.
Source code in minall/enrichment/enrichment.py
facebook()
For Facebook URLs, collect data from CrowdTangle and coalesce in the database's 'links' and 'shared_content' tables.
Source code in minall/enrichment/enrichment.py
other_social_media()
For select URLs, update the 'work_type' column in the database's 'links' table with the value 'SocialMediaPosting'.
Source code in minall/enrichment/enrichment.py
scraper()
For select URLs, collect data via scraping and coalesce in the database's 'links' table.
Source code in minall/enrichment/enrichment.py
twitter()
For Twitter URLs, scrape data from the site and coalesce in the database's 'links' and 'shared_content' tables.
Source code in minall/enrichment/enrichment.py
youtube()
For YouTube URLs, collect data from YouTube API and coalesce in the database's 'links' table.
Source code in minall/enrichment/enrichment.py
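The "coalesce" step that these methods mention can be illustrated with plain sqlite3: newly collected values fill empty columns without overwriting data already present in the links table. This is a sketch of the idea, not minall's actual SQL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE links (url TEXT PRIMARY KEY, title TEXT, domain TEXT)")
con.execute("INSERT INTO links VALUES ('https://example.com', 'Old title', NULL)")

# Upsert: existing non-NULL values win, NULL columns take the new value
con.execute(
    """
    INSERT INTO links (url, title, domain) VALUES (?, ?, ?)
    ON CONFLICT (url) DO UPDATE SET
        title = COALESCE(links.title, excluded.title),
        domain = COALESCE(links.domain, excluded.domain)
    """,
    ("https://example.com", "New title", "example.com"),
)
row = con.execute("SELECT title, domain FROM links").fetchone()
# 'Old title' is preserved; the empty domain column is filled in
```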
minall.enrichment.utils
Functions for data collection.
This module provides the following class and functions:
- get_domain(url) - Parse domain from URL string.
- apply_domain(url) - Generate SQL query to insert domain into table.
- FilteredLinks(table) - From SQL table, select subsets of URLs based on domain name.
FilteredLinks
Selects all URLs from SQL table and returns subsets.
Source code in minall/enrichment/utils.py
facebook: List[str]
property
List of URLs from Facebook.
Returns:
Type | Description |
---|---|
List[str] | List of URL strings. |
other_social: List[str]
property
List of URLs from other social media platforms.
Returns:
Type | Description |
---|---|
List[str] | List of URL strings. |
to_scrape: List[str]
property
List of URLs not from social media platforms.
Returns:
Type | Description |
---|---|
List[str] | List of URL strings. |
twitter: List[str]
property
List of URLs from Twitter.
Returns:
Type | Description |
---|---|
List[str] | List of URL strings. |
youtube: List[str]
property
List of URLs from YouTube.
Returns:
Type | Description |
---|---|
List[str] | List of URL strings. |
__init__(table)
Select and store all URLs from a target SQL table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table | BaseTable | Target SQL table. | required |
Source code in minall/enrichment/utils.py
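A minimal stand-in for this class, working on a plain list of URL strings instead of a BaseTable instance, behaves like this (the standard-library domain parsing approximates what ural does):

```python
from urllib.parse import urlparse

class FilteredLinksSketch:
    """Illustrative stand-in for FilteredLinks, backed by a list of URLs
    rather than an SQL table."""

    def __init__(self, urls):
        self.all_links = urls

    @staticmethod
    def _domain(url):
        netloc = urlparse(url).netloc.lower()
        return netloc[4:] if netloc.startswith("www.") else netloc

    @property
    def youtube(self):
        return [u for u in self.all_links if self._domain(u) == "youtube.com"]

    @property
    def twitter(self):
        return [u for u in self.all_links if self._domain(u) == "twitter.com"]
```

Each property re-filters the stored URL set, so every enrichment routine can request just the subset it knows how to handle.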
apply_domain(url)
Compose SQL query to update the domain column of a URL's row in the 'links' SQLite table.
Examples:
>>> apply_domain(url="https://www.youtube.com/channel/MkDocs")
("UPDATE links SET domain = 'youtube.com' WHERE url = 'https://www.youtube.com/channel/MkDocs'", 'youtube.com')
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL string. | required |
Returns:
Type | Description |
---|---|
Tuple[str \| None, str \| None] | If the domain was parsed, a tuple containing the SQL query and the domain name. |
Source code in minall/enrichment/utils.py
get_domain(url)
Parse the domain name of a given URL string.
Examples:
>>> get_domain(url="https://www.youtube.com/channel/MkDocs")
'youtube.com'
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL string. | required |
Returns:
Type | Description |
---|---|
str \| None | If successfully parsed, the domain name. |
Source code in minall/enrichment/utils.py
minall.enrichment.buzzsumo
Enrichment workflow's Buzzsumo data collection.
Modules exported by this package:
- normalizer: Dataclass to normalize minet's Buzzsumo result object.
- contexts: Context manager for the client's CSV writers, multi-threader, and progress bar.
- get_data: Function that runs the full Buzzsumo enrichment process.
- client: Wrapper for minet's Buzzsumo API client that normalizes minet's result.
minall.enrichment.buzzsumo.normalizer
Module contains constants for minet's Buzzsumo API client and a dataclass to normalize minet's Buzzsumo result.
NormalizedBuzzsumoResult
dataclass
Bases: TabularRecord
Dataclass to normalize minet's Buzzsumo API result.
Attributes:
Name | Type | Description |
---|---|---|
url | str | Target URL searched in the Buzzsumo database. |
work_type | str | Target URL's ontological subtype, i.e. "WebPage", "Article", "VideoObject". |
twitter_share | int | Number of times the target URL appeared on Twitter. |
facebook_share | int | Number of times the target URL appeared on Facebook. |
title | str | Title of the target URL's web content. |
date_published | datetime | Date the target URL's web content was published. |
pinterest_share | int | Number of times the target URL appeared on Pinterest. |
creator_name | str | Entity intellectually responsible for (author of) the target URL's web content. |
creator_identifier | str | If the target URL is a social media post, the platform's identifier for the author. |
duration | int | If the target URL is of a video, the video's duration. |
facebook_comment | int | Number of times Facebook users commented on a post containing the target URL. |
youtube_watch | int | If the target URL is of a YouTube video, number of times YouTube users watched the video. |
youtube_like | int | If the target URL is of a YouTube video, number of times YouTube users liked the video. |
tiktok_share | int | If the target URL is of TikTok content, number of shares on TikTok. |
tiktok_comment | int | If the target URL is of TikTok content, number of times TikTok users commented on the content. |
reddit_engagement | int | Number of times the target URL appeared on Reddit. |
Source code in minall/enrichment/buzzsumo/normalizer.py
from_payload(url, data)
classmethod
Parses minet's Buzzsumo result and creates normalized dataclass.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Target URL searched in Buzzsumo's database. | required |
data | BuzzsumoArticle \| None | If the target URL was found in the Buzzsumo database, minet's result; otherwise, None. | required |
Returns:
Name | Type | Description |
---|---|---|
NormalizedBuzzsumoResult | NormalizedBuzzsumoResult | Dataclass that normalizes minet's Buzzsumo data. |
Source code in minall/enrichment/buzzsumo/normalizer.py
parse_buzzsumo_type(data)
classmethod
Helper function for transforming Buzzsumo's content classification into Schema.org subtype.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | BuzzsumoArticle \| None | If the target URL was found in the Buzzsumo database, minet's result; otherwise, None. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Schema.org subtype for the web content. |
Source code in minall/enrichment/buzzsumo/normalizer.py
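The transformation can be pictured as follows. The `video` and `article` flags are hypothetical stand-ins for whatever signals the real BuzzsumoArticle object carries; only the fallback-to-"WebPage" shape mirrors the documented behavior.

```python
from typing import Optional

def parse_buzzsumo_type_sketch(data: Optional[dict]) -> str:
    """Map (assumed) Buzzsumo content flags to a Schema.org subtype."""
    # Fall back to the generic "WebPage" subtype when nothing more specific applies
    if data is None:
        return "WebPage"
    if data.get("video"):
        return "VideoObject"
    if data.get("article"):
        return "Article"
    return "WebPage"
```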
minall.enrichment.buzzsumo.client
Module containing a wrapper for minet's Buzzsumo API client.
BuzzsumoClient
Wrapper for minet's Buzzsumo API client.
Examples:
>>> import os
>>> wrapper = BuzzsumoClient(token=os.environ["BUZZSUMO_TOKEN"])
>>> url="https://archive.fosdem.org/2020/schedule/event/open_research_web_mining/"
>>> wrapper(url)
NormalizedBuzzsumoResult(url='https://archive.fosdem.org/2020/schedule/event/open_research_web_mining/', work_type='Article', domain='fosdem.org', twitter_share=0, facebook_share=0, title='FOSDEM 2020 - Empowering social scientists with web mining tools', date_published=datetime.datetime(2024, 1, 4, 15, 48, 1), pinterest_share=0, creator_name=None, creator_identifier=None, duration=None, facebook_comment=0, youtube_watch=None, youtube_like=None, youtube_comment=None, tiktok_share=None, tiktok_comment=None, reddit_engagement=0)
Source code in minall/enrichment/buzzsumo/client.py
__call__(url)
Executes minet's Buzzsumo API client on a URL and returns normalized data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Target URL. | required |
Returns:
Name | Type | Description |
---|---|---|
NormalizedBuzzsumoResult | NormalizedBuzzsumoResult | Dataclass that normalizes minet's Buzzsumo result. |
Source code in minall/enrichment/buzzsumo/client.py
__init__(token)
Creates an instance of minet's BuzzsumoAPIClient and sets values for the Buzzsumo API's required begin-date and end-date parameters.
Examples:
>>> wrapper = BuzzsumoClient(token="<TOKEN>")
>>> type(wrapper)
<class 'minall.enrichment.buzzsumo.client.BuzzsumoClient'>
>>> type(wrapper.client)
<class 'minet.buzzsumo.client.BuzzSumoAPIClient'>
Parameters:
Name | Type | Description | Default |
---|---|---|---|
token | str | Buzzsumo API token. | required |
Source code in minall/enrichment/buzzsumo/client.py
minall.enrichment.buzzsumo.get_data
Module containing a function that runs all of the Buzzsumo enrichment process.
get_buzzsumo_data(data, token, outfile)
Main function for writing Buzzsumo API results to a CSV file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | List[str] | List of URLs. | required |
token | str | Token for the Buzzsumo API. | required |
outfile | Path | Path to the CSV file in which to write results. | required |
Source code in minall/enrichment/buzzsumo/get_data.py
minall.enrichment.buzzsumo.contexts
Module containing contexts for Buzzsumo data collection's CSV writer, progress bar, and multi-threader.
GeneratorContext
Source code in minall/enrichment/buzzsumo/contexts.py
__enter__()
Start the wrapper's context variables.
Returns:
Type | Description |
---|---|
Tuple[Progress, ThreadPoolExecutor] | Context variables. |
Source code in minall/enrichment/buzzsumo/contexts.py
__exit__(exc_type, exc_val, exc_tb)
Stop the Buzzsumo client wrapper's context variables.
Source code in minall/enrichment/buzzsumo/contexts.py
__init__()
Set up class for Buzzsumo client wrapper's contexts.
Source code in minall/enrichment/buzzsumo/contexts.py
WriterContext
Source code in minall/enrichment/buzzsumo/contexts.py
__enter__()
Start the CSV writer's context.
Returns:
Type | Description |
---|---|
csv.DictWriter | Context variable for writing CSV rows. |
Source code in minall/enrichment/buzzsumo/contexts.py
__exit__(exc_type, exc_val, exc_tb)
Stops the writer's context variable.
Source code in minall/enrichment/buzzsumo/contexts.py
__init__(links_file)
Set up class for iteratively writing normalized Buzzsumo results to CSV.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
links_file | Path | Path to the links table CSV file. | required |
Source code in minall/enrichment/buzzsumo/contexts.py
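The pattern these two context managers support, a thread pool streaming results into a csv.DictWriter, can be sketched without Buzzsumo at all. `fetch` is a hypothetical stand-in for the client call, and an in-memory buffer stands in for the links CSV file.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> dict:
    # Stand-in for the API client call; a real run would hit Buzzsumo here
    return {"url": url, "twitter_share": 0}

buffer = io.StringIO()  # a real run would open the links CSV file instead
writer = csv.DictWriter(buffer, fieldnames=["url", "twitter_share"])
writer.writeheader()

urls = ["https://a.example/1", "https://b.example/2"]
with ThreadPoolExecutor(max_workers=2) as executor:
    # executor.map preserves input order, so rows stay aligned with urls
    for row in executor.map(fetch, urls):
        writer.writerow(row)

lines = buffer.getvalue().splitlines()
```

Keeping a single writer outside the pool means the worker threads only produce dicts; all CSV output happens in one place.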
minall.enrichment.twitter
minall.enrichment.twitter.normalizer
Module contains constants for minet's Twitter Guest API Scraper and a dataclass to normalize minet's result.
NormalizedSharedLink
dataclass
Bases: TabularRecord
Dataclass to normalize media embedded in a Tweet.
Attributes:
Name | Type | Description |
---|---|---|
post_url | str | URL of the Tweet in which the content is embedded. |
content_url | str | URL of the embedded content. |
media_type | str | Content URL's ontological subtype, i.e. "VideoObject". |
height | int \| None | Height of the embedded visual media content. Default = None. |
width | int \| None | Width of the embedded visual media content. Default = None. |
Source code in minall/enrichment/twitter/normalizer.py
NormalizedTweet
dataclass
Bases: TabularRecord
Dataclass to normalize minet's Twitter API result.
Attributes:
Name | Type | Description |
---|---|---|
url | str | Target URL of the Tweet searched on Twitter. |
identifier | Optional[str] | Tweet ID. Default = None. |
date_published | Optional[datetime] | UTC timestamp of the Tweet's publication, parsed as a DateTime object. Default = None. |
text | Optional[str] | Text of the Tweet. Default = None. |
creator_date_created | Optional[datetime] | UTC timestamp of the user account's creation, parsed as a DateTime object. Default = None. |
creator_identifier | Optional[str] | User account ID. Default = None. |
creator_twitter_follow | Optional[int] | Number of accounts that follow the user account that published the Tweet. Default = None. |
creator_name | Optional[str] | Name of the user account. Default = None. |
twitter_share | Optional[int] | Number of times the Tweet was retweeted. Default = None. |
twitter_like | Optional[int] | Number of times the Tweet was liked. Default = None. |
domain | Optional[str] | Domain name of the target URL. Default = "twitter.com". |
work_type | str | Target URL's ontological subtype. Default = "SocialMediaPosting". |
hashtags | List | Hashtags embedded in the Tweet. Default = []. |
creator_type | str | Ontological subtype of the user account. Default = "defacto:SocialMediaAccount". |
Source code in minall/enrichment/twitter/normalizer.py
parse_shared_content(url, tweet)
Parse the potentially multiple media embedded in a Tweet and yield each one.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL of the Tweet with embedded media. | required |
tweet | Dict \| None | Full metadata of the Tweet. | required |
Yields:
Type | Description |
---|---|
NormalizedSharedLink \| None | If the media could be parsed, its normalized metadata. |
Source code in minall/enrichment/twitter/normalizer.py
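Assuming, hypothetically, that the tweet payload exposes its media as a list of dicts (the "media", "url", and "type" keys below are invented, not minet's actual schema), the generator's behavior looks like:

```python
from typing import Dict, Generator, Optional

def parse_shared_content_sketch(
    url: str, tweet: Optional[Dict]
) -> Generator[Dict, None, None]:
    """Yield one normalized record per media item embedded in a Tweet."""
    # Yield nothing when the Tweet could not be scraped
    if not tweet:
        return
    for media in tweet.get("media", []):
        yield {
            "post_url": url,
            "content_url": media.get("url"),
            "media_type": "VideoObject" if media.get("type") == "video" else "ImageObject",
        }
```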
minall.enrichment.twitter.scraper
Module contains the TweetScraper wrapper for setting up and calling minet's TweetGuestAPIScraper.
TweetScraper
Wrapper for running minet's Tweet Guest API Scraper.
Examples:
>>> scraper = TweetScraper()
>>>
>>> tweet_url = "https://twitter.com/Paris2024/status/1551605445156012038"
>>>
>>> url, tweet = scraper(tweet_url)
>>>
>>> url
'https://twitter.com/Paris2024/status/1551605445156012038'
>>>
>>> tweet["local_time"]
'2022-07-25T16:29:00'
Source code in minall/enrichment/twitter/scraper.py
__call__(url)
If the URL is of a Tweet and the ID can be parsed, scrape and return data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL of the Tweet. | required |
Returns:
Type | Description |
---|---|
Tuple[str, Dict \| None] | If data could be scraped, the target URL and the data; otherwise, the unsuccessful URL and None. |
Source code in minall/enrichment/twitter/scraper.py
__init__()
Set up minet's Twitter Guest API Scraper
Source code in minall/enrichment/twitter/scraper.py
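The ID parsing that the wrapper depends on can be sketched with a regular expression. This is illustrative only; minet has its own Twitter URL parsing.

```python
import re
from typing import Optional

def parse_tweet_id(url: str) -> Optional[str]:
    """Extract the numeric status ID from a Tweet URL, if present."""
    # A Tweet URL carries its numeric status ID after "/status/"
    match = re.search(r"/status/(\d+)", url)
    return match.group(1) if match else None
```

When no ID can be parsed, the wrapper can skip the scrape and return the URL paired with None, matching the `__call__` contract above.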
minall.enrichment.twitter.get_data
Module contains functions for collecting, normalizing, and writing data from Twitter.
get_twitter_data(data, links_outfile, shared_content_outfile)
Transforms a set of Twitter URLs into collected Tweet metadata, written to CSV files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | List[str] | Set of Twitter URLs. | required |
links_outfile | Path | Path to the CSV file for Tweet metadata. | required |
shared_content_outfile | Path | Path to the CSV file for metadata about links in Tweets. | required |
Source code in minall/enrichment/twitter/get_data.py
minall.enrichment.crowdtangle
Enrichment workflow's CrowdTangle data collection.
Modules exported by this package:
- normalizer: Dataclass to normalize minet's CrowdTangle result object.
- contexts: Context manager for the client's CSV writers, multi-threader, and progress bar.
- get_data: Function that runs the full CrowdTangle enrichment process.
- client: Wrapper for minet's CrowdTangle API client that normalizes minet's result.
- exceptions:
minall.enrichment.crowdtangle.normalizer
Module contains functions and dataclasses to normalize minet's CrowdTangle API client result.
NormalizedFacebookPost
dataclass
Bases: TabularRecord
Dataclass to normalize minet's CrowdTangle API result for Facebook posts.
Attributes:
Name | Type | Description |
---|---|---|
url | str | Target Facebook URL. |
work_type | str | Target Facebook content's ontological subtype, i.e. "SocialMediaPosting", "ImageObject", "VideoObject". |
duration | str | If the Facebook content is a video, the video's duration. |
identifier | str | Facebook's identifier for the post. |
date_published | str | Date of the Facebook post's publication. |
date_modified | str | Date when the Facebook post was last modified. |
title | str | If applicable, title of the Facebook post content. |
abstract | str | If applicable, description of the Facebook post content. |
text | str | If applicable, text of the Facebook post content. |
creator_identifier | str | Facebook's identifier for the post's creator. |
creator_name | str | Name of the entity responsible for the Facebook post's publication. |
creator_location_created | str | If available, principal country in which the entity responsible for the post's publication is located. |
creator_url | str | URL for the entity responsible for the Facebook post's publication. |
creator_facebook_subscribe | int | Number of Facebook accounts subscribed to the account of the entity responsible for the Facebook post's publication. |
facebook_comment | int | Number of comments on the Facebook post. |
facebook_like | int | Number of Facebook accounts that have liked the Facebook post. |
facebook_share | int | Number of times the Facebook post has been shared on Facebook. |
domain | str | Domain for the Facebook post's URL. Default = "facebook.com". |
creator_type | str | Ontological subtype for the Facebook post's creator. Default = "defacto:SocialMediaAccount". |
Source code in minall/enrichment/crowdtangle/normalizer.py
from_payload(url, result)
classmethod
Parses minet's CrowdTangle result and creates normalized dataclass.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Target Facebook URL. | required |
result | CrowdTanglePost | Result object returned from minet's CrowdTangle API client. | required |
Returns:
Name | Type | Description |
---|---|---|
NormalizedFacebookPost | NormalizedFacebookPost | Dataclass that normalizes minet's CrowdTangle data. |
Source code in minall/enrichment/crowdtangle/normalizer.py
NormalizedSharedContent
dataclass
Bases: TabularRecord
Dataclass for normalizing data about media content shared in a Facebook post.
Attributes:
Name | Type | Description |
---|---|---|
post_url | str | Target Facebook URL, which shared the media content. |
content_url | str | CrowdTangle's URI for the shared media. |
media_type | str | Ontological subtype for the shared media, i.e. "ImageObject". |
height | int \| None | If available, the height in pixels of the shared media. |
width | int \| None | If available, the width in pixels of the shared media. |
Source code in minall/enrichment/crowdtangle/normalizer.py
from_payload(url, media)
classmethod
Parses JSON data in CrowdTanglePost's "media" attribute.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL of the Facebook post that contains the shared media. | required |
media | dict | JSON in the CrowdTanglePost's "media" attribute. | required |
Returns:
Name | Type | Description |
---|---|---|
NormalizedSharedContent | NormalizedSharedContent | Dataclass that normalizes information about the Facebook post's shared media content. |
Source code in minall/enrichment/crowdtangle/normalizer.py
parse_media_type(type)
classmethod
Helper function to transform CrowdTangle's media classification into Schema.org's CreativeWork subtype.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
type | str \| None | If available, CrowdTangle's classification of the media object. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Schema.org CreativeWork subtype. |
Source code in minall/enrichment/crowdtangle/normalizer.py
parse_facebook_post(url, result)
Transform minet's CrowdTanglePost object into normalized data as CSV dict row.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Target Facebook URL. | required |
result | CrowdTanglePost \| None | If the CrowdTangle API returned a match, minet's CrowdTangle API result object. | required |
Returns:
Name | Type | Description |
---|---|---|
Dict | Dict | Normalized data for the Facebook post. |
Source code in minall/enrichment/crowdtangle/normalizer.py
parse_shared_content(url, result)
Generator that streams the "media" attribute from minet's CrowdTanglePost object and returns normalized data as CSV dict row.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Target Facebook URL. | required |
result | CrowdTanglePost | minet's CrowdTangle API client result object. | required |
Yields:
Type | Description |
---|---|
Dict | Formatted CSV dict row of normalized shared content data. |
Source code in minall/enrichment/crowdtangle/normalizer.py
minall.enrichment.crowdtangle.client
Module contains a client and helper functions for collecting data from CrowdTangle.
CTClient
Wrapper for minet's CrowdTangle API client with helper function for parsing Facebook post ID.
Source code in minall/enrichment/crowdtangle/client.py
__call__(url)
Execute collection of CrowdTangle data from parsed Facebook post ID.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Target Facebook URL. | required |
Returns:
Type | Description |
---|---|
Tuple[str, CrowdTanglePost \| None] | Target URL and, if successful, minet's CrowdTanglePost result object. |
Source code in minall/enrichment/crowdtangle/client.py
__init__(token, rate_limit)
Create instance of minet's CrowdTangle API client.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
token | str | CrowdTangle API token. | required |
rate_limit | int | CrowdTangle API rate limit. | required |
Source code in minall/enrichment/crowdtangle/client.py
adhoc_post_id_parser(url)
Helper function to catch and fix problems parsing Facebook post ID.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Target Facebook URL. | required |
Returns:
Type | Description |
---|---|
str \| None | If successful, the post ID for the target Facebook URL. |
Source code in minall/enrichment/crowdtangle/client.py
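As an illustration of the kind of fallback parsing involved, the sketch below pulls a trailing numeric ID out of common Facebook URL shapes. The URL patterns and the returned ID format are assumptions, not CrowdTangle's actual requirements.

```python
import re
from typing import Optional

def adhoc_post_id_sketch(url: str) -> Optional[str]:
    """Pull the trailing numeric ID from /posts/ and /videos/ URL shapes."""
    match = re.search(r"/(?:posts|videos)/(?:[^/]+/)?(\d+)", url)
    return match.group(1) if match else None
```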
parse_rate_limit(rate_limit)
Set default or convert rate limit string to integer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
rate_limit | int \| str \| None | Value of the rate limit for the CrowdTangle API. | required |
Returns:
Name | Type | Description |
---|---|---|
int | int | Converted rate-limit integer for the CrowdTangle API. |
Source code in minall/enrichment/crowdtangle/client.py
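Its behavior presumably resembles the following sketch; the default value here is arbitrary, not minall's actual default.

```python
from typing import Union

DEFAULT_RATE_LIMIT = 6  # placeholder; minall defines its own default

def parse_rate_limit_sketch(rate_limit: Union[int, str, None]) -> int:
    """Fall back to a default when unset; coerce strings like '10' to int."""
    if rate_limit is None:
        return DEFAULT_RATE_LIMIT
    return int(rate_limit)
```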
minall.enrichment.crowdtangle.get_data
Module contains functions for collecting, normalizing, and writing data from CrowdTangle.
get_facebook_post_data(data, token, rate_limit, links_outfile, shared_content_outfile)
Function to collect, normalize, and write data from CrowdTangle.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | List[str] | Set of target Facebook URLs. | required |
token | str | CrowdTangle API token. | required |
rate_limit | int \| str \| None | CrowdTangle API rate limit. | required |
links_outfile | Path | Path to the CSV file for Facebook post metadata. | required |
shared_content_outfile | Path | Path to the CSV file for shared content metadata. | required |
Source code in minall/enrichment/crowdtangle/get_data.py
yield_facebook_data(data, token, rate_limit)
Streams target Facebook URLs to multi-threading context and yields minet's results.
Parameters:

Name | Type | Description | Default
---|---|---|---
`data` | `List[str]` | Set of target Facebook URLs. | *required*
`token` | `str` | CrowdTangle API token. | *required*
`rate_limit` | `int` | CrowdTangle API rate limit. | *required*
Yields:

Type | Description
---|---
`Generator[Tuple[str, CrowdTanglePost \| None], None, None]` | Target Facebook URL and, if available, result of minet's CrowdTangle API client.
Source code in minall/enrichment/crowdtangle/get_data.py
minall.enrichment.crowdtangle.contexts
Context manager for CrowdTangle's CSV writers, multi-threader, and progress bar.
ContextManager
Source code in minall/enrichment/crowdtangle/contexts.py
__enter__()
Start the module's context variables.
Returns:

Type | Description
---|---
`Tuple[DictWriter, DictWriter, Progress]` | CSV writer for post metadata, CSV writer for shared content metadata, rich progress bar.
Source code in minall/enrichment/crowdtangle/contexts.py
__exit__(exc_type, exc_val, exc_tb)
Stop the module's context variables.
Source code in minall/enrichment/crowdtangle/contexts.py
__init__(links_file, shared_content_file)
Set up class for the module's contexts.
Parameters:

Name | Type | Description | Default
---|---|---|---
`links_file` | `Path` | Path to CSV file for post metadata. | *required*
`shared_content_file` | `Path` | Path to CSV file for posts' shared content metadata. | *required*
Source code in minall/enrichment/crowdtangle/contexts.py
minall.enrichment.crowdtangle.exceptions
Exceptions raised during data collection from CrowdTangle API.
This module contains exceptions raised during data collection from CrowdTangle API. The module contains the following exceptions:
- `NoPostID` - Neither minet nor minall's adhoc parser successfully recovered the Facebook post's ID.
- `PostNotfound` - CrowdTangle did not return a post matching the given post ID.
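A sketch of how these exceptions might be defined and handled. Only the exception names come from the documentation above; the class bodies and the `lookup_post` caller are hypothetical.

```python
class NoPostID(Exception):
    """Neither minet nor minall's adhoc parser recovered the post's ID."""


class PostNotfound(Exception):
    """CrowdTangle did not return a post matching the given post ID."""


def lookup_post(post_id):
    """Hypothetical caller illustrating where each exception would be raised."""
    if post_id is None:
        # Parsing failed upstream: no ID to query with.
        raise NoPostID
    # ...query CrowdTangle here; raise PostNotfound if nothing comes back...
    return {"id": post_id}
```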
minall.enrichment.youtube
Enrichment workflow's YouTube data collection.
Modules exported by this package:
- `normalizer`: Dataclass to normalize minet's YouTube result objects.
- `context`: Context manager for client's CSV writer and progress bar.
- `get_data`: Function that runs all of the YouTube enrichment process.
minall.enrichment.youtube.normalizer
Module contains dataclasses to normalize minet's YouTube Video and Channel result objects.
NormalizedYouTubeChannel
dataclass
Bases: TabularRecord
Dataclass to normalize minet's YoutubeChannel result object.
Attributes:

Name | Type | Description
---|---|---
`url` | `str` | Target YouTube channel URL.
`identifier` | `str` | YouTube's unique identifier for the channel.
`date_published` | `str` | Date when the channel was created.
`country_of_origin` | `str` | Primary country in which the channel publishes content.
`title` | `str` | Name of the channel.
`abstract` | `str` | Description of the channel.
`keywords` | `List[str]` | List of keywords attributed to the channel.
`youtube_subscribe` | `int` | Number of YouTube users who subscribe to the channel.
`create_video` | `int` | Number of videos the channel has published.
`domain` | `str` | Domain of target URL. Default = "youtube.com".
`work_type` | `str` | Ontological subtype of target web content. Default = "WebPage".
Source code in minall/enrichment/youtube/normalizer.py
from_payload(url, channel_result)
classmethod
Parses minet's channel result and creates a normalized dataclass.
Parameters:

Name | Type | Description | Default
---|---|---|---
`url` | `str` | Target YouTube channel URL. | *required*
`channel_result` | `YouTubeChannel` | minet's channel results for the target channel URL. | *required*
Returns:

Name | Type | Description
---|---|---
`NormalizedYouTubeChannel` | `NormalizedYouTubeChannel` | Dataclass that normalizes minet's channel results.
Source code in minall/enrichment/youtube/normalizer.py
NormalizedYouTubeVideo
dataclass
Bases: TabularRecord
Dataclass to normalize minet's YoutubeVideo result object.
Attributes:

Name | Type | Description
---|---|---
`url` | `str` | Target YouTube video URL.
`identifier` | `str` | YouTube's unique identifier for the video.
`date_published` | `str` | Date the video was published on YouTube.
`duration` | `str` | Duration of the video.
`title` | `str` | Title of the video.
`abstract` | `str` | Video's description.
`keywords` | `List` | List of keywords applied to video.
`youtube_watch` | `str` | Number of users who have watched the YouTube video.
`youtube_comment` | `str` | Number of users who have commented on the YouTube video.
`youtube_like` | `str` | Number of users who have liked the YouTube video.
`creator_type` | `str` | Ontological subtype of the video's channel.
`creator_name` | `str` | Name of the video's channel.
`creator_date_created` | `str` | Date when the video's channel was created.
`creator_location_created` | `str` | Primary country in which the video's channel publishes content.
`creator_identifier` | `str` | YouTube's unique identifier for the video's channel.
`creator_youtube_subscribe` | `str` | Number of YouTube accounts that subscribe to the video's channel.
`creator_create_video` | `str` | Number of videos the video's channel has published.
`domain` | `str` | Domain of target URL. Default = "youtube.com".
`work_type` | `str` | Ontological subtype of target web content. Default = "VideoObject".
Source code in minall/enrichment/youtube/normalizer.py
from_payload(url, channel_result, video_result)
classmethod
Parses minet's data for both a video and channel and creates a normalized dataclass.
Parameters:

Name | Type | Description | Default
---|---|---|---
`url` | `str` | Target YouTube video URL. | *required*
`channel_result` | `YouTubeChannel \| None` | minet's channel results containing metadata about the target video's channel. | *required*
`video_result` | `YouTubeVideo` | minet's video results for the target video. | *required*
Returns:

Name | Type | Description
---|---|---
`NormalizedYouTubeVideo` | `NormalizedYouTubeVideo` | Dataclass that normalizes and merges video and channel results.
Source code in minall/enrichment/youtube/normalizer.py
ParsedLink
Class to store up-to-date metadata about target YouTube URL.
This class's instance variables will be updated during the data collection process to reflect the target URL's video and/or channel metadata. If the target URL is of a video, its `ParsedLink` class instance should eventually be mutated to have a value in both the `video_result` and `channel_result` attributes, because 2 API calls will be made. If the target URL is of a channel, its `ParsedLink` class instance should eventually be mutated to have a value in the `channel_result` attribute, after 1 call to the YouTube API's channels endpoint.
Source code in minall/enrichment/youtube/normalizer.py
__init__(url)
Determine type of YouTube web content.
Parameters:

Name | Type | Description | Default
---|---|---|---
`url` | `str` | Target YouTube URL. | *required*
Attributes:

Name | Type | Description
---|---|---
`link_id` | `str` | Target YouTube URL.
`type` | `YoutubeChannel \| YoutubeVideo \| Any` | Result of ural's
`video_id` | `str \| None` | If a parsed YouTube type is a video, the result's
`channel_id` | `str \| None` | If a parsed YouTube type is a channel, the result's
`video_result` | `None` | Empty class instance variable for later storing minet's video result object.
`channel_result` | `None` | Empty class instance variable for later storing minet's channel result object.
Source code in minall/enrichment/youtube/normalizer.py
normalize(parsed_link)
Normalize minet result objects stored in instance variables of the `ParsedLink` class.
Parameters:

Name | Type | Description | Default
---|---|---|---
`parsed_link` | `ParsedLink` | Class instance with minet's YouTube API results. | *required*
Returns:

Name | Type | Description
---|---|---
`dict` | `dict` | Dictionary to be added to CSV row for 'links' SQL table.
Source code in minall/enrichment/youtube/normalizer.py
minall.enrichment.youtube.get_data
Module contains function to manage process of collecting and normalizing data about YouTube web content.
get_youtube_data(data, keys, outfile)
Collects and writes metadata about target YouTube videos and channels to a CSV file that will be inserted into 'links' SQL table.
Parameters:

Name | Type | Description | Default
---|---|---|---
`data` | `list[str]` | Set of target YouTube URLs. | *required*
`keys` | `list[str]` | Set of keys for YouTube API. | *required*
`outfile` | `Path` | Path to CSV file for 'links' SQL table. | *required*
Source code in minall/enrichment/youtube/get_data.py
minall.enrichment.youtube.context
Module containing contexts for YouTube data collection's CSV writer and progress bar.
ProgressBar
Context for rich progress bar.
Source code in minall/enrichment/youtube/context.py
__enter__()
Start the rich progress bar.
Returns:

Name | Type | Description
---|---|---
`Progress` | `Progress` | Context variable for rich progress bar.
Source code in minall/enrichment/youtube/context.py
__exit__(exc_type, exc_val, exc_tb)
Stops the progress bar's context variable.
Source code in minall/enrichment/youtube/context.py
Writer
Context for writing YouTube links metadata to CSV of 'links' SQL table.
Source code in minall/enrichment/youtube/context.py
__enter__()
Start the CSV writer's context.
Returns:

Type | Description
---|---
`DictWriter` | csv.DictWriter: Context variable for writing CSV rows.
Source code in minall/enrichment/youtube/context.py
__exit__(exc_type, exc_val, exc_tb)
Stops the writer's context variable.
Source code in minall/enrichment/youtube/context.py
__init__(links_file)
Set up class for iteratively writing normalized YouTube results to CSV.
Parameters:

Name | Type | Description | Default
---|---|---|---
`links_file` | `Path` | Path to the links table CSV file. | *required*
Source code in minall/enrichment/youtube/context.py
minall.enrichment.other_social_media
Module to update ontological subtype for social media posts whose data is not accessible.
minall.enrichment.other_social_media.add_data
Module contains function to write web content ontological subtype information to CSV.
The module contains a function that writes the ontological subtype "SocialMediaPosting" and the related target URL to a CSV, which will be inserted into the 'links' SQL table.
add_data(data, outfile)
For the set of target URLs, write the URL and the category "SocialMediaPosting" to a CSV row for insert into the 'links' SQL table.
Parameters:

Name | Type | Description | Default
---|---|---|---
`data` | `List[str]` | Target URLs. | *required*
`outfile` | `Path` | Path to CSV file for links. | *required*
Source code in minall/enrichment/other_social_media/add_data.py
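The documented behavior is simple enough to sketch directly with the standard library's `csv` module; the column names `url` and `work_type` are assumptions for illustration, not necessarily minall's actual header row.

```python
import csv
from pathlib import Path
from typing import List


def add_data(data: List[str], outfile: Path) -> None:
    """For each target URL, write a CSV row pairing the URL with the
    ontological subtype 'SocialMediaPosting', as documented above.
    (Column names are an assumption for illustration.)"""
    with open(outfile, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "work_type"])
        writer.writeheader()
        for url in data:
            writer.writerow({"url": url, "work_type": "SocialMediaPosting"})
```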
minall.enrichment.article_text
Enrichment workflow's HTML scraping features.
Modules exported by this package:
- `normalizer`: Dataclass to normalize minet's Trafilatura result object.
- `contexts`: Context manager for scraper's CSV writers, multi-threader, and progress bar.
- `get_data`: Function that runs all of the scraping process.
- `scraper`: Class and helper function for scraping HTML.
minall.enrichment.article_text.normalizer
Dataclass to normalize minet's Trafilatura result object.
NormalizedScrapedWebPage
dataclass
Bases: TabularRecord
Dataclass to normalize minet's Trafilatura result object.
Attributes:

Name | Type | Description
---|---|---
`url` | `str` | URL targeted for scraping.
`title` | `str \| None` | Title scraped from HTML.
`text` | `str \| None` | Main text scraped from HTML.
`date_published` | `str \| None` | Date scraped from HTML.
`work_type` | `str` | Target URL's ontological subtype. Default = "WebPage".
Source code in minall/enrichment/article_text/normalizer.py
minall.enrichment.article_text.scraper
Class and helper function for scraping HTML.
This module's `Scraper` class enhances minet's `request()` and `extract()` methods by providing additional support for unexpected HTML encodings.

1. Uses minet's `request()` method on a target URL to get a `Response` object.
2. Verifies that the `Response` object is encoded in some form of UTF-8.
3. Extracts the HTML body from the `Response`. [`text = response.text()`]
4. Uses bs4's fool-proof `UnicodeDammit` to parse the exact encoding. [`UnicodeDammit(text, "html.parser").declared_html_encoding`]
5. Gives the encoding to bs4's `BeautifulSoup` to parse the HTML.
6. Gives the `BeautifulSoup` result to minet's `extract()` method in order to return minet's `TrafilaturaResult` object.
Scraper
Class to manage HTML scraping.
Examples:
>>> scraper = Scraper()
>>> url, result = scraper(url='https://zenodo.org/records/7974793')
>>> url == result.canonical_url
True
>>> result.title
'Minet, a webmining CLI tool & library for python.'
Source code in minall/enrichment/article_text/scraper.py
__call__(url)
Requests and scrapes HTML, returning minet's Trafilatura Result object.
Parameters:

Name | Type | Description | Default
---|---|---|---
`url` | `str` | Target URL. | *required*
Returns:

Type | Description
---|---
`Tuple[str, TrafilaturaResult \| None]` | The target URL and, if scraping was successful, minet's Trafilatura Result object.
Source code in minall/enrichment/article_text/scraper.py
__init__(progress=None, total=None)
If provided the context of a rich progress bar, save it to the class instance and add the task 'Scraping webpage'.
Parameters:

Name | Type | Description | Default
---|---|---|---
`progress` | `Progress \| None` | Context of a rich progress bar instance. Defaults to None. | `None`
`total` | `int \| None` | Total number of items treated during progress context. Defaults to None. | `None`
Source code in minall/enrichment/article_text/scraper.py
good_response(response)
Verifies that the response that minet's request method returned is valid for scraping.
Parameters:

Name | Type | Description | Default
---|---|---|---
`response` | `Response` | Response object returned from minet's request method. | *required*
Returns:

Type | Description
---|---
`Response \| None` | If valid, the Response; otherwise None.
Source code in minall/enrichment/article_text/scraper.py
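A sketch of this validity check under stated assumptions: minet's `Response` is replaced by a minimal stand-in, and the exact criteria (HTTP status 200 plus a UTF-8 family encoding, per step 2 of the scraping pipeline above) are illustrative rather than minall's precise rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    """Minimal stand-in for minet's Response object."""
    status: int
    encoding: str


def good_response(response: Response) -> Optional[Response]:
    """Return the response if it looks scrapable: a successful status
    and an encoding in the UTF-8 family; otherwise None. (Illustrative
    criteria, not necessarily minall's exact checks.)"""
    if response.status == 200 and response.encoding.lower().replace("-", "").startswith("utf8"):
        return response
    return None
```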
minall.enrichment.article_text.get_data
Module contains a function that runs the scraping feature.
get_data(data, outfile)
Iterating through the target URLs, scrape data and write to out-file.
Parameters:

Name | Type | Description | Default
---|---|---|---
`data` | `list[str]` | Set of target URLs for scraping. | *required*
`outfile` | `Path` | Path to CSV file for writing normalized results. | *required*
Source code in minall/enrichment/article_text/get_data.py
minall.enrichment.article_text.contexts
Context manager for scraper's CSV writers, multi-threader, and progress bar.
ContextManager
Source code in minall/enrichment/article_text/contexts.py
__enter__()
Start the scraper's context variables.
Returns:

Type | Description
---|---
`Tuple[DictWriter, ThreadPoolExecutor, Progress]` | Context variables: CSV writer, multi-threader, and rich progress bar.
Source code in minall/enrichment/article_text/contexts.py
__exit__(exc_type, exc_val, exc_tb)
Stop the scraper's context variables.
Source code in minall/enrichment/article_text/contexts.py
__init__(links_file)
Set up class for scraper's contexts.
Parameters:

Name | Type | Description | Default
---|---|---|---
`links_file` | `Path` | Path to out-file for CSV writer. | *required*
Source code in minall/enrichment/article_text/contexts.py