Enrichment tools

The enrichment workflow uses scraping and minet's API clients to collect data about URLs and their associated shared content, storing results in the SQLite database's links table and shared_content table, respectively. The collection method minall deploys depends on the URL's domain name.

The URLs in the SQLite database's links table are parsed with the ural Python library and grouped into the following subsets (a small sketch of this grouping follows the table):

Subsets of URLs

| subset | dataclass | module |
| --- | --- | --- |
| URLs from Facebook | Normalized Facebook Post Data | minall.enrichment.crowdtangle.get_data |
| URLs from YouTube | Normalized YouTube Video Data, Normalized YouTube Channel Data | minall.enrichment.youtube.get_data |
| URLs from other social media platforms (because they cannot be scraped) | NA | minall.enrichment.other_social_media.add_data |
| URLs not from social media platforms (they can be scraped) | Normalized Scraped Web Page Data | minall.enrichment.article_text.get_data |
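A minimal sketch of this grouping, using the same ural predicates that minall's FilteredLinks class (shown further below) relies on; the example URLs are hypothetical:

from ural.facebook import is_facebook_url
from ural.twitter import is_twitter_url
from ural.youtube import is_youtube_url

# Hypothetical input URLs
urls = [
    "https://www.facebook.com/someuser/posts/123",
    "https://www.youtube.com/watch?v=abc123",
    "https://twitter.com/someuser/status/456",
    "https://www.example.com/article",
]

facebook = [u for u in urls if is_facebook_url(u)]  # routed to crowdtangle.get_data
youtube = [u for u in urls if is_youtube_url(u)]    # routed to youtube.get_data
twitter = [u for u in urls if is_twitter_url(u)]
# URLs outside the recognized social platforms are scraped instead
to_scrape = [
    u
    for u in urls
    if not (is_facebook_url(u) or is_youtube_url(u) or is_twitter_url(u))
]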

If the user has a Buzzsumo API token, all URLs, regardless of grouping by domain name, are searched in the Buzzsumo database.

All URLs

| subset | dataclass | module |
| --- | --- | --- |
| all URLs | Normalized Buzzsumo Exact URL Data | minall.enrichment.buzzsumo.get_data |

Because the data available differs by URL type and by data source, an important step in every enrichment procedure is normalizing the data. Each target URL's metadata, regardless of its domain name, must conform to the schema of the SQLite database's links table. This harmonization of data fields is an important feature of the minall workflow and is not (yet) replicated in minet.

The following table illustrates which of each data source's data fields are matched to which column in the database's links table.

links SQL table Normalized Scraped Web Page Data Normalized Buzzsumo Exact URL Data Normalized Facebook Post Data Normalized YouTube Video Data Normalized YouTube Channel Data Normalized Tweet
url (TEXT) X X X X X X
domain (TEXT) X X ("facebook.com") X ("youtube.com") X ("youtube.com") X ("twitter.com")
work_type (TEXT) X ("WebPage") X ("WebPage", "Article", "VideoObject") X ("SocialMediaPosting", "ImageObject", "VideoObject") X ("VideoObject") X ("WebPage") X ("SocialMediaPosting")
duration (TEXT) X X X
identifier (TEXT) X X X X
date_published (TEXT) X X X X X X
date_modified (TEXT) X
country_of_origin (TEXT) X
abstract (TEXT) X X X
keywords (TEXT) X X
title (TEXT) X X X X X
text (TEXT) X X X
hashtags (TEXT) X
creator_type (TEXT) X ("defacto:SocialMediaAccount") X ("WebPage") X ("defacto:SocialMediaAccount")
creator_date_created (TEXT) X X
creator_location_created (TEXT) X X
creator_identifier (TEXT) X X X X
creator_facebook_follow (INTEGER)
creator_facebook_subscribe (INTEGER) X
creator_twitter_follow (INTEGER) X
creator_youtube_subscribe (INTEGER) X
creator_create_video (INTEGER) X
creator_name (TEXT) X X X X
creator_url (TEXT) X
facebook_comment (INTEGER) X X
facebook_like (INTEGER) X
facebook_share (INTEGER) X X
pinterest_share (INTEGER) X
twitter_share (INTEGER) X X
tiktok_share (INTEGER) X
tiktok_comment (INTEGER) X
reddit_engagement (INTEGER) X
youtube_watch (INTEGER) X X
youtube_comment (INTEGER) X
youtube_like (INTEGER) X X
youtube_favorite (INTEGER)
youtube_subscribe (INTEGER) X
create_video (INTEGER) X

Note: creator_facebook_follow does not have any data field feeding into it. I still need to confirm the use of creator_facebook_follow (Facebook accounts' FollowAction might have been made redundant by the SubscribeAction).
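To make "coalescing" concrete, here is a minimal sketch against a simplified two-column links table, assuming the tables' update_from_csv method behaves like a COALESCE-style update (its implementation is not shown on this page): an incoming non-NULL value overwrites the stored one, while NULL leaves the stored value intact.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE links (url TEXT PRIMARY KEY, title TEXT, twitter_share INTEGER)")
con.execute("INSERT INTO links VALUES ('https://www.example.com', 'Buzzsumo title', NULL)")

# Hypothetical normalized row from a platform-specific collector
row = {"url": "https://www.example.com", "title": "Platform title", "twitter_share": 3}

# Prefer the incoming value when it is not NULL; otherwise keep the stored one
con.execute(
    """
    UPDATE links
    SET title = COALESCE(?, title),
        twitter_share = COALESCE(?, twitter_share)
    WHERE url = ?
    """,
    (row["title"], row["twitter_share"], row["url"]),
)
print(con.execute("SELECT * FROM links").fetchone())
# ('https://www.example.com', 'Platform title', 3)

This ordering is why Enrichment.__call__ (below) runs the Buzzsumo collection first: platform-specific values then replace the Buzzsumo values wherever they are not None.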


minall.enrichment.enrichment

Class for data collection and coalescing.

With the class Enrichment, this module manages the data collection process.

Enrichment

Source code in minall/enrichment/enrichment.py
class Enrichment:
    def __init__(
        self,
        links_table: LinksTable,
        shared_content_table: SharedContentTable,
        keys: APIKeys,
    ) -> None:
        """From given API keys and URL data set, filter URLs by domain and initialize data enrichment class.

        Args:
            links_table (LinksTable): LinksTable instance of the SQL table for the URL dataset.
            shared_content_table (SharedContentTable): SharedContentTable instance of the SQL table for shared content related to URLs in the dataset.
            keys (APIKeys): APIKeys class instance of minet API client configurations.
        """

        self.links_table = links_table
        self.shared_content_table = shared_content_table
        self.keys = keys
        self.filtered_links = FilteredLinks(self.links_table)

    def buzzsumo(self):
        """For all URLs, collect data from Buzzsumo and coalesce in the database's 'links' table."""

        if self.keys.buzzsumo_token:
            get_buzzsumo_data(
                data=self.filtered_links.all_links,
                token=self.keys.buzzsumo_token,
                outfile=self.links_table.outfile,
            )
            self.links_table.update_from_csv(datafile=self.links_table.outfile)

    def scraper(self):
        """For select URLs, collect data via scraping and coalesce in the database's 'links' table."""

        # In multiple threads, scrape HTML data and write to a CSV file
        get_article_text(
            data=self.filtered_links.to_scrape, outfile=self.links_table.outfile
        )
        # Coalesce the results in the CSV file to the links table
        self.links_table.update_from_csv(datafile=self.links_table.outfile)

    def other_social_media(self):
        """For select URLs, update the 'work_type' column in the database's 'links' table with the value 'SocialMediaPosting'."""
        # Assign default type to social media post
        add_type_data(
            data=self.filtered_links.other_social, outfile=self.links_table.outfile
        )
        # Coalesce the results in the CSV file to the links table
        self.links_table.update_from_csv(datafile=self.links_table.outfile)

    def twitter(self):
        """For Twitter URLs, scrape data from site and coalesce in teh database's 'links' and 'shared_content' tables."""
        get_twitter_data(
            data=self.filtered_links.twitter,
            links_outfile=self.links_table.outfile,
            shared_content_outfile=self.shared_content_table.outfile,
        )
        # Coalesce the results in the CSV files to the links and shared content tables
        self.links_table.update_from_csv(datafile=self.links_table.outfile)
        self.shared_content_table.update_from_csv(
            datafile=self.shared_content_table.outfile
        )

    def facebook(self):
        """For Facebook URLs, collect data from CrowdTangle and coalesce in the database's 'links' and 'shared_content' tables."""
        if self.keys.crowdtangle_token:
            get_facebook_post_data(
                data=self.filtered_links.facebook,
                token=self.keys.crowdtangle_token,
                rate_limit=self.keys.crowdtangle_rate_limit,
                links_outfile=self.links_table.outfile,
                shared_content_outfile=self.shared_content_table.outfile,
            )
            # Coalesce the results in the CSV files to the links and shared content tables
            self.links_table.update_from_csv(datafile=self.links_table.outfile)
            self.shared_content_table.update_from_csv(
                datafile=self.shared_content_table.outfile
            )

    def youtube(self):
        """For YouTube URLs, collect data from YouTube API and coalesce in the database's 'links' table."""
        if self.keys.youtube_key:
            # In single thread, collect YouTube API data and write to a CSV file
            get_youtube_data(
                data=self.filtered_links.youtube,
                keys=self.keys.youtube_key,
                outfile=self.links_table.outfile,
            )
            # Coalesce the results in the CSV file to the links table
            self.links_table.update_from_csv(datafile=self.links_table.outfile)

    def __call__(self, buzzsumo_only: bool):
        executor = SQLiteWrapper(connection=self.links_table.conn)
        # Apply the parsed domain name to each URL's row in the links table
        for link in self.filtered_links.all_links:
            query, domain = apply_domain(link)
            if query and domain:
                executor(query=query)

        # Must collect Buzzsumo data first because, when platform-specific data (below)
        # is not None, we want to replace the Buzzsumo data with the latter
        self.buzzsumo()

        if not buzzsumo_only:
            if len(self.filtered_links.youtube) > 0:
                self.youtube()
            if len(self.filtered_links.twitter) > 0:
                self.twitter()
            if len(self.filtered_links.facebook) > 0:
                self.facebook()
            if len(self.filtered_links.other_social) > 0:
                self.other_social_media()
            if len(self.filtered_links.to_scrape) > 0:
                self.scraper()
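For orientation, a hypothetical end-to-end call might look like the following; the exact constructor signatures of LinksTable, SharedContentTable, and APIKeys are not shown on this page, so treat the arguments as placeholders:

# Hypothetical usage sketch; constructor arguments are assumptions
links_table = LinksTable(...)                    # SQL table wrapper for the URL dataset
shared_content_table = SharedContentTable(...)   # SQL table wrapper for shared content
keys = APIKeys(...)                              # minet API client configuration

enrichment = Enrichment(links_table, shared_content_table, keys)
enrichment(buzzsumo_only=False)  # Buzzsumo first, then the platform-specific collectors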

__init__(links_table, shared_content_table, keys)

From given API keys and URL data set, filter URLs by domain and initialize data enrichment class.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| links_table | LinksTable | LinksTable instance of the SQL table for the URL dataset. | required |
| shared_content_table | SharedContentTable | SharedContentTable instance of the SQL table for shared content related to URLs in the dataset. | required |
| keys | APIKeys | APIKeys class instance of minet API client configurations. | required |
Source code in minall/enrichment/enrichment.py
def __init__(
    self,
    links_table: LinksTable,
    shared_content_table: SharedContentTable,
    keys: APIKeys,
) -> None:
    """From given API keys and URL data set, filter URLs by domain and initialize data enrichment class.

    Args:
        links_table (LinksTable): LinksTable instance of the SQL table for the URL dataset.
        shared_content_table (SharedContentTable): SharedContentTable instance of the SQL table for shared content related to URLs in the dataset.
        keys (APIKeys): APIKeys class instance of minet API client configurations.
    """

    self.links_table = links_table
    self.shared_content_table = shared_content_table
    self.keys = keys
    self.filtered_links = FilteredLinks(self.links_table)

buzzsumo()

For all URLs, collect data from Buzzsumo and coalesce in the database's 'links' table.

Source code in minall/enrichment/enrichment.py
def buzzsumo(self):
    """For all URLs, collect data from Buzzsumo and coalesce in the database's 'links' table."""

    if self.keys.buzzsumo_token:
        get_buzzsumo_data(
            data=self.filtered_links.all_links,
            token=self.keys.buzzsumo_token,
            outfile=self.links_table.outfile,
        )
        self.links_table.update_from_csv(datafile=self.links_table.outfile)

facebook()

For Facebook URLs, collect data from CrowdTangle and coalesce in the database's 'links' and 'shared_content' tables.

Source code in minall/enrichment/enrichment.py
def facebook(self):
    """For Facebook URLs, collect data from CrowdTangle and coalesce in the database's 'links' and 'shared_content' tables."""
    if self.keys.crowdtangle_token:
        get_facebook_post_data(
            data=self.filtered_links.facebook,
            token=self.keys.crowdtangle_token,
            rate_limit=self.keys.crowdtangle_rate_limit,
            links_outfile=self.links_table.outfile,
            shared_content_outfile=self.shared_content_table.outfile,
        )
        # Coalesce the results in the CSV files to the links and shared content tables
        self.links_table.update_from_csv(datafile=self.links_table.outfile)
        self.shared_content_table.update_from_csv(
            datafile=self.shared_content_table.outfile
        )

other_social_media()

For select URLs, update the 'work_type' column in the database's 'links' table with the value 'SocialMediaPosting'.

Source code in minall/enrichment/enrichment.py
def other_social_media(self):
    """For select URLs, update the 'work_type' column in the database's 'links' table with the value 'SocialMediaPosting'."""
    # Assign default type to social media post
    add_type_data(
        data=self.filtered_links.other_social, outfile=self.links_table.outfile
    )
    # Coalesce the results in the CSV file to the links table
    self.links_table.update_from_csv(datafile=self.links_table.outfile)

scraper()

For select URLs, collect data via scraping and coalesce in the database's 'links' table.

Source code in minall/enrichment/enrichment.py
def scraper(self):
    """For select URLs, collect data via scraping and coalesce in the database's 'links' table."""

    # In multiple threads, scrape HTML data and write to a CSV file
    get_article_text(
        data=self.filtered_links.to_scrape, outfile=self.links_table.outfile
    )
    # Coalesce the results in the CSV file to the links table
    self.links_table.update_from_csv(datafile=self.links_table.outfile)

twitter()

For Twitter URLs, scrape data from the site and coalesce in the database's 'links' and 'shared_content' tables.

Source code in minall/enrichment/enrichment.py
def twitter(self):
    """For Twitter URLs, scrape data from site and coalesce in teh database's 'links' and 'shared_content' tables."""
    get_twitter_data(
        data=self.filtered_links.twitter,
        links_outfile=self.links_table.outfile,
        shared_content_outfile=self.shared_content_table.outfile,
    )
    # Coalesce the results in the CSV files to the links and shared content tables
    self.links_table.update_from_csv(datafile=self.links_table.outfile)
    self.shared_content_table.update_from_csv(
        datafile=self.shared_content_table.outfile
    )

youtube()

For YouTube URLs, collect data from YouTube API and coalesce in the database's 'links' table.

Source code in minall/enrichment/enrichment.py
def youtube(self):
    """For YouTube URLs, collect data from YouTube API and coalesce in the database's 'links' table."""
    if self.keys.youtube_key:
        # In single thread, collect YouTube API data and write to a CSV file
        get_youtube_data(
            data=self.filtered_links.youtube,
            keys=self.keys.youtube_key,
            outfile=self.links_table.outfile,
        )
        # Coalesce the results in the CSV file to the links table
        self.links_table.update_from_csv(datafile=self.links_table.outfile)

minall.enrichment.utils

Functions for data collection.

This module provides the following class and functions:

  • get_domain(url) - Parse domain from URL string.
  • apply_domain(url) - Generate SQL query to insert domain into table.
  • FilteredLinks(table) - From SQL table, select subsets of URLs based on domain name.

FilteredLinks

Selects all URLs from the SQL table and returns subsets.

Source code in minall/enrichment/utils.py
class FilteredLinks:
    """Selects all URLs from SQL table and returns subsets."""

    def __init__(self, table: LinksTable) -> None:
        """Select and store all URLs from a target SQL table.

        Args:
            table (LinksTable): Target SQL table.
        """
        cursor = table.conn.cursor()
        self.all_links = [
            row[0] for row in cursor.execute(f"SELECT url FROM {table.name}").fetchall()
        ]

    @property
    def twitter(self) -> List[str]:
        """List of URLs from Twitter.

        Returns:
            List[str]: List of URL strings.
        """
        return [url for url in self.all_links if is_twitter_url(url=url)]

    @property
    def youtube(self) -> List[str]:
        """List of URLs from YouTube.

        Returns:
            List[str]: List of URL strings.
        """
        return [url for url in self.all_links if is_youtube_url(url=url)]

    @property
    def facebook(self) -> List[str]:
        """List of URLs from Facebook.

        Returns:
            List[str]: List of URL strings.
        """
        return [url for url in self.all_links if is_facebook_url(url=url)]

    @property
    def other_social(self) -> List[str]:
        """List of URLs from social media platforms.

        Returns:
            List[str]: List of URL strings.
        """
        return [
            url
            for url in self.all_links
            if get_domain(url=url)
            in [
                "facebook.com",
                "youtube.com",
                "tiktok.com",
                "instagram.com",
                "twitter.com",
                "snapchat.com",
            ]
        ]

    @property
    def to_scrape(self) -> List[str]:
        """List of URLs not from social media platforms.

        Returns:
            List[str]: List of URL strings.
        """
        return [
            url
            for url in self.all_links
            if get_domain(url=url)
            not in [
                "facebook.com",
                "youtube.com",
                "tiktok.com",
                "instagram.com",
                "twitter.com",
                "snapchat.com",
            ]
        ]
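A short usage sketch, assuming a LinksTable instance is already at hand; note that each property filters all_links anew on every access, so for very large tables it may be worth storing the subsets in local variables:

filtered = FilteredLinks(table=links_table)  # links_table is assumed to exist

print(len(filtered.all_links))  # every URL in the links table
print(filtered.youtube)         # only YouTube URLs
print(filtered.to_scrape)       # everything outside the listed social platforms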

facebook: List[str] property

List of URLs from Facebook.

Returns:

| Type | Description |
| --- | --- |
| List[str] | List of URL strings. |

other_social: List[str] property

List of URLs from social media platforms.

Returns:

| Type | Description |
| --- | --- |
| List[str] | List of URL strings. |

to_scrape: List[str] property

List of URLs not from social media platforms.

Returns:

| Type | Description |
| --- | --- |
| List[str] | List of URL strings. |

twitter: List[str] property

List of URLs from Twitter.

Returns:

| Type | Description |
| --- | --- |
| List[str] | List of URL strings. |

youtube: List[str] property

List of URLs from YouTube.

Returns:

| Type | Description |
| --- | --- |
| List[str] | List of URL strings. |

__init__(table)

Select and store all URLs from a target SQL table.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| table | LinksTable | Target SQL table. | required |
Source code in minall/enrichment/utils.py
def __init__(self, table: LinksTable) -> None:
    """Select and store all URLs from a target SQL table.

    Args:
        table (LinksTable): Target SQL table.
    """
    cursor = table.conn.cursor()
    self.all_links = [
        row[0] for row in cursor.execute(f"SELECT url FROM {table.name}").fetchall()
    ]

apply_domain(url)

Compose SQL query to update the domain column of a URL's row in the 'links' SQLite table.

Examples:

>>> apply_domain(url="https://www.youtube.com/channel/MkDocs")
("UPDATE links SET domain = 'youtube.com' WHERE url = 'https://www.youtube.com/channel/MkDocs'", 'youtube.com')

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| url | str | URL string. | required |

Returns:

| Type | Description |
| --- | --- |
| Tuple[str \| None, str \| None] | If domain was parsed, a tuple containing the SQL query and domain name. |

Source code in minall/enrichment/utils.py
def apply_domain(url: str) -> Tuple[str | None, str | None]:
    """Compose SQL query to update the domain column of a URL's row in the 'links' SQLite table.

    Examples:
        >>> apply_domain(url="https://www.youtube.com/channel/MkDocs")
        ("UPDATE links SET domain = 'youtube.com' WHERE url = 'https://www.youtube.com/channel/MkDocs'", 'youtube.com')

    Args:
        url (str): URL string.

    Returns:
        Tuple[str | None, str | None]: If domain was parsed, a tuple containing the SQL query and domain name.
    """

    query = None
    domain = get_domain(url)
    if domain:
        query = f"UPDATE {LinksConstants.table_name} SET domain = '{domain}' WHERE {LinksConstants.primary_key} = '{url}'"
    return query, domain

get_domain(url)

Parse the domain name of a given URL string.

Examples:

>>> get_domain(url="https://www.youtube.com/channel/MkDocs")
'youtube.com'

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| url | str | URL string. | required |

Returns:

| Type | Description |
| --- | --- |
| str \| None | If successfully parsed, domain name. |

Source code in minall/enrichment/utils.py
def get_domain(url: str) -> str | None:
    """Parse the domain name of a given URL string.

    Examples:
        >>> get_domain(url="https://www.youtube.com/channel/MkDocs")
        'youtube.com'

    Args:
        url (str): URL string.

    Returns:
        str | None: If successfully parsed, domain name.
    """

    domain_name = ural.get_domain_name(url)
    if domain_name in YOUTUBE_DOMAINS:
        domain_name = "youtube.com"
    return domain_name

minall.enrichment.buzzsumo

Enrichment workflow's Buzzsumo data collection.

Modules exported by this package:

  • normalizer: Dataclass to normalize minet's Buzzsumo result object.
  • contexts: Context manager for client's CSV writers, multi-threader, and progress bar.
  • get_data: Function that runs all of the Buzzsumo enrichment process.
  • client: Wrapper for minet's Buzzsumo API client that normalizes minet's result.

minall.enrichment.buzzsumo.normalizer

Module contains constants for minet's Buzzsumo API client and a dataclass to normalize minet's Buzzsumo result.

NormalizedBuzzsumoResult dataclass

Bases: TabularRecord

Dataclass to normalize minet's Buzzsumo API result.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| url | str | Target URL searched in Buzzsumo database. |
| work_type | str | Target URL's ontological subtype, i.e. "WebPage", "Article", "VideoObject". |
| domain | str | Domain name of the target URL. |
| twitter_share | int | Number of times the target URL appeared on Twitter. |
| facebook_share | int | Number of times the target URL appeared on Facebook. |
| title | str | Title of target URL web content. |
| date_published | datetime | Date target URL web content was published. |
| pinterest_share | int | Number of times the target URL appeared on Pinterest. |
| creator_name | str | Entity intellectually responsible for (author of) the target URL's web content. |
| creator_identifier | str | If target URL is a social media post, the platform's identifier for the author. |
| duration | int | If the target URL is of a video, the video's duration. |
| facebook_comment | int | Number of times Facebook users commented on a post containing the target URL. |
| youtube_watch | int | If the target URL is of a YouTube video, number of times YouTube users watched the video. |
| youtube_like | int | If the target URL is of a YouTube video, number of times YouTube users liked the video. |
| youtube_comment | int | If the target URL is of a YouTube video, number of times YouTube users commented on the video. |
| tiktok_share | int | If the target URL is of TikTok content, number of shares on TikTok. |
| tiktok_comment | int | If the target URL is of TikTok content, number of times TikTok users commented on the content. |
| reddit_engagement | int | Number of times the target URL appeared on Reddit. |
Source code in minall/enrichment/buzzsumo/normalizer.py
@dataclass
class NormalizedBuzzsumoResult(TabularRecord):
    """Dataclass to normalize minet's Buzzsumo API result.

    Attributes:
        url (str): Target URL searched in Buzzsumo database.
        work_type (str): Target URL's ontological subtype, i.e. "WebPage", "Article", "VideoObject".
        domain (Optional[str]): Domain name of the target URL.
        twitter_share (int): Number of times the target URL appeared on Twitter.
        facebook_share (int): Number of times the target URL appeared on Facebook.
        title (str): Title of target URL web content.
        date_published (datetime): Date target URL web content was published.
        pinterest_share (int): Number of times the target URL appeared on Pinterest.
        creator_name (str): Entity intellectually responsible for (author of) the target URL's web content.
        creator_identifier (str): If target URL is a social media post, the platform's identifier for the author.
        duration (int): If the target URL is of a video, the video's duration.
        facebook_comment (int): Number of times Facebook users commented on a post containing the target URL.
        youtube_watch (int): If the target URL is of a YouTube video, number of times YouTube users watched the video.
        youtube_like (int): If the target URL is of a YouTube video, number of times YouTube users liked the video.
        youtube_comment (int): If the target URL is of a YouTube video, number of times YouTube users commented on the video.
        tiktok_share (int): If the target URL is of TikTok content, number of shares on TikTok.
        tiktok_comment (int): If the target URL is of TikTok content, number of times TikTok users commented on the content.
        reddit_engagement (int): Number of times the target URL appeared on Reddit.
    """

    url: str
    work_type: str
    domain: Optional[str]
    twitter_share: Optional[int]
    facebook_share: Optional[int]
    title: Optional[str]
    date_published: Optional[datetime]
    pinterest_share: Optional[int]
    creator_name: Optional[str]
    creator_identifier: Optional[str]
    duration: Optional[int]
    facebook_comment: Optional[int]
    youtube_watch: Optional[int]
    youtube_like: Optional[int]
    youtube_comment: Optional[int]
    tiktok_share: Optional[int]
    tiktok_comment: Optional[int]
    reddit_engagement: Optional[int]

    @classmethod
    def parse_buzzsumo_type(cls, data: BuzzsumoArticle | None) -> str:
        """Helper function for transforming Buzzsumo's content classification into Schema.org subtype.

        Args:
            data (BuzzsumoArticle | None): If target URL was found in Buzzsumo database, minet's result; otherwise, None.

        Returns:
            str: Schema for web content's subtype.
        """
        video_types = ["is_video"]
        article_types = [
            "is_general_article",
            "is_how_to_article",
            "is_infographic",
            "is_interview",
            "is_list",
            "is_newsletter",
            "is_press_release",
            "is_review",
            "is_what_post",
            "is_why_post",
        ]
        work_type = "WebPage"
        if data:
            for t in video_types:
                if getattr(data, t):
                    return "VideoObject"
            for t in article_types:
                if getattr(data, t):
                    return "Article"
        return work_type

    @classmethod
    def from_payload(
        cls,
        url: str,
        data: BuzzsumoArticle | None,
    ) -> "NormalizedBuzzsumoResult":
        """Parses minet's Buzzsumo result and creates normalized dataclass.

        Args:
            url (str): Target URL searched in Buzzsumo's database.
            data (BuzzsumoArticle | None): If target URL was found in Buzzsumo database, minet's result; otherwise, None.

        Returns:
            NormalizedBuzzsumoResult: Dataclass that normalizes minet's Buzzsumo data.
        """
        if data:
            return NormalizedBuzzsumoResult(
                url=url,
                domain=get_domain(url),
                work_type=cls.parse_buzzsumo_type(data),
                twitter_share=getattr(data, "twitter_shares"),
                facebook_share=getattr(data, "facebook_shares"),
                title=getattr(data, "title"),
                date_published=getattr(data, "published_date"),
                pinterest_share=getattr(data, "pinterest_shares"),
                creator_name=getattr(data, "author_name"),
                creator_identifier=getattr(data, "twitter_user_id"),
                duration=getattr(data, "video_length"),
                facebook_comment=getattr(data, "facebook_comments"),
                youtube_watch=getattr(data, "youtube_views"),
                youtube_like=getattr(data, "youtube_likes"),
                youtube_comment=getattr(data, "youtube_comments"),
                tiktok_share=getattr(data, "tiktok_share_count"),
                tiktok_comment=getattr(data, "tiktok_comment_count"),
                reddit_engagement=getattr(data, "total_reddit_engagements"),
            )
        else:
            return NormalizedBuzzsumoResult(
                url=url,
                work_type=cls.parse_buzzsumo_type(data),
                domain=None,
                twitter_share=None,
                facebook_share=None,
                title=None,
                date_published=None,
                pinterest_share=None,
                creator_name=None,
                creator_identifier=None,
                duration=None,
                facebook_comment=None,
                youtube_watch=None,
                youtube_like=None,
                youtube_comment=None,
                tiktok_share=None,
                tiktok_comment=None,
                reddit_engagement=None,
            )
from_payload(url, data) classmethod

Parses minet's Buzzsumo result and creates normalized dataclass.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| url | str | Target URL searched in Buzzsumo's database. | required |
| data | BuzzsumoArticle \| None | If target URL was found in Buzzsumo database, minet's result; otherwise, None. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| NormalizedBuzzsumoResult | NormalizedBuzzsumoResult | Dataclass that normalizes minet's Buzzsumo data. |

Source code in minall/enrichment/buzzsumo/normalizer.py
@classmethod
def from_payload(
    cls,
    url: str,
    data: BuzzsumoArticle | None,
) -> "NormalizedBuzzsumoResult":
    """Parses minet's Buzzsumo result and creates normalized dataclass.

    Args:
        url (str): Target URL searched in Buzzsumo's database.
        data (BuzzsumoArticle | None): If target URL was found in Buzzsumo database, minet's result; otherwise, None.

    Returns:
        NormalizedBuzzsumoResult: Dataclass that normalizes minet's Buzzsumo data.
    """
    if data:
        return NormalizedBuzzsumoResult(
            url=url,
            domain=get_domain(url),
            work_type=cls.parse_buzzsumo_type(data),
            twitter_share=getattr(data, "twitter_shares"),
            facebook_share=getattr(data, "facebook_shares"),
            title=getattr(data, "title"),
            date_published=getattr(data, "published_date"),
            pinterest_share=getattr(data, "pinterest_shares"),
            creator_name=getattr(data, "author_name"),
            creator_identifier=getattr(data, "twitter_user_id"),
            duration=getattr(data, "video_length"),
            facebook_comment=getattr(data, "facebook_comments"),
            youtube_watch=getattr(data, "youtube_views"),
            youtube_like=getattr(data, "youtube_likes"),
            youtube_comment=getattr(data, "youtube_comments"),
            tiktok_share=getattr(data, "tiktok_share_count"),
            tiktok_comment=getattr(data, "tiktok_comment_count"),
            reddit_engagement=getattr(data, "total_reddit_engagements"),
        )
    else:
        return NormalizedBuzzsumoResult(
            url=url,
            work_type=cls.parse_buzzsumo_type(data),
            domain=None,
            twitter_share=None,
            facebook_share=None,
            title=None,
            date_published=None,
            pinterest_share=None,
            creator_name=None,
            creator_identifier=None,
            duration=None,
            facebook_comment=None,
            youtube_watch=None,
            youtube_like=None,
            youtube_comment=None,
            tiktok_share=None,
            tiktok_comment=None,
            reddit_engagement=None,
        )
parse_buzzsumo_type(data) classmethod

Helper function for transforming Buzzsumo's content classification into Schema.org subtype.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | BuzzsumoArticle \| None | If target URL was found in Buzzsumo database, minet's result; otherwise, None. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| str | str | Schema for web content's subtype. |

Source code in minall/enrichment/buzzsumo/normalizer.py
@classmethod
def parse_buzzsumo_type(cls, data: BuzzsumoArticle | None) -> str:
    """Helper function for transforming Buzzsumo's content classification into Schema.org subtype.

    Args:
        data (BuzzsumoArticle | None): If target URL was found in Buzzsumo database, minet's result; otherwise, None.

    Returns:
        str: Schema for web content's subtype.
    """
    video_types = ["is_video"]
    article_types = [
        "is_general_article",
        "is_how_to_article",
        "is_infographic",
        "is_interview",
        "is_list",
        "is_newsletter",
        "is_press_release",
        "is_review",
        "is_what_post",
        "is_why_post",
    ]
    work_type = "WebPage"
    if data:
        for t in video_types:
            if getattr(data, t):
                return "VideoObject"
        for t in article_types:
            if getattr(data, t):
                return "Article"
    return work_type
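The normalizer degrades gracefully when Buzzsumo returns nothing: from_payload still emits a full record, so the output CSV keeps one row per queried URL. A small sketch of that fallback path:

# When minet finds no Buzzsumo match, every metric column stays empty
record = NormalizedBuzzsumoResult.from_payload("https://www.example.com", None)
assert record.work_type == "WebPage"  # parse_buzzsumo_type's default
assert record.twitter_share is None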

minall.enrichment.buzzsumo.client

Module containing a wrapper for minet's Buzzsumo API client.

BuzzsumoClient

Wrapper for minet's Buzzsumo API client.

Examples:

>>> import os
>>> wrapper = BuzzsumoClient(token=os.environ["BUZZSUMO_TOKEN"])
>>> url="https://archive.fosdem.org/2020/schedule/event/open_research_web_mining/"
>>> wrapper(url)
NormalizedBuzzsumoResult(url='https://archive.fosdem.org/2020/schedule/event/open_research_web_mining/', work_type='Article', domain='fosdem.org', twitter_share=0, facebook_share=0, title='FOSDEM 2020 - Empowering social scientists with web mining tools', date_published=datetime.datetime(2024, 1, 4, 15, 48, 1), pinterest_share=0, creator_name=None, creator_identifier=None, duration=None, facebook_comment=0, youtube_watch=None, youtube_like=None, youtube_comment=None, tiktok_share=None, tiktok_comment=None, reddit_engagement=0)
Source code in minall/enrichment/buzzsumo/client.py
class BuzzsumoClient:
    """Wrapper for minet's Buzzsumo API client.

    Examples:
        >>> import os
        >>> wrapper = BuzzsumoClient(token=os.environ["BUZZSUMO_TOKEN"])
        >>> url="https://archive.fosdem.org/2020/schedule/event/open_research_web_mining/"
        >>> wrapper(url)
        NormalizedBuzzsumoResult(url='https://archive.fosdem.org/2020/schedule/event/open_research_web_mining/', work_type='Article', domain='fosdem.org', twitter_share=0, facebook_share=0, title='FOSDEM 2020 - Empowering social scientists with web mining tools', date_published=datetime.datetime(2024, 1, 4, 15, 48, 1), pinterest_share=0, creator_name=None, creator_identifier=None, duration=None, facebook_comment=0, youtube_watch=None, youtube_like=None, youtube_comment=None, tiktok_share=None, tiktok_comment=None, reddit_engagement=0)
    """

    def __init__(self, token: str) -> None:
        """Creates an instance of mient's BuzzsumoAPIClient and sets values for the Buzzsumo API's requried begin-date and end-date parameters.

        Examples:
            >>> wrapper = BuzzsumoClient(token="<TOKEN>")
            >>> type(wrapper)
            <class 'minall.enrichment.buzzsumo.client.BuzzsumoClient'>
            >>> type(wrapper.client)
            <class 'minet.buzzsumo.client.BuzzSumoAPIClient'>

        Args:
            token (str): Buzzsumo API token.
        """
        self.client = BuzzSumoAPIClient(token=token)
        self.begin = BEGINDATE
        self.end = ENDDATE

    def __call__(self, url: str) -> NormalizedBuzzsumoResult:
        """Executes mient's Buzzsumo API client on a URL and returns normalized data.

        Args:
            url (str): Target URL.

        Returns:
            NormalizedBuzzsumoResult: Dataclass that normalizes minet's Buzzsumo result.
        """
        result = self.client.exact_url(
            search_url=url, begin_timestamp=self.begin, end_timestamp=self.end
        )
        return NormalizedBuzzsumoResult.from_payload(url, result)
__call__(url)

Executes minet's Buzzsumo API client on a URL and returns normalized data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| url | str | Target URL. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| NormalizedBuzzsumoResult | NormalizedBuzzsumoResult | Dataclass that normalizes minet's Buzzsumo result. |

Source code in minall/enrichment/buzzsumo/client.py
def __call__(self, url: str) -> NormalizedBuzzsumoResult:
    """Executes mient's Buzzsumo API client on a URL and returns normalized data.

    Args:
        url (str): Target URL.

    Returns:
        NormalizedBuzzsumoResult: Dataclass that normalizes minet's Buzzsumo result.
    """
    result = self.client.exact_url(
        search_url=url, begin_timestamp=self.begin, end_timestamp=self.end
    )
    return NormalizedBuzzsumoResult.from_payload(url, result)
__init__(token)

Creates an instance of minet's BuzzSumoAPIClient and sets values for the Buzzsumo API's required begin-date and end-date parameters.

Examples:

>>> wrapper = BuzzsumoClient(token="<TOKEN>")
>>> type(wrapper)
<class 'minall.enrichment.buzzsumo.client.BuzzsumoClient'>
>>> type(wrapper.client)
<class 'minet.buzzsumo.client.BuzzSumoAPIClient'>

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| token | str | Buzzsumo API token. | required |
Source code in minall/enrichment/buzzsumo/client.py
def __init__(self, token: str) -> None:
    """Creates an instance of mient's BuzzsumoAPIClient and sets values for the Buzzsumo API's requried begin-date and end-date parameters.

    Examples:
        >>> wrapper = BuzzsumoClient(token="<TOKEN>")
        >>> type(wrapper)
        <class 'minall.enrichment.buzzsumo.client.BuzzsumoClient'>
        >>> type(wrapper.client)
        <class 'minet.buzzsumo.client.BuzzSumoAPIClient'>

    Args:
        token (str): Buzzsumo API token.
    """
    self.client = BuzzSumoAPIClient(token=token)
    self.begin = BEGINDATE
    self.end = ENDDATE

minall.enrichment.buzzsumo.get_data

Module containing a function that runs all of the Buzzsumo enrichment process.

get_buzzsumo_data(data, token, outfile)

Main function for writing Buzzsumo API results to a CSV file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | List[str] | List of URLs. | required |
| token | str | Token for Buzzsumo API. | required |
| outfile | Path | Path to CSV file in which to write results. | required |
Source code in minall/enrichment/buzzsumo/get_data.py
def get_buzzsumo_data(data: List[str], token: str, outfile: Path):
    """Main function for writing Buzzsumo API results to a CSV file.

    Args:
        data (List[str]): List of URLs.
        token (str): Token for Buzzsumo API.
        outfile (Path): Path to CSV file in which to write results.
    """
    with WriterContext(links_file=outfile) as writer:
        # Save results to memory, trigger Global Interpreter Lock (GIL)
        for result in yield_buzzsumo_data(token, data):
            writer.writerow(result.as_csv_dict_row())

minall.enrichment.buzzsumo.contexts

Module containing contexts for Buzzsumo data collection's CSV writer, progress bar, and multi-threader.

GeneratorContext

Source code in minall/enrichment/buzzsumo/contexts.py
class GeneratorContext:
    def __init__(self) -> None:
        """Set up class for Buzzsumo client wrapper's contexts."""
        pass

    def __enter__(self) -> Tuple[Progress, ThreadPoolExecutor]:
        """Start the wrapper's context variables.

        Returns:
            Tuple[Progress, ThreadPoolExecutor]: Context variables.
        """
        self.progress_bar = Progress(
            TextColumn("[progress.description]{task.description}"),
            SpinnerColumn(),
            MofNCompleteColumn(),
            TimeElapsedColumn(),
        )
        self.progress_bar.start()

        self.executor = ThreadPoolExecutor()

        return self.progress_bar, self.executor

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Stop the Buzzsumo client wrapper's context variables."""
        self.progress_bar.stop()
        self.executor.shutdown(wait=False, cancel_futures=True)
__enter__()

Start the wrapper's context variables.

Returns:

| Type | Description |
| --- | --- |
| Tuple[Progress, ThreadPoolExecutor] | Context variables. |

Source code in minall/enrichment/buzzsumo/contexts.py
def __enter__(self) -> Tuple[Progress, ThreadPoolExecutor]:
    """Start the wrapper's context variables.

    Returns:
        Tuple[Progress, ThreadPoolExecutor]: Context variables.
    """
    self.progress_bar = Progress(
        TextColumn("[progress.description]{task.description}"),
        SpinnerColumn(),
        MofNCompleteColumn(),
        TimeElapsedColumn(),
    )
    self.progress_bar.start()

    self.executor = ThreadPoolExecutor()

    return self.progress_bar, self.executor
__exit__(exc_type, exc_val, exc_tb)

Stop the Buzzsumo client wrapper's context variables.

Source code in minall/enrichment/buzzsumo/contexts.py
def __exit__(self, exc_type, exc_val, exc_tb):
    """Stop the Buzzsumo client wrapper's context variables."""
    self.progress_bar.stop()
    self.executor.shutdown(wait=False, cancel_futures=True)
__init__()

Set up class for Buzzsumo client wrapper's contexts.

Source code in minall/enrichment/buzzsumo/contexts.py
def __init__(self) -> None:
    """Set up class for Buzzsumo client wrapper's contexts."""
    pass

WriterContext

Source code in minall/enrichment/buzzsumo/contexts.py
class WriterContext:
    def __init__(self, links_file: Path):
        """Set up class for iteratively writing normalized Buzzsumo results to CSV.

        Args:
            links_file (Path): Path to the links table CSV file.
        """
        self.links_file = links_file

    def __enter__(self) -> csv.DictWriter:
        """Start the CSV writer's context.

        Returns:
            csv.DictWriter: Context variable for writing CSV rows.
        """
        # Set up links file writer
        self.links_file_obj = open(self.links_file, mode="w")
        self.links_file_writer = csv.DictWriter(
            self.links_file_obj, fieldnames=LinksConstants.col_names
        )
        self.links_file_writer.writeheader()

        return self.links_file_writer

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Stops the writer's context variable."""
        if self.links_file_obj:
            self.links_file_obj.close()
__enter__()

Start the CSV writer's context.

Returns:

| Type | Description |
| --- | --- |
| csv.DictWriter | Context variable for writing CSV rows. |

Source code in minall/enrichment/buzzsumo/contexts.py
def __enter__(self) -> csv.DictWriter:
    """Start the CSV writer's context.

    Returns:
        csv.DictWriter: Context variable for writing CSV rows.
    """
    # Set up links file writer
    self.links_file_obj = open(self.links_file, mode="w")
    self.links_file_writer = csv.DictWriter(
        self.links_file_obj, fieldnames=LinksConstants.col_names
    )
    self.links_file_writer.writeheader()

    return self.links_file_writer
__exit__(exc_type, exc_val, exc_tb)

Stops the writer's context variable.

Source code in minall/enrichment/buzzsumo/contexts.py
def __exit__(self, exc_type, exc_val, exc_tb):
    """Stops the writer's context variable."""
    if self.links_file_obj:
        self.links_file_obj.close()
__init__(links_file)

Set up class for iteratively writing normalized Buzzsumo results to CSV.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| links_file | Path | Path to the links table CSV file. | required |
Source code in minall/enrichment/buzzsumo/contexts.py
def __init__(self, links_file: Path):
    """Set up class for iteratively writing normalized Buzzsumo results to CSV.

    Args:
        links_file (Path): Path to the links table CSV file.
    """
    self.links_file = links_file
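get_buzzsumo_data (above) iterates over a yield_buzzsumo_data generator whose source is not reproduced on this page. As an assumption about how the pieces fit together, it plausibly combines GeneratorContext and BuzzsumoClient along these lines:

from typing import Iterator, List

def yield_buzzsumo_data(token: str, data: List[str]) -> Iterator[NormalizedBuzzsumoResult]:
    """Hypothetical sketch: query each URL in a thread pool and yield normalized results."""
    client = BuzzsumoClient(token=token)
    with GeneratorContext() as (progress, executor):
        task = progress.add_task("[bold green]Querying Buzzsumo", total=len(data))
        # executor.map calls client(url) for each URL across worker threads
        for result in executor.map(client, data):
            progress.advance(task)
            yield result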

minall.enrichment.twitter

minall.enrichment.twitter.normalizer

Module contains constants for minet's Twitter Guest API Scraper and a dataclass to normalize minet's result.

NormalizedSharedLink dataclass

Bases: TabularRecord

Dataclass to normalize media embedded in a Tweet.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| post_url | str | URL of the Tweet in which the content is embedded. |
| content_url | str | URL of the embedded content. |
| media_type | str | Content URL's ontological subtype, i.e. "VideoObject". |
| height | int \| None | Height of the embedded visual media content. Default = None. |
| width | int \| None | Width of the embedded visual media content. Default = None. |
Source code in minall/enrichment/twitter/normalizer.py
@dataclass
class NormalizedSharedLink(TabularRecord):
    """Dataclass to normalized media embedded in Tweet.

    Attributes:
        post_url (str): URL of the Tweet in which the content is embedded.
        content_url (str): URL of the embedded content.
        media_type (str): Content URL's ontological subtype, i.e. "VideoObject".
        height (int | None): Height of the embedded visual media content. Default = None.
        width (int | None ): Width of the embedded visual media content. Default = None.
    """

    post_url: str
    content_url: str
    media_type: str
    height: int | None = None
    width: int | None = None

NormalizedTweet dataclass

Bases: TabularRecord

Dataclass to normalize minet's Twitter API result.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| url | str | Target URL of Tweet searched on Twitter. |
| identifier | Optional[str] | Tweet ID. Default = None. |
| date_published | Optional[datetime] | UTC timestamp of Tweet publication, parsed as DateTime object. Default = None. |
| text | Optional[str] | Text of Tweet. Default = None. |
| creator_date_created | Optional[datetime] | UTC timestamp of User account creation, parsed as DateTime object. Default = None. |
| creator_identifier | Optional[str] | User account ID. Default = None. |
| creator_twitter_follow | Optional[int] | Number of accounts that follow the User account that published the Tweet. Default = None. |
| creator_name | Optional[str] | Name of the User account. Default = None. |
| twitter_share | Optional[int] | Number of times the Tweet was retweeted. Default = None. |
| twitter_like | Optional[int] | Number of times the Tweet was liked. Default = None. |
| domain | str | Domain name of the target URL. Default = "twitter.com". |
| work_type | str | Target URL's ontological subtype. Default = "SocialMediaPosting". |
| hashtags | List | Hashtags embedded in Tweet. Default = []. |
| creator_type | str | Ontological subtype of the User account. Default = "defacto:SocialMediaAccount". |
Source code in minall/enrichment/twitter/normalizer.py
@dataclass
class NormalizedTweet(TabularRecord):
    """Dataclass to normalize minet's Twitter API result.

    Attributes:
        url (str): Target URL of Tweet searched on Twitter.
        identifier (Optional[str]): Tweet ID. Default = None.
        date_published (Optional[datetime]): UTC timestamp of Tweet publication, parsed as DateTime object. Default = None.
        text (Optional[str]): Text of Tweet. Default = None.
        creator_date_created (Optional[datetime]): UTC timestamp of User account creation, parsed as DateTime object. Default = None.
        creator_identifier (Optional[str]): User account ID. Default = None.
        creator_twitter_follow (Optional[int]): Number of accounts that follow the User account that published the Tweet. Default = None.
        creator_name (Optional[str]): Name of the User account. Default = None.
        twitter_share (Optional[int]): Number of times the Tweet was retweeted. Default = None.
        twitter_like (Optional[int]): Number of times the Tweet was liked. Default = None.
        domain (str): Domain name of the target URL. Default = "twitter.com".
        work_type (str): Target URL's ontological subtype. Default = "SocialMediaPosting".
        hashtags (List): Hashtags embedded in Tweet. Default = [].
        creator_type (str): Ontological subtype of the User account. Default = "defacto:SocialMediaAccount".
    """

    url: str
    identifier: Optional[str] = None
    date_published: Optional[datetime] = None
    text: Optional[str] = None
    creator_date_created: Optional[datetime] = None
    creator_identifier: Optional[str] = None
    creator_twitter_follow: Optional[int] = None
    creator_name: Optional[str] = None
    twitter_share: Optional[int] = None
    twitter_like: Optional[int] = None
    domain: str = "twitter.com"
    work_type: str = "SocialMediaPosting"
    hashtags: List = field(default_factory=lambda: [])
    creator_type: str = "defacto:SocialMediaAccount"

    @classmethod
    def from_payload(cls, url: str, tweet: Dict | None) -> "NormalizedTweet":
        """Parse minet's scraped Tweet metadata into a normalized dataclass; when tweet is None, return a record holding only the URL and defaults."""
        if tweet:
            return NormalizedTweet(
                url=url,
                identifier=tweet["id"],
                date_published=datetime.fromtimestamp(tweet["timestamp_utc"]),
                text=tweet["text"],
                creator_date_created=datetime.fromtimestamp(
                    tweet["user_timestamp_utc"]
                ),
                creator_identifier=tweet["user_id"],
                creator_twitter_follow=tweet["user_followers"],
                creator_name=tweet["user_name"],
                twitter_share=tweet["retweet_count"],
                twitter_like=tweet["like_count"],
                hashtags=tweet["hashtags"],
            )
        else:
            return NormalizedTweet(url=url)

parse_shared_content(url, tweet)

Parse the potentially multiple media embedded in a Tweet and yield each one.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| url | str | URL of Tweet with embedded media. | required |
| tweet | Dict \| None | Full metadata of Tweet. | required |

Yields:

| Type | Description |
| --- | --- |
| Generator[NormalizedSharedLink \| None, None, None] | If media could be parsed, its normalized metadata. |

Source code in minall/enrichment/twitter/normalizer.py
def parse_shared_content(
    url: str, tweet: Dict | None
) -> Generator[NormalizedSharedLink | None, None, None]:
    """Parse the potentially multiple media embedded in a Tweet and yield each one.

    Args:
        url (str): URL of Tweet with embedded media.
        tweet (Dict | None): Full metadata of Tweet.

    Yields:
        Generator[NormalizedSharedLink | None, None, None]: If media could be parsed, its normalized metadata.
    """
    if tweet:
        media_types = tweet.get("media_types")
        media_urls = tweet.get("media_urls")
        media_files = tweet.get("media_files")
        if (
            media_types
            and media_urls
            and media_files
            and len(media_files) == len(media_urls)
            and len(media_urls) == len(media_types)
        ):
            medias = list(zip(media_types, media_urls, media_files))
            for media_type, media_url, media_file in medias:
                if media_type == "video":
                    media_type = "VideoObject"
                elif media_type == "photo":
                    media_type = "ImageObject"
                else:
                    media_type = "MediaObject"
                yield NormalizedSharedLink(
                    post_url=url, content_url=media_url, media_type=media_type
                )
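A small illustration of the media-type mapping, using a hand-built tweet dictionary; the keys mirror the ones the function reads, while the values are made up:

fake_tweet = {
    "media_types": ["photo", "video"],
    "media_urls": ["https://pbs.twimg.com/img1", "https://video.twimg.com/v1"],
    "media_files": ["img1.jpg", "v1.mp4"],
}

for shared in parse_shared_content("https://twitter.com/u/status/1", fake_tweet):
    print(shared.media_type)  # ImageObject, then VideoObject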

minall.enrichment.twitter.scraper

Module contains the TweetScraper wrapper for setting up and calling minet's TwitterGuestAPIScraper.

TweetScraper

Wrapper for running minet's Tweet Guest API Scraper.

Examples:

>>> scraper = TweetScraper()
>>>
>>> tweet_url = "https://twitter.com/Paris2024/status/1551605445156012038"
>>>
>>> url, tweet = scraper(tweet_url)
>>>
>>> url
'https://twitter.com/Paris2024/status/1551605445156012038'
>>>
>>> tweet["local_time"]
'2022-07-25T16:29:00'
Source code in minall/enrichment/twitter/scraper.py
class TweetScraper:
    """Wrapper for running minet's Tweet Guest API Scraper.

    Examples:
        >>> scraper = TweetScraper()
        >>>
        >>> tweet_url = "https://twitter.com/Paris2024/status/1551605445156012038"
        >>>
        >>> url, tweet = scraper(tweet_url)
        >>>
        >>> url
        'https://twitter.com/Paris2024/status/1551605445156012038'
        >>>
        >>> tweet["local_time"]
        '2022-07-25T16:29:00'
    """

    def __init__(self) -> None:
        """Set up minet's Twitter Guest API Scraper"""

        self.scraper = TwitterGuestAPIScraper()

    def __call__(self, url: str) -> Tuple[str, Dict | None]:
        """If the URL is of a Tweet and the ID can be parsed, scrape and return data.

        Args:
            url (str): URL of Tweet.

        Returns:
            Tuple[str, Dict | None]: If data could be scraped, the target URL and the data; otherwise the unsuccessful URL and None.
        """
        result = None
        parsed_url = parse_twitter_url(url)
        if isinstance(parsed_url, TwitterTweet):
            tweet_id = getattr(parsed_url, "id")
            if tweet_id is not None:
                try:
                    result = self.scraper.tweet(tweet_id)
                except Exception:
                    pass
        return url, result
__call__(url)

If the URL is of a Tweet and the ID can be parsed, scrape and return data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| url | str | URL of Tweet. | required |

Returns:

| Type | Description |
| --- | --- |
| Tuple[str, Dict \| None] | If data could be scraped, the target URL and the data; otherwise the unsuccessful URL and None. |

Source code in minall/enrichment/twitter/scraper.py
def __call__(self, url: str) -> Tuple[str, Dict | None]:
    """If the URL is of a Tweet and the ID can be parsed, scrape and return data.

    Args:
        url (str): URL of Tweet.

    Returns:
        Tuple[str, Dict | None]: If data could be scraped, the target URL and the data; otherwise the unsuccessful URL and None.
    """
    result = None
    parsed_url = parse_twitter_url(url)
    if isinstance(parsed_url, TwitterTweet):
        tweet_id = getattr(parsed_url, "id")
        if tweet_id is not None:
            try:
                result = self.scraper.tweet(tweet_id)
            except Exception:
                pass
    return url, result
__init__()

Set up minet's Twitter Guest API Scraper

Source code in minall/enrichment/twitter/scraper.py
def __init__(self) -> None:
    """Set up minet's Twitter Guest API Scraper"""

    self.scraper = TwitterGuestAPIScraper()

minall.enrichment.twitter.get_data

Module contains functions for collecting, normalizing, and writing data from Twitter.

get_twitter_data(data, links_outfile, shared_content_outfile)

Transforms a set of Twitter URLs into collected Tweet metadata, written to CSV files.

Parameters:

Name Type Description Default
data List[str]

Set of Twitter URLs.

required
links_outfile Path

Path to CSV file for Tweet metadata.

required
shared_content_outfile Path

Path to CSV file for metadata about links in Tweets.

required
Source code in minall/enrichment/twitter/get_data.py
def get_twitter_data(
    data: List[str],
    links_outfile: Path,
    shared_content_outfile: Path,
) -> None:
    """Transforms a set of Twitter URLs into collected Tweet metadata, written to CSV files.

    Args:
        data (List[str]): Set of Twitter URLs.
        links_outfile (Path): Path to CSV file for Tweet metadata.
        shared_content_outfile (Path): Path to CSV file for metadata about links in Tweets.
    """
    with ContextManager(links_outfile, shared_content_outfile) as contexts:
        links_writer, shared_content_writer, progress = contexts

        t = progress.add_task(description="[bold blue]Querying Tweets", total=len(data))
        scraper = TweetScraper()

        for url in data:
            url, tweet = scraper(url)
            formatted_tweet = NormalizedTweet.from_payload(url=url, tweet=tweet)
            links_writer.writerow(formatted_tweet.as_csv_dict_row())

            # Write the shared content data
            for shared_link in parse_shared_content(url=url, tweet=tweet):
                if shared_link:
                    shared_content_writer.writerow(
                        shared_link.as_csv_dict_row()
                    )

            progress.advance(t)
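A minimal, untested sketch of how this function is invoked (file paths hypothetical; the tweet URL is taken from the scraper's examples above):

from pathlib import Path

from minall.enrichment.twitter.get_data import get_twitter_data

get_twitter_data(
    data=["https://twitter.com/Paris2024/status/1551605445156012038"],
    links_outfile=Path("links.csv"),
    shared_content_outfile=Path("shared_content.csv"),
)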

minall.enrichment.crowdtangle

Enrichment workflow's CrowdTangle data collection.

Modules exported by this package:

  • normalizer: Dataclass to normalize minet's CrowdTangle result object.
  • contexts: Context manager for client's CSV writers, multi-threader, and progress bar.
  • get_data: Function that runs all of the CrowdTangle enrichment process.
  • client: Wrapper for minet's CrowdTangle API client that normalizes minet's result.
  • exceptions: Exceptions raised during data collection from the CrowdTangle API.

minall.enrichment.crowdtangle.normalizer

Module contains functions and dataclasses to normalize minet's CrowdTangle API client result.

NormalizedFacebookPost dataclass

Bases: TabularRecord

Dataclass to normalize minet's CrowdTangle API result for Facebook posts.

Attributes:

Name Type Description
url str

Target Facebook URL.

work_type str

Target Facebook content's ontological subtype, i.e. "SocialMediaPosting", "ImageObject", "VideoObject".

duration str

If a Facebook content is a video, the video's duration.

identifier str

Facebook's identifier for the post.

date_published str

Date of the Facebook post's publication.

date_modified str

Date when the Facebook post was last modified.

title str

If applicable, title of the Facebook post content.

abstract str

If applicable, description of the Facebook post content.

text str

If applicable, text of Facebook post content.

creator_identifier str

Facebook's identifier for the post's creator.

creator_name str

Name of entity responsible for the Facebook post publication.

creator_location_created str

If available, the principal country in which the entity responsible for the post's publication is located.

creator_url str

URL for the entity responsible for the Facebook post's publication.

creator_facebook_subscribe int

Number of Facebook accounts subscribed to the account of the entity responsible for the Facebook post's publication.

facebook_comment int

Number of comments on the Facebook post.

facebook_like int

Number of Facebook accounts that have liked the Facebook post.

facebook_share int

Number of times the Facebook post has been shared on Facebook.

domain str

Domain for the Facebook post's URL. Default = "facebook.com".

creator_type str

Ontological subtype for the Facebook post's creator. Default = "defacto:SocialMediaAccount".

Source code in minall/enrichment/crowdtangle/normalizer.py
@dataclass
class NormalizedFacebookPost(TabularRecord):
    """Dataclass to normalize minet's CrowdTangle API result for Facebook posts.

    Attributes:
        url (str): Target Facebook URL.
        work_type (str): Target Facebook content's ontological subtype, i.e. "SocialMediaPosting", "ImageObject", "VideoObject".
        duration (str): If a Facebook content is a video, the video's duration.
        identifier (str): Facebook's identifier for the post.
        date_published (str): Date of the Facebook post's publication.
        date_modified (str): Date when the Facebook post was last modified.
        title (str): If applicable, title of the Facebook post content.
        abstract (str): If applicable, description of the Facebook post content.
        text (str): If applicable, text of Facebook post content.
        creator_identifier (str): Facebook's identifier for the post's creator.
        creator_name (str): Name of entity responsible for the Facebook post publication.
        creator_location_created (str): If available, the principal country in which the entity responsible for the post's publication is located.
        creator_url (str): URL for the entity responsible for the Facebook post's publication.
        creator_facebook_subscribe (int): Number of Facebook accounts subscribed to the account of the entity responsible for the Facebook post's publication.
        facebook_comment (int): Number of comments on the Facebook post.
        facebook_like (int): Number of Facebook accounts that have liked the Facebook post.
        facebook_share (int): Number of times the Facebook post has been shared on Facebook.
        domain (str): Domain for the Facebook post's URL. Default = "facebook.com".
        creator_type (str): Ontological subtype for the Facebook post's creator. Default = "defacto:SocialMediaAccount".
    """

    url: str
    work_type: str
    duration: str
    identifier: str
    date_published: str
    date_modified: str
    title: Optional[str]
    abstract: Optional[str]
    text: Optional[str]
    creator_identifier: str
    creator_name: str
    creator_location_created: Optional[str]
    creator_url: str
    creator_facebook_subscribe: int
    facebook_comment: int
    facebook_like: int
    facebook_share: int
    domain: str = "facebook.com"
    creator_type: str = "defacto:SocialMediaAccount"

    @classmethod
    def from_payload(
        cls,
        url: str,
        result: CrowdTanglePost,
    ) -> "NormalizedFacebookPost":
        """Parses minet's CrowdTangle result and creates normalized dataclass.

        Args:
            url (str): Target Facebook URL.
            result (CrowdTanglePost): Result object returned from minet's CrowdTangle API client.

        Returns:
            NormalizedFacebookPost: Dataclass that normalizes minet's CrowdTangle data.
        """
        work_type = "SocialMediaPosting"
        if hasattr(result, "type"):
            if result.type == "photo":
                work_type = "ImageObject"
            elif result.type == "video":
                work_type = "VideoObject"

        return NormalizedFacebookPost(
            url=url,
            work_type=work_type,
            duration=result.video_length_ms,
            identifier=result.id,
            date_published=result.date,
            date_modified=result.updated,
            title=result.title,
            abstract=result.description,
            text=result.message,
            creator_identifier=result.account.id,
            creator_facebook_subscribe=result.account.subscriber_count,
            creator_name=result.account.name,
            creator_location_created=result.account.page_admin_top_country,
            creator_url=result.account.url,
            facebook_comment=result.actual_comment_count,
            facebook_like=result.actual_like_count,
            facebook_share=result.actual_share_count,
        )
from_payload(url, result) classmethod

Parses minet's CrowdTangle result and creates normalized dataclass.

Parameters:

Name Type Description Default
url str

Target Facebook URL.

required
result CrowdTanglePost

Result object returned from minet's CrowdTangle API client.

required

Returns:

Name Type Description
NormalizedFacebookPost NormalizedFacebookPost

Dataclass that normalizes minet's CrowdTangle data.

Source code in minall/enrichment/crowdtangle/normalizer.py
@classmethod
def from_payload(
    cls,
    url: str,
    result: CrowdTanglePost,
) -> "NormalizedFacebookPost":
    """Parses minet's CrowdTangle result and creates normalized dataclass.

    Args:
        url (str): Target Facebook URL.
        result (CrowdTanglePost): Result object returned from minet's CrowdTangle API client.

    Returns:
        NormalizedFacebookPost: Dataclass that normalizes minet's CrowdTangle data.
    """
    work_type = "SocialMediaPosting"
    if hasattr(result, "type"):
        if result.type == "photo":
            work_type = "ImageObject"
        elif result.type == "video":
            work_type = "VideoObject"

    return NormalizedFacebookPost(
        url=url,
        work_type=work_type,
        duration=result.video_length_ms,
        identifier=result.id,
        date_published=result.date,
        date_modified=result.updated,
        title=result.title,
        abstract=result.description,
        text=result.message,
        creator_identifier=result.account.id,
        creator_facebook_subscribe=result.account.subscriber_count,
        creator_name=result.account.name,
        creator_location_created=result.account.page_admin_top_country,
        creator_url=result.account.url,
        facebook_comment=result.actual_comment_count,
        facebook_like=result.actual_like_count,
        facebook_share=result.actual_share_count,
    )

NormalizedSharedContent dataclass

Bases: TabularRecord

Dataclass for normalizing data about media content shared in a Facebook post.

Attributes:

Name Type Description
post_url str

Target Facebook URL, which shared the media content.

content_url str

CrowdTangle's URI for the shared media.

media_type str

Ontological subtype for the shared media, i.e. "ImageObject".

height int | None

If available, the height in pixels of the shared media.

width int | None

If available, the width in pixels of the shared media.

Source code in minall/enrichment/crowdtangle/normalizer.py
@dataclass
class NormalizedSharedContent(TabularRecord):
    """Dataclass for normalizing data about media content shared in a Facebook post.

    Attributes:
        post_url (str): Target Facebook URL, which shared the media content.
        content_url (str): CrowdTangle's URI for the shared media.
        media_type (str): Ontological subtype for the shared media, i.e. "ImageObject".
        height (int | None): If available, the height in pixels of the shared media.
        width (int | None): If available, the width in pixels of the shared media.
    """

    post_url: str
    content_url: str | None
    media_type: str
    height: int | None
    width: int | None

    @classmethod
    def parse_media_type(cls, type: str | None) -> str:
        """Helper function to transform CrowdTangle's media classification into Schema.org's CreativeWork subtype.

        Args:
            type (str | None): If available, CrowdTangle's classification of the media object.

        Returns:
            str: Schema.org's CreativeWork subtype.
        """
        if type == "photo":
            return "ImageObject"
        elif type == "video":
            return "VideoObject"
        else:
            return "MediaObject"

    @classmethod
    def from_payload(
        cls,
        url: str,
        media: dict,
    ) -> "NormalizedSharedContent":
        """Parses JSON data in CrowdTanglePost's "media" attribute.

        Args:
            url (str): URL of Facebook post that contains shared media.
            media (dict): JSON in CrowdTanglePost's "media" attribute.

        Returns:
            NormalizedSharedContent: Dataclass that normalizes information about Facebook post's shared media content.
        """
        return NormalizedSharedContent(
            post_url=url,
            content_url=media.get("url"),
            media_type=cls.parse_media_type(media.get("type")),
            height=media.get("height"),
            width=media.get("width"),
        )
from_payload(url, media) classmethod

Parses JSON data in CrowdTanglePost's "media" attribute.

Parameters:

Name Type Description Default
url str

URL of Facebook post that contains shared media.

required
media dict

JSON in CrowdTanglePost's "media" attribute.

required

Returns:

Name Type Description
NormalizedSharedContent NormalizedSharedContent

Dataclass that normalizes information about Facebook post's shared media content.

Source code in minall/enrichment/crowdtangle/normalizer.py
@classmethod
def from_payload(
    cls,
    url: str,
    media: dict,
) -> "NormalizedSharedContent":
    """Parses JSON data in CrowdTanglePost's "media" attribute.

    Args:
        url (str): URL of Facebook post that contains shared media.
        media (dict): JSON in CrowdTanglePost's "media" attribute.

    Returns:
        NormalizedSharedContent: Dataclass that normalizes information about Facebook post's shared media content.
    """
    return NormalizedSharedContent(
        post_url=url,
        content_url=media.get("url"),
        media_type=cls.parse_media_type(media.get("type")),
        height=media.get("height"),
        width=media.get("width"),
    )
parse_media_type(type) classmethod

Helper function to transform CrowdTangle's media classification into Schema.org's CreativeWork subtype.

Parameters:

Name Type Description Default
type str | None

If available, CrowdTangle's classification of the media object.

required

Returns:

Name Type Description
str str

Schema.org's CreativeWork subtype.

Source code in minall/enrichment/crowdtangle/normalizer.py
@classmethod
def parse_media_type(cls, type: str | None) -> str:
    """Helper function to transform CrowdTangle's media classification into Schema.org's CreativeWork subtype.

    Args:
        type (str | None): If available, CrowdTangle's classification of the media object.

    Returns:
        str: Schema.org's CreativeWork subtype.
    """
    if type == "photo":
        return "ImageObject"
    elif type == "video":
        return "VideoObject"
    else:
        return "MediaObject"

parse_facebook_post(url, result)

Transform minet's CrowdTanglePost object into normalized data as CSV dict row.

Parameters:

Name Type Description Default
url str

Target Facebook URL.

required
result CrowdTanglePost | None

If CrowdTangle API returned a match, minet's CrowdTangle API result object.

required

Returns:

Name Type Description
Dict Dict

Normalized data for Facebook post.

Source code in minall/enrichment/crowdtangle/normalizer.py
def parse_facebook_post(url: str, result: CrowdTanglePost | None) -> Dict:
    """Transform minet's CrowdTanglePost object into normalized data as CSV dict row.

    Args:
        url (str): Target Facebook URL.
        result (CrowdTanglePost | None): If CrowdTangle API returned a match, minet's CrowdTangle API result object.

    Returns:
        Dict: Normalized data for Facebook post.
    """
    if result:
        formatted_result = NormalizedFacebookPost.from_payload(url, result)
        return formatted_result.as_csv_dict_row()
    else:
        return {"url": url, "domain": "facebook.com", "work_type": "SocialMediaPosting"}

parse_shared_content(url, result)

Generator that streams the "media" attribute from minet's CrowdTanglePost object and returns normalized data as CSV dict row.

Parameters:

Name Type Description Default
url str

Target Facebook URL.

required
result CrowdTanglePost

minet's CrowdTangle API client result object.

required

Yields:

Type Description
Dict

Generator[Dict, None, None]: Formatted CSV dict row of normalized shared content data.

Source code in minall/enrichment/crowdtangle/normalizer.py
def parse_shared_content(
    url: str, result: CrowdTanglePost
) -> Generator[Dict, None, None]:
    """Generator that streams the "media" attribute from minet's CrowdTanglePost object and returns normalized data as CSV dict row.

    Args:
        url (str): Target Facebook URL.
        result (CrowdTanglePost): minet's CrowdTangle API client result object.

    Yields:
        Generator[Dict, None, None]: Formatted CSV dict row of normalized shared content data.
    """
    if result and isinstance(getattr(result, "media"), list):
        for media in result.media:
            formatted_result = NormalizedSharedContent.from_payload(
                url=url, media=media
            )
            yield formatted_result.as_csv_dict_row()

minall.enrichment.crowdtangle.client

Module contains a client and helper functions for collecting data from CrowdTangle.

CTClient

Wrapper for minet's CrowdTangle API client with helper function for parsing Facebook post ID.

Source code in minall/enrichment/crowdtangle/client.py
class CTClient:
    """Wrapper for minet's CrowdTangle API client with helper function for parsing Facebook post ID."""

    def __init__(self, token: str, rate_limit: int) -> None:
        """Create instance of minet's CrowdTangle API client.

        Args:
            token (str): CrowdTangle API token.
            rate_limit (int): CrowdTangle API rate limit.
        """
        self.client = CrowdTangleAPIClient(token=token, rate_limit=rate_limit)

    def __call__(self, url: str) -> Tuple[str, CrowdTanglePost | None]:
        """Execute collection of CrowdTangle data from parsed Facebook post ID.

        Args:
            url (str): Target Facebook URL.

        Returns:
            Tuple[str, CrowdTanglePost | None]: Target URL and, if successful, minet's CrowdTanglePost result object.
        """
        post_id = adhoc_post_id_parser(url)
        post = None
        if post_id:
            try:
                post = self.client.post(post_id=post_id)
            except Exception as e:
                logging.exception(e)
        return url, post
__call__(url)

Execute collection of CrowdTangle data from parsed Facebook post ID.

Parameters:

Name Type Description Default
url str

Target Facebook URL.

required

Returns:

Type Description
Tuple[str, CrowdTanglePost | None]

Tuple[str, CrowdTanglePost | None]: Target URL and, if successful, minet's CrowdTanglePost result object.

Source code in minall/enrichment/crowdtangle/client.py
def __call__(self, url: str) -> Tuple[str, CrowdTanglePost | None]:
    """Execute collection of CrowdTangle data from parsed Facebook post ID.

    Args:
        url (str): Target Facebook URL.

    Returns:
        Tuple[str, CrowdTanglePost | None]: Target URL and, if successful, minet's CrowdTanglePost result object.
    """
    post_id = adhoc_post_id_parser(url)
    post = None
    if post_id:
        try:
            post = self.client.post(post_id=post_id)
        except Exception as e:
            logging.exception(e)
    return url, post
__init__(token, rate_limit)

Create instance of minet's CrowdTangle API client.

Parameters:

Name Type Description Default
token str

CrowdTangle API token.

required
rate_limit int

CrowdTangle API rate limit.

required
Source code in minall/enrichment/crowdtangle/client.py
def __init__(self, token: str, rate_limit: int) -> None:
    """Create instance of minet's CrowdTangle API client.

    Args:
        token (str): CrowdTangle API token.
        rate_limit (int): CrowdTangle API rate limit.
    """
    self.client = CrowdTangleAPIClient(token=token, rate_limit=rate_limit)
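A minimal sketch of the client as a callable (token and URL hypothetical; the call performs a network request):

client = CTClient(token="CT_API_TOKEN", rate_limit=10)
url, post = client("https://www.facebook.com/SomePage/posts/123456789")
# On success, `post` is minet's CrowdTanglePost object; otherwise it is None.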

adhoc_post_id_parser(url)

Helper function to catch and fix problems parsing Facebook post ID.

Parameters:

Name Type Description Default
url str

Target Facebook URL.

required

Returns:

Type Description
str | None

str | None: If successful, post ID for target Facebook URL.

Source code in minall/enrichment/crowdtangle/client.py
def adhoc_post_id_parser(url: str) -> str | None:
    """Helper function to catch and fix problems parsing Facebook post ID.

    Args:
        url (str): Target Facebook URL.

    Returns:
        str | None: If successful, post ID for target Facebook URL.
    """
    post_id = post_id_from_url(url)
    if post_id:
        return post_id
    else:
        parsed_url = parse_facebook_url(url)
        if parsed_url:
            if hasattr(parsed_url, "id"):
                post_id = getattr(parsed_url, "id")
            else:
                return
            if hasattr(parsed_url, "parent_handle"):
                parent_id = getattr(parsed_url, "parent_handle")
            elif hasattr(parsed_url, "parent_id"):
                parent_id = getattr(parsed_url, "parent_id")
            else:
                return
            if post_id and parent_id:
                try:
                    int(post_id)
                except Exception:
                    return
                try:
                    int(parent_id)
                except Exception:
                    return
                return post_id + "_" + parent_id

parse_rate_limit(rate_limit)

Set default or convert rate limit string to integer.

Parameters:

Name Type Description Default
rate_limit int | str | None

Value of rate limit for CrowdTangle API.

required

Returns:

Name Type Description
int int

Converted rate limit integer for CrowdTangle API.

Source code in minall/enrichment/crowdtangle/client.py
def parse_rate_limit(rate_limit: int | str | None) -> int:
    """Set default or convert rate limit string to integer.

    Args:
        rate_limit (int | str | None): Value of rate limit for CrowdTangle API.

    Returns:
        int: Converted rate limit integer for CrowdTangle API.
    """
    if not rate_limit:
        rate_limit = 10
    elif isinstance(rate_limit, str):
        rate_limit = int(rate_limit)
    return rate_limit
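Its behavior follows directly from the source above:

>>> parse_rate_limit(None)
10
>>> parse_rate_limit("25")
25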

minall.enrichment.crowdtangle.get_data

Module contains functions for collecting, normalizing, and writing data from CrowdTangle.

get_facebook_post_data(data, token, rate_limit, links_outfile, shared_content_outfile)

Function to collect, normalize, and write data from CrowdTangle.

Parameters:

Name Type Description Default
data List[str]

Set of target Facebook URLs.

required
token str

CrowdTangle API token.

required
rate_limit int | str | None

CrowdTangle API rate limit.

required
links_outfile Path

Path to CSV file for Facebook post metadata.

required
shared_content_outfile Path

Path to CSV for shared content metadata.

required
Source code in minall/enrichment/crowdtangle/get_data.py
def get_facebook_post_data(
    data: List[str],
    token: str,
    rate_limit: int | str | None,
    links_outfile: Path,
    shared_content_outfile: Path,
):
    """Function to collect, normalize, and write data from CrowdTangle.

    Args:
        data (List[str]): Set of target Facebook URLs.
        token (str): CrowdTangle API token.
        rate_limit (int | str | None): CrowdTangle API rate limit.
        links_outfile (Path): Path to CSV file for Facebook post metadata.
        shared_content_outfile (Path): Path to CSV for shared content metadata.
    """

    rate_limit = parse_rate_limit(rate_limit)

    with ContextManager(links_outfile, shared_content_outfile) as contexts:
        links_writer, shared_content_writer, progress = contexts

        t = progress.add_task(
            description="[bold blue]Querying Facebook posts", total=len(data)
        )
        for url, response in yield_facebook_data(
            data=data, token=token, rate_limit=rate_limit
        ):
            progress.advance(t)
            formatted_post = parse_facebook_post(url=url, result=response)
            links_writer.writerow(formatted_post)
            for formatted_media in parse_shared_content(url=url, result=response):
                shared_content_writer.writerow(formatted_media)
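An untested sketch of a full collection run (token, URL, and paths hypothetical):

from pathlib import Path

get_facebook_post_data(
    data=["https://www.facebook.com/SomePage/posts/123456789"],
    token="CT_API_TOKEN",
    rate_limit=10,
    links_outfile=Path("links.csv"),
    shared_content_outfile=Path("shared_content.csv"),
)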

yield_facebook_data(data, token, rate_limit)

Streams target Facebook URLs to multi-threading context and yields minet's results.

Parameters:

Name Type Description Default
data List[str]

Set of target Facebook URLs.

required
token str

CrowdTangle API token.

required
rate_limit int

CrowdTangle API rate limit.

required

Yields:

Type Description
str

Generator[Tuple[str, CrowdTanglePost | None], None, None]: Target Facebook URL and, if available, result of minet's CrowdTangle API client.

Source code in minall/enrichment/crowdtangle/get_data.py
def yield_facebook_data(
    data: List[str], token: str, rate_limit: int
) -> Generator[Tuple[str, Any], None, None]:
    """Streams target Facebook URLs to multi-threading context and yields minet's results.

    Args:
        data (List[str]): Set of target Facebook URLs.
        token (str): CrowdTangle API token.
        rate_limit (int): CrowdTangle API rate limit.

    Yields:
        Generator[Tuple[str, CrowdTanglePost | None], None, None]: Target Facebook URL and, if available, result of minet's CrowdTangle API client.
    """
    client = CTClient(token=token, rate_limit=rate_limit)
    with ThreadPoolExecutor() as executor:
        for url, response in executor.map(client, data):
            yield url, response
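A sketch of consuming the generator (token and URL hypothetical); each iteration yields the target URL and either a CrowdTanglePost or None:

for url, post in yield_facebook_data(
    data=["https://www.facebook.com/SomePage/posts/123456789"],
    token="CT_API_TOKEN",
    rate_limit=10,
):
    print(url, post is not None)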

minall.enrichment.crowdtangle.contexts

Context manager for CrowdTangle's CSV writers, multi-threader, and progress bar.

ContextManager

Source code in minall/enrichment/crowdtangle/contexts.py
class ContextManager:
    def __init__(self, links_file: Path, shared_content_file: Path):
        """Set up class for scraper's contexts.

        Args:
            links_file (Path): Path to CSV file for post metadata.
            shared_content_file (Path): Path to CSV file for posts' shared content metadata.
        """
        self.links_file = links_file
        self.shared_content_file = shared_content_file

    def __enter__(self) -> Tuple[csv.DictWriter, csv.DictWriter, Progress]:
        """Start the module's context variables.

        Returns:
            Tuple[csv.DictWriter, csv.DictWriter, Progress]: CSV writer for post metadata, CSV writer for shared content metadata, rich progress bar.
        """
        # Set up links file writer
        self.links_file_obj = open(self.links_file, mode="w")
        self.links_file_writer = csv.DictWriter(
            self.links_file_obj, fieldnames=LinksConstants.col_names
        )
        self.links_file_writer.writeheader()

        # Set up shared_content file writer
        self.shared_content_obj = open(self.shared_content_file, mode="w")
        self.shared_content_writer = csv.DictWriter(
            self.shared_content_obj, fieldnames=ShareContentConstants.col_names
        )
        self.shared_content_writer.writeheader()

        # Set up progress bar
        self.progress_bar = Progress(
            TextColumn("[progress.description]{task.description}"),
            SpinnerColumn(),
            MofNCompleteColumn(),
            TimeElapsedColumn(),
        )
        self.progress_bar.start()

        return (
            self.links_file_writer,
            self.shared_content_writer,
            self.progress_bar,
        )

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Stop the scraper's context variables."""
        if self.shared_content_obj:
            self.shared_content_obj.close()
        if self.links_file_obj:
            self.links_file_obj.close()
        self.progress_bar.stop()
__enter__()

Start the module's context variables.

Returns:

Type Description
Tuple[DictWriter, DictWriter, Progress]

Tuple[csv.DictWriter, csv.DictWriter, Progress]: CSV writer for post metadata, CSV writer for shared content metadata, rich progress bar.

Source code in minall/enrichment/crowdtangle/contexts.py
def __enter__(self) -> Tuple[csv.DictWriter, csv.DictWriter, Progress]:
    """Start the module's context variables.

    Returns:
        Tuple[csv.DictWriter, csv.DictWriter, Progress]: CSV writer for post metadata, CSV writer for shared content metadata, rich progress bar.
    """
    # Set up links file writer
    self.links_file_obj = open(self.links_file, mode="w")
    self.links_file_writer = csv.DictWriter(
        self.links_file_obj, fieldnames=LinksConstants.col_names
    )
    self.links_file_writer.writeheader()

    # Set up shared_content file writer
    self.shared_content_obj = open(self.shared_content_file, mode="w")
    self.shared_content_writer = csv.DictWriter(
        self.shared_content_obj, fieldnames=ShareContentConstants.col_names
    )
    self.shared_content_writer.writeheader()

    # Set up progress bar
    self.progress_bar = Progress(
        TextColumn("[progress.description]{task.description}"),
        SpinnerColumn(),
        MofNCompleteColumn(),
        TimeElapsedColumn(),
    )
    self.progress_bar.start()

    return (
        self.links_file_writer,
        self.shared_content_writer,
        self.progress_bar,
    )
__exit__(exc_type, exc_val, exc_tb)

Stop the scraper's context variables.

Source code in minall/enrichment/crowdtangle/contexts.py
def __exit__(self, exc_type, exc_val, exc_tb):
    """Stop the scraper's context variables."""
    if self.shared_content_obj:
        self.shared_content_obj.close()
    if self.links_file_obj:
        self.links_file_obj.close()
    self.progress_bar.stop()
__init__(links_file, shared_content_file)

Set up class for scraper's contexts.

Parameters:

Name Type Description Default
links_file Path

Path to CSV file for post metadata.

required
shared_content_file Path

Path to CSV file for posts' shared content metadata.

required
Source code in minall/enrichment/crowdtangle/contexts.py
def __init__(self, links_file: Path, shared_content_file: Path):
    """Set up class for scraper's contexts.

    Args:
        links_file (Path): Path to CSV file for post metadata.
        shared_content_file (Path): Path to CSV file for posts' shared content metadata.
    """
    self.links_file = links_file
    self.shared_content_file = shared_content_file
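A sketch of how the three context variables are consumed (paths hypothetical):

from pathlib import Path

with ContextManager(Path("links.csv"), Path("shared_content.csv")) as contexts:
    links_writer, shared_content_writer, progress = contexts
    # Both csv.DictWriter instances have their headers already written;
    # `progress` is a rich Progress bar that has already been started.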

minall.enrichment.crowdtangle.exceptions

Exceptions raised during data collection from CrowdTangle API.

This module contains exceptions raised during data collection from CrowdTangle API. The module contains the following exceptions:

  • NoPostID - Neither minet nor minall's adhoc parser successfully recovered the Facebook post's ID.
  • PostNotfound - CrowdTangle did not return a post matching the given post ID.

minall.enrichment.youtube

Enrichment workflow's YouTube data collection.

Modules exported by this package:

  • normalizer: Dataclass to normalize minet's YouTube result objects.
  • context: Context manager for client's CSV writer and progress bar.
  • get_data: Function that runs all of the YouTube enrichment process.

minall.enrichment.youtube.normalizer

Module contains dataclasses to normalize minet's YouTube Video and Channel result objects.

NormalizedYouTubeChannel dataclass

Bases: TabularRecord

Dataclass to normalize minet's YoutubeChannel result object.

Attributes:

Name Type Description
url str

Target YouTube channel URL.

identifier str

YouTube's unique identifier for the channel.

date_published str

Date when the channel was created.

country_of_origin str

Primary country in which the channel publishes content.

title str

Name of the channel.

abstract str

Description of the channel.

keywords List[str]

List of keywords attributed to the channel.

youtube_subscribe int

Number of YouTube users who subscribe to the channel.

create_video int

Number of videos the channel has published.

domain str

Domain of target URL. Default = "youtube.com".

work_type str

Ontological subtype of target web content. Default = "WebPage".

Source code in minall/enrichment/youtube/normalizer.py
@dataclass
class NormalizedYouTubeChannel(TabularRecord):
    """Dataclass to normalize minet's YoutubeChannel result object.

    Attributes:
        url (str): Target YouTube channel URL.
        identifier (str): YouTube's unique identifier for the channel.
        date_published (str): Date when the channel was created.
        country_of_origin (str): Primary country in which the channel publishes content.
        title (str): Name of the channel.
        abstract (str): Description of the channel.
        keywords (List[str]): List of keywords attributed to the channel.
        youtube_subscribe (int): Number of YouTube users who subscribe to the channel.
        create_video (int): Number of videos the channel has published.
        domain (str): Domain of target URL. Default = "youtube.com".
        work_type (str): Ontological subtype of target web content. Default = "WebPage".
    """

    url: str
    identifier: str
    date_published: str
    country_of_origin: str
    title: str
    abstract: str
    keywords: List[str]
    youtube_subscribe: int
    create_video: int
    domain: str = "youtube.com"
    work_type: str = "WebPage"

    @classmethod
    def from_payload(
        cls,
        url: str,
        channel_result: MinetYouTubeChannelResult,
    ) -> "NormalizedYouTubeChannel":
        """Parses minet's channel result and creates a normalized dataclass.

        Args:
            url (str): Target YouTube channel URL.
            channel_result (MinetYouTubeChannelResult): minet's channel results for the target channel URL.

        Returns:
            NormalizedYouTubeChannel: Dataclass that normalizes minet's channel results.
        """
        return NormalizedYouTubeChannel(
            url=url,
            domain="youtube.com",
            work_type="WebPage",
            identifier=channel_result.channel_id,
            date_published=channel_result.published_at,
            country_of_origin=channel_result.country,
            title=channel_result.title,
            abstract=channel_result.description,
            keywords=channel_result.keywords,
            youtube_subscribe=channel_result.subscriber_count,
            create_video=channel_result.video_count,
        )
from_payload(url, channel_result) classmethod

Parses minet's channel result and creates a normalized dataclass.

Parameters:

Name Type Description Default
url str

Target YouTube channel URL.

required
channel_result YouTubeChannel

minet's channel results for the target channel URL.

required

Returns:

Name Type Description
NormalizedYouTubeChannel NormalizedYouTubeChannel

Dataclass that normalizes minet's channel results.

Source code in minall/enrichment/youtube/normalizer.py
@classmethod
def from_payload(
    cls,
    url: str,
    channel_result: MinetYouTubeChannelResult,
) -> "NormalizedYouTubeChannel":
    """Parses minet's channel result and creates a normalized dataclass.

    Args:
        url (str): Target YouTube channel URL.
        channel_result (MinetYouTubeChannelResult): minet's channel results for the target channel URL.

    Returns:
        NormalizedYouTubeChannel: Dataclass that normalizes minet's channel results.
    """
    return NormalizedYouTubeChannel(
        url=url,
        domain="youtube.com",
        work_type="WebPage",
        identifier=channel_result.channel_id,
        date_published=channel_result.published_at,
        country_of_origin=channel_result.country,
        title=channel_result.title,
        abstract=channel_result.description,
        keywords=channel_result.keywords,
        youtube_subscribe=channel_result.subscriber_count,
        create_video=channel_result.video_count,
    )

NormalizedYouTubeVideo dataclass

Bases: TabularRecord

Dataclass to normalize minet's YoutubeVideo result object.

Attributes:

Name Type Description
url str

Target YouTube video URL.

identifier str

YouTube's unique identifier for the video.

date_published str

Date the video was published on YouTube.

duration str

Duration of the video.

title str

Title of the video.

abstract str

Video's description.

keywords List

List of keywords applied to video.

youtube_watch str

Number of users who have watched the YouTube video.

youtube_comment str

Number of users who have commented on the YouTube video.

youtube_like str

Number of users who have liked the YouTube video.

creator_type str

Ontological subtype of the video's channel.

creator_name str

Name of the video's channel.

creator_date_created str

Date when the video's channel was created.

creator_location_created str

Primary country in which the video's channel publishes content.

creator_identifier str

YouTube's unique identifier for the video's channel.

creator_youtube_subscribe str

Number of YouTube accounts that subscribe to the video's channel.

creator_create_video str

Number of videos the video's channel has published.

domain str

Domain of target URL. Default = "youtube.com".

work_type str

Ontological subtype of target web content. Default = "VideoObject".

Source code in minall/enrichment/youtube/normalizer.py
@dataclass
class NormalizedYouTubeVideo(TabularRecord):
    """Dataclass to normalize minet's YoutubeVideo result object.

    Attributes:
        url (str): Target YouTube video URL.
        identifier (str): YouTube's unique identifier for the video.
        date_published (str): Date the video was published on YouTube.
        duration (str): Duration of the video.
        title (str): Title of the video.
        abstract (str): Video's description.
        keywords (List): List of keywords applied to video.
        youtube_watch (str): Number of users who have watched the YouTube video.
        youtube_comment (str): Number of users who have commented on the YouTube video.
        youtube_like (str): Number of users who have liked the YouTube video.
        creator_type (str): Ontological subtype of the video's channel.
        creator_name (str): Name of the video's channel.
        creator_date_created (str): Date when the video's channel was created.
        creator_location_created (str): Primary country in which the video's channel publishes content.
        creator_identifier (str): YouTube's unique identifier for the video's channel.
        creator_youtube_subscribe (str): Number of YouTube accounts that subscribe to the video's channel.
        creator_create_video (str): Number of videos the video's channel has published.
        domain (str): Domain of target URL. Default = "youtube.com".
        work_type (str): Ontological subtype of target web content. Default = "VideoObject".
    """

    url: str
    identifier: str
    date_published: str
    duration: str
    title: str
    abstract: str
    keywords: List[str]
    youtube_watch: str
    youtube_comment: str
    youtube_like: str
    # youtube_favorite was deprecated by YouTube in 2015.
    creator_type: str
    creator_name: str
    creator_date_created: str
    creator_location_created: str
    creator_identifier: str
    creator_youtube_subscribe: str
    creator_create_video: str
    domain: str = "youtube.com"
    work_type: str = "VideoObject"

    @classmethod
    def from_payload(
        cls,
        url: str,
        channel_result: MinetYouTubeChannelResult | None,
        video_result: MinetYouTubeVideoResult,
    ) -> "NormalizedYouTubeVideo":
        """Parses minet's data for both a video and channel and creates a normalized dataclass.

        Args:
            url (str): Target YouTube video URL.
            channel_result (MinetYouTubeChannelResult | None): minet's channel results containing metadata about the target video's channel.
            video_result (MinetYouTubeVideoResult): minet's video results for the target video.

        Returns:
            NormalizedYouTubeVideo: Dataclass that normalizes and merges video and channel results.
        """
        if channel_result:
            channel = channel_result.as_csv_dict_row()
        else:
            channel = {}
        return NormalizedYouTubeVideo(
            url=url,
            domain="youtube.com",
            work_type="VideoObject",
            identifier=video_result.video_id,
            date_published=video_result.published_at,
            duration=video_result.duration,
            title=video_result.title,
            abstract=video_result.description,
            keywords=channel.get("keywords"),  # type: ignore
            youtube_watch=video_result.view_count,  # type: ignore
            youtube_comment=video_result.comment_count,  # type: ignore
            youtube_like=video_result.like_count,  # type: ignore
            creator_type="WebPage",
            creator_name=video_result.channel_title,
            creator_identifier=video_result.channel_id,
            creator_date_created=channel.get("published_at"),  # type: ignore
            creator_location_created=channel.get("country"),  # type: ignore
            creator_youtube_subscribe=channel.get("subscriber_count"),  # type: ignore
            creator_create_video=channel.get("video_count"),  # type: ignore
        )
from_payload(url, channel_result, video_result) classmethod

Parses minet's data for both a video and channel and creates a normalized dataclass.

Parameters:

Name Type Description Default
url str

Target YouTube video URL.

required
channel_result YouTubeChannel | None

minet's channel results containing metadata about the target video's channel.

required
video_result YouTubeVideo

minet's video results for the target video.

required

Returns:

Name Type Description
NormalizedYouTubeVideo NormalizedYouTubeVideo

Dataclass that normalizes and merges video and channel results.

Source code in minall/enrichment/youtube/normalizer.py
@classmethod
def from_payload(
    cls,
    url: str,
    channel_result: MinetYouTubeChannelResult | None,
    video_result: MinetYouTubeVideoResult,
) -> "NormalizedYouTubeVideo":
    """Parses minet's data for both a video and channel and creates a normalized dataclass.

    Args:
        url (str): Target YouTube video URL.
        channel_result (MinetYouTubeChannelResult | None): minet's channel results containing metadata about the target video's channel.
        video_result (MinetYouTubeVideoResult): minet's video results for the target video.

    Returns:
        NormalizedYouTubeVideo: Dataclass that normalizes and merges video and channel results.
    """
    if channel_result:
        channel = channel_result.as_csv_dict_row()
    else:
        channel = {}
    return NormalizedYouTubeVideo(
        url=url,
        domain="youtube.com",
        work_type="VideoObject",
        identifier=video_result.video_id,
        date_published=video_result.published_at,
        duration=video_result.duration,
        title=video_result.title,
        abstract=video_result.description,
        keywords=channel.get("keywords"),  # type: ignore
        youtube_watch=video_result.view_count,  # type: ignore
        youtube_comment=video_result.comment_count,  # type: ignore
        youtube_like=video_result.like_count,  # type: ignore
        creator_type="WebPage",
        creator_name=video_result.channel_title,
        creator_identifier=video_result.channel_id,
        creator_date_created=channel.get("published_at"),  # type: ignore
        creator_location_created=channel.get("country"),  # type: ignore
        creator_youtube_subscribe=channel.get("subscriber_count"),  # type: ignore
        creator_create_video=channel.get("video_count"),  # type: ignore
    )

ParsedLink

Class to store up-to-date metadata about target YouTube URL.

This class's instance variables will be updated during the data collection process to reflect the target URL's video and/or channel metadata. If the target URL is of a video, its ParsedLink class instance should eventually be mutated to have a value in both the video_result and channel_result attributes because 2 API calls will be made. If the target URL is of a channel, its ParsedLink class instance should eventually be mutated to have a value in the channel_result attribute, after 1 call to the YouTube API's channels endpoint.

Source code in minall/enrichment/youtube/normalizer.py
class ParsedLink:
    """Class to store up-to-date metadata about target YouTube URL.

    This class's instance variables will be updated during the data collection process to reflect the target URL's video and/or channel metadata. If the target URL is of a video, its `ParsedLink` class instance should eventually be mutated to have a value in both the `video_result` and `channel_result` attributes because 2 API calls will be made. If the target URL is of a channel, its `ParsedLink` class instance should eventually be mutated to have a value in the `channel_result` attribute, after 1 call to the YouTube API's channels endpoint.
    """

    def __init__(self, url: str) -> None:
        """Determine type of YouTube web content.

        Args:
            url (str): Target YouTube URL.

        Attributes:
            link_id (str): Target YouTube URL.
            type (YoutubeChannel | YoutubeVideo | Any): Result of ural's `parse_youtube_url()` function.
            video_id (str | None): If a parsed YouTube type is a video, the result's `id` attribute.
            channel_id (str | None): If a parsed YouTube type is a channel, the result's `id` attribute.
            video_result (None): Empty class instance variable for later storing minet's video result object.
            channel_result (None): Empty class instance variable for later storing minet's channel result object.
        """
        self.link_id = url

        self.type = parse_youtube_url(url)

        if isinstance(self.type, YoutubeVideo) or isinstance(self.type, YoutubeShort):
            self.video_id = getattr(self.type, "id")
            self.channel_id = None

        elif isinstance(self.type, YoutubeChannel):
            self.channel_id = getattr(self.type, "id")
            self.video_id = None

        else:
            print(self.type)

        self.video_result = None
        self.channel_result = None
__init__(url)

Determine type of YouTube web content.

Parameters:

Name Type Description Default
url str

Target YouTube URL.

required

Attributes:

Name Type Description
link_id str

Target YouTube URL.

type YoutubeChannel | YoutubeVideo | Any

Result of ural's parse_youtube_url() function.

video_id str | None

If a parsed YouTube type is a video, the result's id attribute.

channel_id str | None

If a parsed YouTube type is a channel, the result's id attribute.

video_result None

Empty class instance variable for later storing minet's video result object.

channel_result None

Empty class instance variable for later storing minet's channel result object.

Source code in minall/enrichment/youtube/normalizer.py
def __init__(self, url: str) -> None:
    """Determine type of YouTube web content.

    Args:
        url (str): Target YouTube URL.

    Attributes:
        link_id (str): Target YouTube URL.
        type (YoutubeChannel | YoutubeVideo | Any): Result of ural's `parse_youtube_url()` function.
        video_id (str | None): If a parsed YouTube type is a video, the result's `id` attribute.
        channel_id (str | None): If a parsed YouTube type is a channel, the result's `id` attribute.
        video_result (None): Empty class instance variable for later storing minet's video result object.
        channel_result (None): Empty class instance variable for later storing minet's channel result object.
    """
    self.link_id = url

    self.type = parse_youtube_url(url)

    if isinstance(self.type, YoutubeVideo) or isinstance(self.type, YoutubeShort):
        self.video_id = getattr(self.type, "id")
        self.channel_id = None

    elif isinstance(self.type, YoutubeChannel):
        self.channel_id = getattr(self.type, "id")
        self.video_id = None

    else:
        print(self.type)

    self.video_result = None
    self.channel_result = None
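To illustrate the parsing step (video ID illustrative), a video URL populates video_id while channel_id stays empty until the API is queried:

>>> pl = ParsedLink("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
>>> pl.video_id
'dQw4w9WgXcQ'
>>> pl.channel_id is None
True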

normalize(parsed_link)

Normalize minet result objects stored in instance variables of the ParsedLink class.

Parameters:

Name Type Description Default
parsed_link ParsedLink

Class instance with minet's YouTube API results.

required

Returns:

Name Type Description
dict dict

Dictionary to be added to CSV row for 'links' SQL table.

Source code in minall/enrichment/youtube/normalizer.py
def normalize(parsed_link: ParsedLink) -> dict:
    """Normalize minet result objects stored in instance variables of the `ParsedLink` class.

    Args:
        parsed_link (ParsedLink): Class instance with minet's YouTube API results.

    Returns:
        dict: Dictionary to be added to CSV row for 'links' SQL table.
    """

    url = parsed_link.link_id
    if isinstance(parsed_link.video_result, MinetYouTubeVideoResult):
        data = NormalizedYouTubeVideo.from_payload(
            url=url,
            channel_result=parsed_link.channel_result,
            video_result=parsed_link.video_result,
        )
        return data.as_csv_dict_row()
    elif isinstance(parsed_link.channel_result, MinetYouTubeChannelResult):
        data = NormalizedYouTubeChannel.from_payload(
            url=url, channel_result=parsed_link.channel_result
        )
        return data.as_csv_dict_row()
    else:
        return {"url": url}

minall.enrichment.youtube.get_data

Module contains function to manage process of collecting and normalizing data about YouTube web content.

get_youtube_data(data, keys, outfile)

Collects and writes metadata about target YouTube videos and channels to a CSV file that will be inserted into 'links' SQL table.

Parameters:

Name Type Description Default
data list[str]

Set of target YouTube URLs.

required
keys list[str]

Set of keys for YouTube API.

required
outfile Path

Path to CSV file for 'links' SQL table.

required
Source code in minall/enrichment/youtube/get_data.py
def get_youtube_data(data: list[str], keys: list[str], outfile: Path) -> None:
    """Collects and writes metadata about target YouTube videos and channels to a CSV file that will be inserted into 'links' SQL table.

    Args:
        data (list[str]): Set of target YouTube URLs.
        keys (list[str]): Set of keys for YouTube API.
        outfile (Path): Path to CSV file for 'links' SQL table.
    """
    # Sort the URLs into channels and videos
    parsed_links = [ParsedLink(url) for url in data]
    n_videos = len(
        [
            i
            for i in parsed_links
            if isinstance(i.type, YoutubeVideo) or isinstance(i.type, YoutubeShort)
        ]
    )

    client = YouTubeAPIClient(key=keys)

    # Mutate the parsed_links array by adding video data
    with ProgressBar() as progress:
        t = progress.add_task(
            description="[bold red]Querying YouTube videos", total=n_videos
        )
        for pl in parsed_links:
            if isinstance(pl.type, YoutubeVideo) or isinstance(pl.type, YoutubeShort):
                for _, result in client.videos(videos=[pl.video_id]):
                    setattr(pl, "video_result", result)
                    setattr(pl, "channel_id", getattr(result, "channel_id") if result is not None else None)
                    progress.advance(t)

    # Get a unique set of channels from video and channel data
    channel_set = set()
    for pl in parsed_links:
        if pl.channel_id:
            channel_set.add("https://www.youtube.com/channel/" + pl.channel_id)

    # Create an index of unique channels and their collected metadata
    channel_index = {}
    with ProgressBar() as progress:
        t = progress.add_task(
            description="[bold red]Querying YouTube channels", total=len(channel_set)
        )
        for channel_url, result in client.channels(channels_target=channel_set):
            channel_id = channel_url.split("/")[-1]
            channel_index.update({channel_id: result})
            progress.advance(t)

    # Again mutate the parsed_links array by adding channel data
    for pl in parsed_links:
        if pl.channel_id:
            pl.channel_result = channel_index.get(pl.channel_id)

    with Writer(links_file=outfile) as writer:
        for pl in parsed_links:
            normalized_result = normalize(pl)
            writer.writerow(normalized_result)
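An untested sketch of a full run (API key, URL, and path hypothetical):

from pathlib import Path

get_youtube_data(
    data=["https://www.youtube.com/watch?v=dQw4w9WgXcQ"],
    keys=["YOUTUBE_API_KEY"],
    outfile=Path("links.csv"),
)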

minall.enrichment.youtube.context

Module containing contexts for YouTube data collection's CSV writer and progress bar.

ProgressBar

Context for rich progress bar.

Source code in minall/enrichment/youtube/context.py
class ProgressBar:
    """Context for rich progress bar."""

    def __init__(self) -> None:
        pass

    def __enter__(self) -> Progress:
        """Start the rich progress bar.

        Returns:
            Progress: Context variable for rich progress bar.
        """

        self.progress_bar = Progress(
            TextColumn("[progress.description]{task.description}"),
            SpinnerColumn(),
            MofNCompleteColumn(),
            TimeElapsedColumn(),
        )
        self.progress_bar.start()
        return self.progress_bar

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Stops the progress bar's context variable."""
        self.progress_bar.stop()
__enter__()

Start the rich progress bar.

Returns:

Name Type Description
Progress Progress

Context variable for rich progress bar.

Source code in minall/enrichment/youtube/context.py
def __enter__(self) -> Progress:
    """Start the rich progress bar.

    Returns:
        Progress: Context variable for rich progress bar.
    """

    self.progress_bar = Progress(
        TextColumn("[progress.description]{task.description}"),
        SpinnerColumn(),
        MofNCompleteColumn(),
        TimeElapsedColumn(),
    )
    self.progress_bar.start()
    return self.progress_bar
__exit__(exc_type, exc_val, exc_tb)

Stops the progress bar's context variable.

Source code in minall/enrichment/youtube/context.py
def __exit__(self, exc_type, exc_val, exc_tb):
    """Stops the progress bar's context variable."""
    self.progress_bar.stop()

Writer

Context for writing YouTube links metadata to a CSV file for the 'links' SQL table.

Source code in minall/enrichment/youtube/context.py
class Writer:
    """Context for writing YouTube links metadata to CSV of 'links' SQL table."""

    def __init__(self, links_file: Path):
        """Set up class for iteratively writing normalized YouTube results to CSV.

        Args:
            links_file (Path): Path to the links table CSV file.
        """
        self.links_file = links_file

    def __enter__(self) -> csv.DictWriter:
        """Start the CSV writer's context.

        Returns:
            csv.DictWriter: Context variable for writing CSV rows.
        """

        self.links_file_obj = open(self.links_file, mode="w")
        self.links_file_writer = csv.DictWriter(
            self.links_file_obj, fieldnames=LinksConstants.col_names
        )
        self.links_file_writer.writeheader()

        return self.links_file_writer

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Stops the writer's context variable."""
        if self.links_file_obj:
            self.links_file_obj.close()
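
A minimal usage sketch, assuming "url" is among LinksConstants.col_names (the file path and URL are hypothetical):

from pathlib import Path

from minall.enrichment.youtube.context import Writer

with Writer(links_file=Path("links.csv")) as writer:
    writer.writerow({"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"})
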
__enter__()

Start the CSV writer's context.

Returns:

Type Description
DictWriter

csv.DictWriter: Context variable for writing CSV rows.

Source code in minall/enrichment/youtube/context.py
def __enter__(self) -> csv.DictWriter:
    """Start the CSV writer's context.

    Returns:
        csv.DictWriter: Context variable for writing CSV rows.
    """

    self.links_file_obj = open(self.links_file, mode="w")
    self.links_file_writer = csv.DictWriter(
        self.links_file_obj, fieldnames=LinksConstants.col_names
    )
    self.links_file_writer.writeheader()

    return self.links_file_writer
__exit__(exc_type, exc_val, exc_tb)

Stops the writer's context variable.

Source code in minall/enrichment/youtube/context.py
def __exit__(self, exc_type, exc_val, exc_tb):
    """Stops the writer's context variable."""
    if self.links_file_obj:
        self.links_file_obj.close()
__init__(links_file)

Set up class for iteratively writing normalized YouTube results to CSV.

Parameters:

Name Type Description Default
links_file Path

Path to the links table CSV file.

required
Source code in minall/enrichment/youtube/context.py
def __init__(self, links_file: Path):
    """Set up class for iteratively writing normalized YouTube results to CSV.

    Args:
        links_file (Path): Path to the links table CSV file.
    """
    self.links_file = links_file

minall.enrichment.other_social_media

Module to update ontological subtype for social media posts whose data is not accessible.

minall.enrichment.other_social_media.add_data

Module containing a function that writes web content's ontological subtype to CSV.

The module contains a function that writes the ontological subtype "SocialMediaPosting" and the related target URL to a CSV file, whose rows will be inserted into the 'links' SQL table.

add_data(data, outfile)

For the set of target URLs, write the URL and the category "SocialMediaPosting" to a CSV row for insertion into the 'links' SQL table.

Parameters:

Name Type Description Default
data List[str]

Target URLs.

required
outfile Path

Path to CSV file for links.

required
Source code in minall/enrichment/other_social_media/add_data.py
def add_data(data: List[str], outfile: Path):
    """For the set of target URLs, write the URL and the category "SocialMediaPosting" to a CSV row for insert into the 'links' SQL table.

    Args:
        data (List[str]): Target URLs.
        outfile (Path): Path to CSV file for links.
    """
    with open(outfile, "w") as f:
        writer = csv.DictWriter(f, fieldnames=LinksConstants.col_names)
        writer.writeheader()
        for url in data:
            writer.writerow({"url": url, "work_type": "SocialMediaPosting"})
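
A minimal usage sketch (the URL and out-file path are hypothetical):

from pathlib import Path

from minall.enrichment.other_social_media.add_data import add_data

add_data(
    data=["https://www.tiktok.com/@account/video/123"],
    outfile=Path("links.csv"),
)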

minall.enrichment.article_text

Enrichment workflow's HTML scraping features.

Modules exported by this package:

  • normalizer: Dataclass to normalize minet's Trafilatura result object.
  • contexts: Context manager for scraper's CSV writers, multi-threader, and progress bar.
  • get_data: Function that runs the scraping process.
  • scraper: Class and helper function for scraping HTML.

minall.enrichment.article_text.normalizer

Dataclass to normalize minet's Trafilatura result object.

NormalizedScrapedWebPage dataclass

Bases: TabularRecord

Dataclass to normalize minet's Trafilatura result object.

Attributes:

Name Type Description
url str

URL targeted for scraping.

title str | None

Title scraped from HTML.

text str | None

Main text scraped from HTML.

date_published str | None

Date scraped from HTML.

work_type str

Target URL's ontological subtype. Default = "WebPage".

Source code in minall/enrichment/article_text/normalizer.py
@dataclass
class NormalizedScrapedWebPage(TabularRecord):
    """Dataclass to normlalize minet's Trafilatura result object.

    Attributes:
        url (str): URL targeted for scraping.
        title (str | None): Title scraped from HTML.
        text (str | None): Main text scraped from HTML.
        date_published (str | None): Date scraped from HTML.
        work_type (str): Target URL's ontological subtype. Default = "WebPage".
    """

    url: str
    title: str | None
    text: str | None
    date_published: str | None
    work_type: str = "WebPage"

    @classmethod
    def from_payload(
        cls,
        url: str,
        result: TrafilaturaResult,
    ) -> "NormalizedScrapedWebPage":
        return NormalizedScrapedWebPage(
            url=url, title=result.title, text=result.content, date_published=result.date
        )
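
A minimal sketch constructing the dataclass directly, with hypothetical field values; from_payload performs the same mapping from a Trafilatura result:

from minall.enrichment.article_text.normalizer import NormalizedScrapedWebPage

page = NormalizedScrapedWebPage(
    url="https://example.com",
    title="Example Domain",
    text="Example body text.",
    date_published=None,
)
assert page.work_type == "WebPage"  # default ontological subtype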

minall.enrichment.article_text.scraper

Class and helper function for scraping HTML.

This module's Scraper class enhances minet's request() and extract() methods by providing additional support for unexpected HTML encodings.

  1. Uses minet's request() method on a target URL to get a Response object.
  2. Verifies that the Response object is encoded in some form of utf-8.
  3. Extracts the HTML body from the Response. [text = response.text()]
  4. Uses bs4's fool-proof UnicodeDammit to parse the exact encoding. [UnicodeDammit(text, "html.parser").declared_html_encoding]
  5. Gives the encoding to bs4's BeautifulSoup to parse the HTML.
  6. Gives the BeautifulSoup result to minet's extract() method in order to return minet's TrafilaturaResult object.
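
Steps 4 and 5 are the heart of the encoding workaround; a minimal sketch with hypothetical HTML bytes, mirroring the bs4 calls used in the Scraper class below:

from bs4 import BeautifulSoup, UnicodeDammit

html = b"<html><head><meta charset='utf-8'></head><body><p>Hello</p></body></html>"

# Recover the encoding declared in the markup itself
encoding = UnicodeDammit(html, "html.parser").declared_html_encoding

# Decode with the recovered encoding, then re-serialize clean HTML for extract()
soup = BeautifulSoup(html, features="lxml", from_encoding=encoding)
clean_html = soup.decode(formatter="html")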

Scraper

Class to manage HTML scraping.

Examples:

>>> scraper = Scraper()
>>> url, result = scraper(url='https://zenodo.org/records/7974793')
>>> url == result.canonical_url
True
>>> result.title
'Minet, a webmining CLI tool & library for python.'
Source code in minall/enrichment/article_text/scraper.py
class Scraper:
    """Class to manage HTML scraping.

    Examples:
        >>> scraper = Scraper()
        >>> url, result = scraper(url='https://zenodo.org/records/7974793')
        >>> url == result.canonical_url
        True
        >>> result.title
        'Minet, a webmining CLI tool & library for python.'
    """

    def __init__(
        self, progress: Progress | None = None, total: int | None = None
    ) -> None:
        """If provided the context of a rich progress bar, save it to the class instance and add the task 'Scraping webpage'.

        Args:
            progress (Progress | None, optional): Context of a rich progress bar instance. Defaults to None.
            total (int | None, optional): Total number of items treated during progress context. Defaults to None.
        """
        self.progress = progress
        if progress:
            self.task_id = progress.add_task(
                description="[bold yellow]Scraping webpage", total=total
            )

    def __call__(self, url: str) -> Tuple[str, TrafilaturaResult | None]:
        """Requests and scrapes HTML, returning minet's Trafilatura Result object.

        Args:
            url (str): Target URL.

        Returns:
            Tuple[str, TrafilaturaResult | None]: The target URL and, if scraping was successful, minet's Trafilatura Result object.
        """
        if self.progress:
            self.progress.advance(self.task_id)
        result = None
        response = None

        # Request URL's HTML
        try:
            response = request(url)
        except Exception as e:
            logging.error(e)

        # Parse requested HTML
        if response and good_response(response):
            text = response.text()
            try:
                # Avoid input conversion error, deriving from inside Trafilatura's lxml dependency
                encoding = UnicodeDammit(text, "html.parser").declared_html_encoding
                soup = BeautifulSoup(text, features="lxml", from_encoding=encoding)
                text = soup.decode(formatter="html")
                result = extract(text)
            except Exception as e:
                logger.exception(e)
        return url, result
__call__(url)

Requests and scrapes HTML, returning minet's Trafilatura Result object.

Parameters:

Name Type Description Default
url str

Target URL.

required

Returns:

Type Description
Tuple[str, TrafilaturaResult | None]

Tuple[str, TrafilaturaResult | None]: The target URL and, if scraping was successful, minet's Trafilatura Result object.

Source code in minall/enrichment/article_text/scraper.py
def __call__(self, url: str) -> Tuple[str, TrafilaturaResult | None]:
    """Requests and scrapes HTML, returning minet's Trafilatura Result object.

    Args:
        url (str): Target URL.

    Returns:
        Tuple[str, TrafilaturaResult | None]: The target URL and, if scraping was successful, minet's Trafilatura Result object.
    """
    if self.progress:
        self.progress.advance(self.task_id)
    result = None
    response = None

    # Request URL's HTML
    try:
        response = request(url)
    except Exception as e:
        logging.error(e)

    # Parse requested HTML
    if response and good_response(response):
        text = response.text()
        try:
            # Avoid input conversion error, deriving from inside Trafilatura's lxml dependency
            encoding = UnicodeDammit(text, "html.parser").declared_html_encoding
            soup = BeautifulSoup(text, features="lxml", from_encoding=encoding)
            text = soup.decode(formatter="html")
            result = extract(text)
        except Exception as e:
            logger.exception(e)
    return url, result
__init__(progress=None, total=None)

If provided with a rich progress bar context, save it to the class instance and add the task 'Scraping webpage'.

Parameters:

Name Type Description Default
progress Progress | None

Context of a rich progress bar instance. Defaults to None.

None
total int | None

Total number of items treated during progress context. Defaults to None.

None
Source code in minall/enrichment/article_text/scraper.py
def __init__(
    self, progress: Progress | None = None, total: int | None = None
) -> None:
    """If provided the context of a rich progress bar, save it to the class instance and add the task 'Scraping webpage'.

    Args:
        progress (Progress | None, optional): Context of a rich progress bar instance. Defaults to None.
        total (int | None, optional): Total number of items treated during progress context. Defaults to None.
    """
    self.progress = progress
    if progress:
        self.task_id = progress.add_task(
            description="[bold yellow]Scraping webpage", total=total
        )

good_response(response)

Verifies that the response returned by minet's request method is valid for scraping.

Parameters:

Name Type Description Default
response Response

Response object returned from minet's request method.

required

Returns:

Type Description
Response | None

Response | None: If valid, the Response; otherwise None.

Source code in minall/enrichment/article_text/scraper.py
def good_response(response: Response) -> Response | None:
    """Verifies that the response that minet's request method returned is valid for scraping.

    Args:
        response (Response): Response object returned from minet's request method.

    Returns:
        Response | None: If valid, the Response; otherwise None.
    """
    if (
        response.is_text
        and response.encoding
        and "utf" in response.encoding
        and "8" in response.encoding
    ):
        return response
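
A minimal usage sketch pairing good_response with minet's request method (the import path for request is an assumption; the URL is hypothetical):

from minet.web import request  # import path assumed

from minall.enrichment.article_text.scraper import good_response

response = request("https://example.com")
if good_response(response):
    text = response.text()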

minall.enrichment.article_text.get_data

Module contains a function that runs the scraping feature.

get_data(data, outfile)

Iterating through the target URLs, scrape data and write to out-file.

Parameters:

Name Type Description Default
data list[str]

Set of target URLs for scraping.

required
outfile Path

Path to CSV file for writing normalized results.

required
Source code in minall/enrichment/article_text/get_data.py
def get_data(data: list[str], outfile: Path):
    """Iterating through the target URLs, scrape data and write to out-file.

    Args:
        data (list[str]): Set of target URLs for scraping.
        outfile (Path): Path to CSV file for writing normalized results.
    """
    with ContextManager(links_file=outfile) as contexts:
        writer, executor, progress = contexts
        scraper = Scraper(progress=progress, total=len(data))
        for url, result in executor.map(scraper, data):
            if result:
                formatted_result = NormalizedScrapedWebPage.from_payload(
                    url=url, result=result
                )
                writer.writerow(formatted_result.as_csv_dict_row())
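
A minimal usage sketch of the scraping entry point (the URL and out-file path are hypothetical):

from pathlib import Path

from minall.enrichment.article_text.get_data import get_data

get_data(
    data=["https://example.com/article"],
    outfile=Path("links.csv"),
)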

minall.enrichment.article_text.contexts

Context manager for scraper's CSV writers, multi-threader, and progress bar.

ContextManager

Source code in minall/enrichment/article_text/contexts.py
class ContextManager:
    def __init__(self, links_file: Path):
        """Set up class for scraper's contexts.

        Args:
            links_file (Path): Path to out-file for CSV writer.
        """
        self.links_file = links_file

    def __enter__(self) -> Tuple[csv.DictWriter, ThreadPoolExecutor, Progress]:
        """Start the scraper's context variables.

        Returns:
            Tuple[csv.DictWriter, ThreadPoolExecutor, Progress]: Context variables.
        """
        # Set up links file writer
        self.links_file_obj = open(self.links_file, mode="w", encoding="utf-8")
        self.links_file_writer = csv.DictWriter(
            self.links_file_obj, fieldnames=LinksConstants.col_names
        )
        self.links_file_writer.writeheader()

        # Set up multi-threading pool
        self.executor = ThreadPoolExecutor(max_workers=3)

        # Set up progress bar
        self.progress_bar = Progress(
            TextColumn("[progress.description]{task.description}"),
            SpinnerColumn(),
            MofNCompleteColumn(),
            TimeElapsedColumn(),
        )
        self.progress_bar.start()

        return (
            self.links_file_writer,
            self.executor,
            self.progress_bar,
        )

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Stop the scraper's context variables."""
        if self.links_file_obj:
            self.links_file_obj.close()
        self.executor.shutdown(wait=False, cancel_futures=True)
        self.progress_bar.stop()
__enter__()

Start the scraper's context variables.

Returns:

Type Description
Tuple[DictWriter, ThreadPoolExecutor, Progress]

Tuple[csv.DictWriter, ThreadPoolExecutor, Progress]: Context variables.

Source code in minall/enrichment/article_text/contexts.py
def __enter__(self) -> Tuple[csv.DictWriter, ThreadPoolExecutor, Progress]:
    """Start the scraper's context variables.

    Returns:
        Tuple[csv.DictWriter, ThreadPoolExecutor, Progress]: Context variables.
    """
    # Set up links file writer
    self.links_file_obj = open(self.links_file, mode="w", encoding="utf-8")
    self.links_file_writer = csv.DictWriter(
        self.links_file_obj, fieldnames=LinksConstants.col_names
    )
    self.links_file_writer.writeheader()

    # Set up multi-threading pool
    self.executor = ThreadPoolExecutor(max_workers=3)

    # Set up progress bar
    self.progress_bar = Progress(
        TextColumn("[progress.description]{task.description}"),
        SpinnerColumn(),
        MofNCompleteColumn(),
        TimeElapsedColumn(),
    )
    self.progress_bar.start()

    return (
        self.links_file_writer,
        self.executor,
        self.progress_bar,
    )
__exit__(exc_type, exc_val, exc_tb)

Stop the scraper's context variables.

Source code in minall/enrichment/article_text/contexts.py
def __exit__(self, exc_type, exc_val, exc_tb):
    """Stop the scraper's context variables."""
    if self.links_file_obj:
        self.links_file_obj.close()
    self.executor.shutdown(wait=False, cancel_futures=True)
    self.progress_bar.stop()
__init__(links_file)

Set up class for scraper's contexts.

Parameters:

Name Type Description Default
links_file Path

Path to out-file for CSV writer.

required
Source code in minall/enrichment/article_text/contexts.py
def __init__(self, links_file: Path):
    """Set up class for scraper's contexts.

    Args:
        links_file (Path): Path to out-file for CSV writer.
    """
    self.links_file = links_file
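
A minimal usage sketch tying the three context variables together, as get_data does above (the URL and out-file path are hypothetical):

from pathlib import Path

from minall.enrichment.article_text.contexts import ContextManager
from minall.enrichment.article_text.normalizer import NormalizedScrapedWebPage
from minall.enrichment.article_text.scraper import Scraper

urls = ["https://example.com"]

with ContextManager(links_file=Path("links.csv")) as (writer, executor, progress):
    scraper = Scraper(progress=progress, total=len(urls))
    for url, result in executor.map(scraper, urls):
        if result:
            row = NormalizedScrapedWebPage.from_payload(url=url, result=result)
            writer.writerow(row.as_csv_dict_row())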