Table tools

Backed by SQLite, the minall/tables module manages the data throughout the enrichment process, from the input dataset at the start to the updated, exported version at the end. The process relies on the following two tables:

  1. The links table, which is the backbone of the enrichment, stores the target URLs and their enriched metadata.

  2. The shared_content table, which is optional, stores the URLs of content embedded in or shared by the target URLs' web content.

As illustrated in the figure below, the two tables are related: the post_url column in the shared_content table refers to the url column in the links table. A target URL (url) in the links table can share zero or more items; depending on the URLs dataset, it may be that no entity in the links table has shared any content. Conversely, every entity in the shared_content table must relate to at least one entity in the links table, and a piece of content in the shared_content table can have been shared by one or more URLs in the links table.

erDiagram
    LINKS }|--o{ SHARED_CONTENT : shares
    LINKS {
        text url PK
        text domain
        text work_type
        text duration
        text identifier
        text date_published
        text date_modified
        text country_of_origin
        text abstract
        text keywords
        text title
        text text
        text hashtags
        text creator_type
        text creator_date_created
        text creator_location_created
        text creator_identifier
        integer creator_facebook_follow
        integer creator_facebook_subscribe
        integer creator_twitter_follow
        integer creator_youtube_subscribe
        integer creator_create_video
        text creator_name
        text creator_url
        integer facebook_comment
        integer facebook_like
        integer facebook_share
        integer pinterest_share
        integer twitter_share
        integer twitter_like
        integer tiktok_share
        integer tiktok_comment
        integer reddit_engagement
        integer youtube_watch
        integer youtube_comment
        integer youtube_like
        integer youtube_favorite
        integer youtube_subscribe
        integer create_video
    }
    SHARED_CONTENT {
        text post_url PK
        text content_url PK
        text media_type
        integer height
        integer width
    }
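
To make the relation concrete, here is a minimal, self-contained sketch of the schema above using Python's built-in sqlite3 module. The URLs and media types are purely illustrative, and the column lists are abridged; only the table names, key columns, and the REFERENCES clause come from the schema shown above.

import sqlite3

conn = sqlite3.connect(":memory:")
# Abridged versions of the two schemas shown in the diagram.
conn.execute("CREATE TABLE links(url TEXT, domain TEXT, PRIMARY KEY (url))")
conn.execute(
    """CREATE TABLE shared_content(
        post_url TEXT REFERENCES links(url) ON UPDATE CASCADE,
        content_url TEXT,
        media_type TEXT,
        height INTEGER,
        width INTEGER,
        PRIMARY KEY (post_url, content_url))"""
)
# One target URL...
conn.execute(
    "INSERT INTO links(url, domain) VALUES (?, ?)",
    ("https://example.com/post", "example.com"),
)
# ...that shares two pieces of content.
conn.executemany(
    "INSERT INTO shared_content(post_url, content_url, media_type) VALUES (?, ?, ?)",
    [
        ("https://example.com/post", "https://example.com/image.jpg", "image"),
        ("https://example.com/post", "https://example.com/clip.mp4", "video"),
    ],
)
# Join the shared items back onto their target URL.
rows = conn.execute(
    "SELECT l.url, s.content_url FROM links l JOIN shared_content s ON s.post_url = l.url"
).fetchall()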

minall.tables.links

LinksConstants dataclass

Dataclass to manage 'links' table.

This dataclass manages the 'links' table's required column names and their data types.

Attributes:

  • table_name (str): Name of the table. Default = "links".
  • primary_key (str): Text string of the primary key. Default = "url".
  • pk_list (list): List of primary key columns. Default = ["url"].
  • dtypes (dict): Key-value pairs of column names and SQLite data type descriptions.
  • col_names (list): List of column names.

Source code in minall/tables/links.py, lines 15–73
@dataclass
class LinksConstants:
    """Dataclass to manage 'links' table.

    This dataclass manages the 'links' table's required column names and their data types.

    Attributes:
        table_name (str): Name of the table. Default = "links".
        primary_key (str): Text string of primary key. Default = "url".
        pk_list (list): List of primary key columns. Default = ["url"]
        dtypes (dict): Key-value pairs of column names and SQLite data type descriptions.
        col_names (list): List of column names.
    """

    table_name: str = "links"
    primary_key: str = "url"
    pk_list = ["url"]
    dtypes = {
        "url": "TEXT",
        "domain": "TEXT",
        "work_type": "TEXT",
        "duration": "TEXT",
        "identifier": "TEXT",
        "date_published": "TEXT",
        "date_modified": "TEXT",
        "country_of_origin": "TEXT",
        "abstract": "TEXT",
        "keywords": "TEXT",
        "title": "TEXT",
        "text": "TEXT",
        "hashtags": "TEXT",
        "creator_type": "TEXT",
        "creator_date_created": "TEXT",
        "creator_location_created": "TEXT",
        "creator_identifier": "TEXT",
        "creator_facebook_follow": "INTEGER",
        "creator_facebook_subscribe": "INTEGER",
        "creator_twitter_follow": "INTEGER",
        "creator_youtube_subscribe": "INTEGER",
        "creator_create_video": "INTEGER",
        "creator_name": "TEXT",
        "creator_url": "TEXT",
        "facebook_comment": "INTEGER",
        "facebook_like": "INTEGER",
        "facebook_share": "INTEGER",
        "pinterest_share": "INTEGER",
        "twitter_share": "INTEGER",
        "twitter_like": "INTEGER",
        "tiktok_share": "INTEGER",
        "tiktok_comment": "INTEGER",
        "reddit_engagement": "INTEGER",
        "youtube_watch": "INTEGER",
        "youtube_comment": "INTEGER",
        "youtube_like": "INTEGER",
        "youtube_favorite": "INTEGER",
        "youtube_subscribe": "INTEGER",
        "create_video": "INTEGER",
    }
    col_names = dtypes.keys()
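
As a quick usage sketch (a hedged illustration, not part of the documented API), the constants can be read off the class or an instance when composing queries against the 'links' table:

from minall.tables.links import LinksConstants

constants = LinksConstants()
assert constants.table_name == "links"
assert constants.primary_key == "url"
# The standard column names and their SQLite types:
print(len(LinksConstants.dtypes), "standard columns")
print(list(LinksConstants.col_names)[:5])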

LinksTable

Bases: BaseTable

Class for creating, updating, and reading SQL table for target URLs.

Source code in minall/tables/links.py, lines 76–163
class LinksTable(BaseTable):
    """Class for creating, updating, and reading SQL table for target URLs."""

    dtypes = LinksConstants.dtypes
    name = LinksConstants.table_name
    pk_list = LinksConstants.pk_list

    def __init__(
        self, conn: Connection, infile: Path, outfile: Path, url_col: str | None = None
    ):
        """In database connection, create SQL table and populate with data from target URLs dataset file.

        Args:
            conn (Connection): SQLite connection.
            infile (Path): Path to URLs dataset file.
            outfile (Path): Path to CSV file where the table will be exported.
            url_col (str | None, optional): Column name of target URLs. Defaults to None.

        Raises:
            NoCSVHeaders: Dataset file does not have headers.
            KeyError: User did not define URL column and dataset file does not have default column 'url'.
            NoURLColumn: User-defined URL column not found in dataset file.
        """
        # Update the table's columns to include all in-file columns
        self.dtypes = self._parse_infile_columns(
            infile=infile, url_col=url_col, constant_cols=self.dtypes
        )
        # Inherit the parent base class
        super().__init__(
            name=self.name,
            pk=self.pk_list,
            dtypes=self.dtypes,
            conn=conn,
            outfile=outfile,
        )

        # Insert in-file data
        with open(infile) as f:
            reader = csv.DictReader(f)

            # Confirm the in-file is compatible with the table
            headers = reader.fieldnames
            if not headers:
                raise NoCSVHeaders()
            elif len(set(headers).difference(self.dtype_dict.keys())) > 0:
                raise KeyError()

            # Insert the in-file data
            for row in reader:
                if url_col:
                    row.update({"url": row[url_col]})
                placeholder = ", ".join(["?" for _ in range(len(row.items()))])
                cols, values = ", ".join(row.keys()), tuple(list(row.values()))
                query = """
                INSERT OR IGNORE INTO {table}({cols})
                VALUES ({placeholder})
                """.format(
                    table=self.name, cols=cols, placeholder=placeholder
                )
                self.execute(query=query, values=values)

    def _parse_infile_columns(
        self, infile: Path, constant_cols: Dict, url_col: str | None
    ) -> Dict:
        """During init method, modify table columns to include in-file's columns.

        Args:
            infile (Path): Path to URLs dataset.
            constant_cols (Dict): Key-value pairs of table's standard columns and data types.
            url_col (str | None, optional): Column name of target URLs.

        Raises:
            NoCSVHeaders: The infile does not have headers.
            KeyError: The infile does not have a recognizable URL column.
            NoURLColumn: The infile does not have the user-declared URL column.

        Returns:
            Dict: Key-value pairs of table's column names and data types.
        """
        with casanova.reader(infile) as reader:
            headers = reader.headers
        if not headers:
            raise NoCSVHeaders()
        elif not url_col and not "url" in headers:
            raise KeyError()
        elif url_col and not url_col in headers:
            raise NoURLColumn(url_col=url_col)
        dtypes = {col: "TEXT" for col in headers}
        return dtypes | constant_cols
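
A minimal usage sketch, assuming an input file input_urls.csv whose URL column is named "link" (both names are hypothetical); connect_to_database comes from minall.utils.database, as in the examples further below:

from pathlib import Path

from minall.tables.links import LinksTable
from minall.utils.database import connect_to_database

conn = connect_to_database()
table = LinksTable(
    conn=conn,
    infile=Path("input_urls.csv"),       # hypothetical input dataset
    outfile=Path("enriched_links.csv"),  # where export() will write the table
    url_col="link",                      # column holding the target URLs
)
# ... enrichment happens here ...
table.export()  # write the table's current state to the out-file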

__init__(conn, infile, outfile, url_col=None)

In database connection, create SQL table and populate with data from target URLs dataset file.

Parameters:

  • conn (Connection): SQLite connection. Required.
  • infile (Path): Path to URLs dataset file. Required.
  • outfile (Path): Path to CSV file where the table will be exported. Required.
  • url_col (str | None, optional): Column name of target URLs. Defaults to None.

Raises:

  • NoCSVHeaders: Dataset file does not have headers.
  • KeyError: User did not define a URL column and the dataset file does not have the default column 'url'.
  • NoURLColumn: User-defined URL column not found in the dataset file.

Source code in minall/tables/links.py, lines 83–134
def __init__(
    self, conn: Connection, infile: Path, outfile: Path, url_col: str | None = None
):
    """In database connection, create SQL table and populate with data from target URLs dataset file.

    Args:
        conn (Connection): SQLite connection.
        infile (Path): Path to URLs dataset file.
        outfile (Path): Path to CSV file where the table will be exported.
        url_col (str | None, optional): Column name of target URLs. Defaults to None.

    Raises:
        NoCSVHeaders: Dataset file does not have headers.
        KeyError: User did not define URL column and dataset file does not have default column 'url'.
        NoURLColumn: User-defined URL column not found in dataset file.
    """
    # Update the table's columns to include all in-file columns
    self.dtypes = self._parse_infile_columns(
        infile=infile, url_col=url_col, constant_cols=self.dtypes
    )
    # Inherit the parent base class
    super().__init__(
        name=self.name,
        pk=self.pk_list,
        dtypes=self.dtypes,
        conn=conn,
        outfile=outfile,
    )

    # Insert in-file data
    with open(infile) as f:
        reader = csv.DictReader(f)

        # Confirm the in-file is compatible with the table
        headers = reader.fieldnames
        if not headers:
            raise NoCSVHeaders()
        elif len(set(headers).difference(self.dtype_dict.keys())) > 0:
            raise KeyError()

        # Insert the in-file data
        for row in reader:
            if url_col:
                row.update({"url": row[url_col]})
            placeholder = ", ".join(["?" for _ in range(len(row.items()))])
            cols, values = ", ".join(row.keys()), tuple(list(row.values()))
            query = """
            INSERT OR IGNORE INTO {table}({cols})
            VALUES ({placeholder})
            """.format(
                table=self.name, cols=cols, placeholder=placeholder
            )
            self.execute(query=query, values=values)

minall.tables.shared_content

ShareContentConstants dataclass

Dataclass to manage 'shared_content' table.

This dataclass manages the 'shared_content' table's required column names and their data types. Being a dataclass, however, the instance of the class can also be subsequently modified to include other column names (and their data types) according to the input data. The 'shared_content' table is meant to relate to the 'links' table, wherein the former's 'post_url' column refers to the latter's 'url' column.

Contrary to the 'links' table, whose primary key column can be derived from any declared target URL column in the input data, the 'shared_content' table requires that the input data have the two columns that jointly compose its primary key: 'post_url' and 'content_url'.

Attributes:

  • table_name (str): Name of the table. Default = "shared_content".
  • primary_key (str): Text string of the composite primary key. Default = "post_url,content_url".
  • pk_list (list): List of composite primary key columns. Default = ["post_url", "content_url"].
  • dtypes (dict): Key-value pairs of column names and SQLite data type descriptions.
  • col_names (list): List of column names.

Source code in minall/tables/shared_content.py, lines 17–43
@dataclass
class ShareContentConstants:
    """Dataclass to manage 'shared_content' table.

    This dataclass manages the 'shared_content' table's required column names and their data types. Being a dataclass, however, the instance of the class can also be subsequently modified to include other column names (and their data types) according to the input data. The 'shared_content' table is meant to relate to the 'links' table, wherein the former's 'post_url' column refers to the latter's 'url' column.

    Contrary to the 'links' table, whose primary key column can be derived from any declared target URL column in the input data, the 'shared_content' table requires that the input data have the two columns that jointly compose its primary key: 'post_url' and 'content_url'.

    Attributes:
        table_name (str): Name of the table. Default = "shared_content".
        primary_key (str): Text string of composite primary key. Default = "post_url,content_url".
        pk_list (list): List of composite primary key columns. Default = ["post_url", "content_url"]
        dtypes (dict): Key-value pairs of column names and SQLite data type descriptions.
        col_names (list): List of column names.
    """

    table_name = "shared_content"
    primary_key = "post_url,content_url"
    pk_list = ["post_url", "content_url"]
    dtypes = {
        "post_url": f"TEXT REFERENCES {LinksConstants.table_name}(url) ON UPDATE CASCADE",
        "media_type": "TEXT",
        "content_url": "TEXT",
        "height": "INTEGER",
        "width": "INTEGER",
    }
    col_names = dtypes.keys()
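
For example, the foreign-key relation to the 'links' table is baked into the dtypes mapping (a small illustrative check, not part of the documented API):

from minall.tables.shared_content import ShareContentConstants

print(ShareContentConstants.pk_list)             # ['post_url', 'content_url']
print(ShareContentConstants.dtypes["post_url"])  # TEXT REFERENCES links(url) ON UPDATE CASCADE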

SharedContentTable

Bases: BaseTable

Class for creating, updating, and reading SQL table for shared media content, each of which is related to 1 or more entities in the 'links' table.

Source code in minall/tables/shared_content.py, lines 46–126
class SharedContentTable(BaseTable):
    """Class for creating, updating, and reading SQL table for shared media content, each of which is related to 1 or more entities in the 'links' table."""

    dtypes = ShareContentConstants.dtypes
    name = ShareContentConstants.table_name
    pk_list = ShareContentConstants.pk_list

    def __init__(self, conn: Connection, infile: Path | None, outfile: Path):
        """In database connection, create SQL table. If the user provides an existing shared_content.csv file, populate the table with that input.

        Args:
            conn (Connection): SQLite connection.
            infile (Path | None): Path to shared content dataset file, if any.
            outfile (Path): Path to CSV file where the table will be exported.

        Raises:
            NoCSVHeaders: Dataset file does not have headers.
            NoPrimaryKeyColumns: One or more of the required columns ('post_url', 'content_url') is not in the dataset file.
        """
        # Update the table's columns to include all in-file columns
        if infile:
            self.dtypes = self._parse_infile_columns(
                infile=infile, constant_cols=self.dtypes
            )
        # Inherit the parent base class
        super().__init__(
            name=self.name,
            pk=self.pk_list,
            dtypes=self.dtypes,
            conn=conn,
            outfile=outfile,
        )

        # Insert in-file data
        if infile:
            with open(infile) as f:
                reader = csv.DictReader(f)

                # Confirm the in-file is compatible with the table
                headers = reader.fieldnames
                if not headers:
                    raise NoCSVHeaders()
                elif len(set(headers).difference(self.dtype_dict.keys())) > 0:
                    raise KeyError()

                # Insert the in-file data
                for row in reader:
                    placeholder = ", ".join(["?" for _ in range(len(row.items()))])
                    cols, values = ", ".join(row.keys()), tuple(list(row.values()))
                    query = """
                    INSERT OR IGNORE INTO {table}({cols})
                    VALUES ({placeholder})
                    """.format(
                        table=self.name, cols=cols, placeholder=placeholder
                    )
                    self.execute(query=query, values=values)

    def _parse_infile_columns(self, infile: Path, constant_cols: Dict) -> Dict:
        """During init method, modify table columns to include in-file's columns.

        Args:
            infile (Path): Path to shared content dataset.
            constant_cols (Dict): Key-value pairs of table's standard columns and data types.

        Raises:
            NoCSVHeaders: The infile does not have headers.
            NoPrimaryKeyColumns: The infile lacks one of the required primary key columns ('post_url', 'content_url').

        Returns:
            Dict: Key-value pairs of table's column names and data types.
        """
        with casanova.reader(infile) as reader:
            headers = reader.headers
        if not headers:
            raise NoCSVHeaders()
        diff = set(self.pk_list).difference(headers)
        if len(diff) > 0:
            raise NoPrimaryKeyColumns(col=list(diff)[0])
        dtypes = {col: "TEXT" for col in headers}
        return dtypes | constant_cols
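
A hedged usage sketch: since the in-file is optional, the table can be created empty and filled later during enrichment (the file path below is hypothetical):

from pathlib import Path

from minall.tables.shared_content import SharedContentTable
from minall.utils.database import connect_to_database

conn = connect_to_database()
# With infile=None the table is created empty and populated during enrichment;
# to seed it instead, pass a CSV that has the 'post_url' and 'content_url' columns.
table = SharedContentTable(conn=conn, infile=None, outfile=Path("shared_content.csv"))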

__init__(conn, infile, outfile)

In database connection, create SQL table. If the user provides an existing shared_content.csv file, populate the table with that input.

Parameters:

  • conn (Connection): SQLite connection. Required.
  • infile (Path | None): Path to shared content dataset file, if any. Required.
  • outfile (Path): Path to CSV file where the table will be exported. Required.

Raises:

  • NoCSVHeaders: Dataset file does not have headers.
  • NoPrimaryKeyColumns: One or more of the required columns ('post_url', 'content_url') is not in the dataset file.

Source code in minall/tables/shared_content.py, lines 53–100
def __init__(self, conn: Connection, infile: Path | None, outfile: Path):
    """In database connection, create SQL table. If the user provides an existing shared_content.csv file, populate the table with that input.

    Args:
        conn (Connection): SQLite connection.
        infile (Path | None): Path to shared content dataset file, if any.
        outfile (Path): Path to CSV file where the table will be exported.

    Raises:
        NoCSVHeaders: Dataset file does not have headers.
        NoPrimaryKeyColumns: One or more of the required columns ('post_url', 'content_url') is not in the dataset file.
    """
    # Update the table's columns to include all in-file columns
    if infile:
        self.dtypes = self._parse_infile_columns(
            infile=infile, constant_cols=self.dtypes
        )
    # Inherit the parent base class
    super().__init__(
        name=self.name,
        pk=self.pk_list,
        dtypes=self.dtypes,
        conn=conn,
        outfile=outfile,
    )

    # Insert in-file data
    if infile:
        with open(infile) as f:
            reader = csv.DictReader(f)

            # Confirm the in-file is compatible with the table
            headers = reader.fieldnames
            if not headers:
                raise NoCSVHeaders()
            elif len(set(headers).difference(self.dtype_dict.keys())) > 0:
                raise KeyError()

            # Insert the in-file data
            for row in reader:
                placeholder = ", ".join(["?" for _ in range(len(row.items()))])
                cols, values = ", ".join(row.keys()), tuple(list(row.values()))
                query = """
                INSERT OR IGNORE INTO {table}({cols})
                VALUES ({placeholder})
                """.format(
                    table=self.name, cols=cols, placeholder=placeholder
                )
                self.execute(query=query, values=values)

minall.tables.base

Create and execute queries on SQLite tables.

This module contains the BaseTable class, which manages the SQLite database's tables through the following methods: export, update_from_csv, coalesce_statement, create_query, execute, and select_from.

BaseTable

Base class for SQLite tables.

Source code in minall/tables/base.py, lines 14–193
class BaseTable:
    """Base class for SQLite tables."""

    def __init__(
        self, name: str, pk: List[str], conn: Connection, dtypes: Dict, outfile: Path
    ) -> None:
        """Create the SQL table with the given columns and data types.

        Args:
            name (str): Table name.
            pk (List[str]): List of primary keys.
            conn (Connection): SQLite connection.
            dtypes (Dict): Key-value pairs of column names and data types.
            outfile (Path): Path to CSV file where the table will be exported.
        """
        self.conn = conn
        self.name = name
        self.pk_list = pk
        self.pk_str = ",".join(pk)
        self.dtype_dict = dtypes
        self.outfile = outfile

        # Create the table
        self.execute(query=f"DROP TABLE IF EXISTS {self.name}")
        self.execute(query=self.create_query)

    def export(self, outfile: Path | None = None):
        """Write the SQL table to a CSV file.

        Args:
            outfile (Path | None, optional): Path to out-file. Defaults to None.
        """
        if not outfile:
            outfile = self.outfile
        cursor = self.conn.cursor()
        headers = [
            t[1]
            for t in cursor.execute(
                "SELECT * FROM pragma_table_info('{}');".format(self.name)
            ).fetchall()
        ]
        rows = cursor.execute(f"SELECT * FROM {self.name}").fetchall()
        with open(outfile, "w") as f:
            writer = csv.writer(f)
            writer.writerow(headers)
            for row in rows:
                writer.writerow(row)

    def update_from_csv(self, datafile: Path):
        """Reading from a CSV file, update the table rows.

        Args:
            datafile (Path): Path to file with new data.
        """
        with open(datafile) as f:
            reader = csv.DictReader(f)
            for row in reader:
                placeholder = ", ".join(["?" for _ in range(len(row.items()))])
                cols_in_csv = ", ".join(row.keys())
                # Replace empty strings in values set with None
                values = []
                for v in row.values():
                    if v == "":
                        values.append(None)
                    else:
                        values.append(v)
                coalesce_stmt = self.coalesce_statement(row.keys())
                query = """
                INSERT INTO {table}({cols_in_csv})
                VALUES ({placeholder})
                ON CONFLICT ({pk})
                DO UPDATE SET {coalesce_stmt}
                """.format(
                    table=self.name,
                    cols_in_csv=cols_in_csv,
                    placeholder=placeholder,
                    pk=self.pk_str,
                    coalesce_stmt=coalesce_stmt,
                )
                self.execute(query=query, values=tuple(values))

    def coalesce_statement(self, cols: Iterable[str]) -> str:
        """Compose SQL coalesce statement from columns to be updated.

        Examples:
            >>> # Set up connection and columns / data types for table.
            >>> from minall.utils.database import connect_to_database
            >>> columns_n_datatypes = {"url": "TEXT", "domain": "TEXT", "work_type": "TEXT"}
            >>>
            >>> # Create table.
            >>> table = BaseTable(name="test", pk=["url"], conn=connect_to_database(), dtypes=columns_n_datatypes, outfile=Path("test.csv"))
            >>>
            >>> # Compose SQL statement to replace table's row with new data.
            >>> table.coalesce_statement(cols=columns_n_datatypes.keys())
            'domain=COALESCE(excluded.domain, domain), work_type=COALESCE(excluded.work_type, work_type)'

        Args:
            cols (Iterable[str]): Row columns.

        Returns:
            str: SQL statement.
        """
        return ", ".join(
            [f"{k}=COALESCE(excluded.{k}, {k})" for k in cols if k not in self.pk_list]
        )

    @property
    def create_query(self) -> str:
        """SQL statement to create table.

        Examples:
            >>> # Set up connection and columns / data types for table.
            >>> from minall.utils.database import connect_to_database
            >>> columns_n_datatypes = {"url": "TEXT", "domain": "TEXT", "work_type": "TEXT"}
            >>>
            >>> # Create table.
            >>> table = BaseTable(name="test", pk=["url"], conn=connect_to_database(), dtypes=columns_n_datatypes, outfile=Path("test.csv"))
            >>>
            >>> # Compose SQL statement to create table.
            >>> table.create_query
            'CREATE TABLE IF NOT EXISTS test(url TEXT, domain TEXT, work_type TEXT, PRIMARY KEY (url))'


        Returns:
            str: SQL statement to create the table.
        """
        cols = ", ".join([f"{k} {v}" for k, v in self.dtype_dict.items()])
        return (
            """CREATE TABLE IF NOT EXISTS {table}({cols}, PRIMARY KEY ({pk}))""".format(
                table=self.name, cols=cols, pk=self.pk_str
            )
        )

    def execute(self, query: str, values: Tuple | None = None):
        """Function to commit a query to the database connection.

        Args:
            query (str): SQL statement.
            values (Tuple | None, optional): Values to be inserted in the query's placeholders. Defaults to None.

        Raises:
            e: SQLite Exception.
        """
        cursor = self.conn.cursor()
        try:
            if values:
                cursor.execute(query, values)
            else:
                cursor.execute(query)
        except Exception as e:
            print("\n\n", query, "\n\n")
            raise e

    def select_from(self, cols: str, filter: str | None = None) -> List:
        """Function to select rows from the SQL table.

        Args:
            cols (str): Target of SELECT statement.
            filter (str | None, optional): Where condition to apply after FROM statement. Defaults to None.

        Raises:
            e: SQLite Exception.

        Returns:
            List: List of rows.
        """
        if not filter:
            filter = ""
        else:
            filter = " " + filter
        query = f"select {cols} from {self.name}{filter}"
        cursor = self.conn.cursor()
        try:
            response = cursor.execute(query)
            self.conn.commit()
        except Exception as e:
            print("\n\n", query, "\n\n")
            raise e
        else:
            return response.fetchall()
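
The interplay of update_from_csv and coalesce_statement is worth a sketch: because each non-key column is updated with COALESCE(excluded.col, col), an empty CSV cell (converted to None) never overwrites a value already in the table. A minimal example, with a hypothetical updates.csv:

from pathlib import Path

from minall.tables.base import BaseTable
from minall.utils.database import connect_to_database

table = BaseTable(
    name="test",
    pk=["url"],
    conn=connect_to_database(),
    dtypes={"url": "TEXT", "domain": "TEXT", "work_type": "TEXT"},
    outfile=Path("test.csv"),
)
table.execute(
    "INSERT INTO test(url, domain) VALUES (?, ?)",
    ("https://example.com", "example.com"),
)
# Suppose updates.csv contains:
#   url,domain,work_type
#   https://example.com,,Article
table.update_from_csv(Path("updates.csv"))
# The empty 'domain' cell was coalesced away, so the existing value survives:
# [('https://example.com', 'example.com', 'Article')]
print(table.select_from(cols="*"))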

create_query: str property

SQL statement to create table.

Examples:

>>> # Set up connection and columns / data types for table.
>>> from minall.utils.database import connect_to_database
>>> columns_n_datatypes = {"url": "TEXT", "domain": "TEXT", "work_type": "TEXT"}
>>>
>>> # Create table.
>>> table = BaseTable(name="test", pk=["url"], conn=connect_to_database(), dtypes=columns_n_datatypes, outfile=Path("test.csv"))
>>>
>>> # Compose SQL statement to create table.
>>> table.create_query
'CREATE TABLE IF NOT EXISTS test(url TEXT, domain TEXT, work_type TEXT, PRIMARY KEY (url))'

Returns:

  • str: SQL statement to create the table.

__init__(name, pk, conn, dtypes, outfile)

Create the SQL table with the given columns and data types.

Parameters:

  • name (str): Table name. Required.
  • pk (List[str]): List of primary keys. Required.
  • conn (Connection): SQLite connection. Required.
  • dtypes (Dict): Key-value pairs of column names and data types. Required.
  • outfile (Path): Path to CSV file where the table will be exported. Required.
Source code in minall/tables/base.py, lines 17–38
def __init__(
    self, name: str, pk: List[str], conn: Connection, dtypes: Dict, outfile: Path
) -> None:
    """Create the SQL table with the given columns and data types.

    Args:
        name (str): Table name.
        pk (List[str]): List of primary keys.
        conn (Connection): SQLite connection.
        dtypes (Dict): Key-value pairs of column names and data types.
        outfile (Path): Path to CSV file where the table will be exported.
    """
    self.conn = conn
    self.name = name
    self.pk_list = pk
    self.pk_str = ",".join(pk)
    self.dtype_dict = dtypes
    self.outfile = outfile

    # Create the table
    self.execute(query=f"DROP TABLE IF EXISTS {self.name}")
    self.execute(query=self.create_query)

coalesce_statement(cols)

Compose SQL coalesce statement from columns to be updated.

Examples:

>>> # Set up connection and columns / data types for table.
>>> from minall.utils.database import connect_to_database
>>> columns_n_datatypes = {"url": "TEXT", "domain": "TEXT", "work_type": "TEXT"}
>>>
>>> # Create table.
>>> table = BaseTable(name="test", pk=["url"], conn=connect_to_database(), dtypes=columns_n_datatypes, outfile=Path("test.csv"))
>>>
>>> # Compose SQL statement to replace table's row with new data.
>>> table.coalesce_statement(cols=columns_n_datatypes.keys())
'domain=COALESCE(excluded.domain, domain), work_type=COALESCE(excluded.work_type, work_type)'

Parameters:

  • cols (Iterable[str]): Row columns. Required.

Returns:

  • str: SQL statement.

Source code in minall/tables/base.py, lines 95–118
def coalesce_statement(self, cols: Iterable[str]) -> str:
    """Compose SQL coalesce statement from columns to be updated.

    Examples:
        >>> # Set up connection and columns / data types for table.
        >>> from minall.utils.database import connect_to_database
        >>> columns_n_datatypes = {"url": "TEXT", "domain": "TEXT", "work_type": "TEXT"}
        >>>
        >>> # Create table.
        >>> table = BaseTable(name="test", pk=["url"], conn=connect_to_database(), dtypes=columns_n_datatypes, outfile=Path("test.csv"))
        >>>
        >>> # Compose SQL statement to replace table's row with new data.
        >>> table.coalesce_statement(cols=columns_n_datatypes.keys())
        'domain=COALESCE(excluded.domain, domain), work_type=COALESCE(excluded.work_type, work_type)'

    Args:
        cols (Iterable[str]): Row columns.

    Returns:
        str: SQL statement.
    """
    return ", ".join(
        [f"{k}=COALESCE(excluded.{k}, {k})" for k in cols if k not in self.pk_list]
    )

execute(query, values=None)

Function to commit a query to the database connection.

Parameters:

  • query (str): SQL statement. Required.
  • values (Tuple | None, optional): Values to be inserted in the query's placeholders. Defaults to None.

Raises:

  • e: SQLite Exception.

Source code in minall/tables/base.py, lines 147–165
def execute(self, query: str, values: Tuple | None = None):
    """Function to commit a query to the database connection.

    Args:
        query (str): SQL statement.
        values (Tuple | None, optional): Values to be inserted in the query's placeholders. Defaults to None.

    Raises:
        e: SQLite Exception.
    """
    cursor = self.conn.cursor()
    try:
        if values:
            cursor.execute(query, values)
        else:
            cursor.execute(query)
    except Exception as e:
        print("\n\n", query, "\n\n")
        raise e

export(outfile=None)

Write the SQL table to a CSV file.

Parameters:

  • outfile (Path | None, optional): Path to out-file. Defaults to None.
Source code in minall/tables/base.py, lines 40–60
def export(self, outfile: Path | None = None):
    """Write the SQL table to a CSV file.

    Args:
        outfile (Path | None, optional): Path to out-file. Defaults to None.
    """
    if not outfile:
        outfile = self.outfile
    cursor = self.conn.cursor()
    headers = [
        t[1]
        for t in cursor.execute(
            "SELECT * FROM pragma_table_info('{}');".format(self.name)
        ).fetchall()
    ]
    rows = cursor.execute(f"SELECT * FROM {self.name}").fetchall()
    with open(outfile, "w") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        for row in rows:
            writer.writerow(row)

select_from(cols, filter=None)

Function to select rows from the SQL table.

Parameters:

  • cols (str): Target of SELECT statement. Required.
  • filter (str | None, optional): WHERE condition to apply after the FROM statement. Defaults to None.

Raises:

  • e: SQLite Exception.

Returns:

  • List: List of rows.

Source code in minall/tables/base.py, lines 167–193
def select_from(self, cols: str, filter: str | None = None) -> List:
    """Function to select rows from the SQL table.

    Args:
        cols (str): Target of SELECT statement.
        filter (str | None, optional): Where condition to apply after FROM statement. Defaults to None.

    Raises:
        e: SQLite Exception.

    Returns:
        List: List of rows.
    """
    if not filter:
        filter = ""
    else:
        filter = " " + filter
    query = f"select {cols} from {self.name}{filter}"
    cursor = self.conn.cursor()
    try:
        response = cursor.execute(query)
        self.conn.commit()
    except Exception as e:
        print("\n\n", query, "\n\n")
        raise e
    else:
        return response.fetchall()

update_from_csv(datafile)

Reading from a CSV file, update the table rows.

Parameters:

  • datafile (Path): Path to file with new data. Required.
Source code in minall/tables/base.py, lines 62–93
def update_from_csv(self, datafile: Path):
    """Reading from a CSV file, update the table rows.

    Args:
        datafile (Path): Path to file with new data.
    """
    with open(datafile) as f:
        reader = csv.DictReader(f)
        for row in reader:
            placeholder = ", ".join(["?" for _ in range(len(row.items()))])
            cols_in_csv = ", ".join(row.keys())
            # Replace empty strings in values set with None
            values = []
            for v in row.values():
                if v == "":
                    values.append(None)
                else:
                    values.append(v)
            coalesce_stmt = self.coalesce_statement(row.keys())
            query = """
            INSERT INTO {table}({cols_in_csv})
            VALUES ({placeholder})
            ON CONFLICT ({pk})
            DO UPDATE SET {coalesce_stmt}
            """.format(
                table=self.name,
                cols_in_csv=cols_in_csv,
                placeholder=placeholder,
                pk=self.pk_str,
                coalesce_stmt=coalesce_stmt,
            )
            self.execute(query=query, values=tuple(values))

minall.tables.exceptions

Exceptions for validating CSV files used to create SQLite tables.

This module contains exceptions to manage the process of validating a CSV file given as input for the enrichment process. The module contains the following exceptions:

  • NoCSVHeaders - The CSV does not have headers.
  • NoURLColumn - When building the 'links' table, the declared URL column is not in the CSV file.
  • NoPrimaryKeyColumns - When building the 'shared_content' table, either the 'post_url' column or the 'content_url' column is missing from the CSV file.

When creating the 'links' table, the input CSV file must have a column for URLs; the URLs must be cleaned and/or ready to serve as the source for the data collection. The name of the URL column can vary and must be declared.

When creating the 'shared_content' table, the column names are not modifiable. The CSV must have the columns 'post_url' and 'content_url'; the former relates to a URL in the 'links' table, and the latter indicates a URL for content embedded in the web content of the former.
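
A hedged sketch of how these exceptions might be handled when building the 'links' table from a malformed CSV (the file path and column name below are hypothetical):

from pathlib import Path

from minall.tables.exceptions import NoCSVHeaders, NoURLColumn
from minall.tables.links import LinksTable
from minall.utils.database import connect_to_database

try:
    LinksTable(
        conn=connect_to_database(),
        infile=Path("input_urls.csv"),
        outfile=Path("links.csv"),
        url_col="target_url",
    )
except NoCSVHeaders:
    print("The CSV file has no headers.")
except NoURLColumn as e:
    print(e)  # "The declared URL column 'target_url' is not a header in the given CSV file."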

NoCSVHeaders

Bases: Exception

The CSV in-file lacks headers.

Source code in minall/tables/exceptions.py, lines 17–22
class NoCSVHeaders(Exception):
    """The CSV in-file lacks headers."""

    def __init__(self) -> None:
        message = "No headers detected in CSV file."
        super().__init__(message)

NoPrimaryKeyColumns

Bases: Exception

The CSV in-file is missing a required column.

Source code in minall/tables/exceptions.py, lines 33–38
class NoPrimaryKeyColumns(Exception):
    """The CSV in-file is missing a required column."""

    def __init__(self, col: str) -> None:
        message = f"Required primary key column '{col}' is not a header in the given CSV file."
        super().__init__(message)

NoURLColumn

Bases: Exception

The CSV in-file is missing a user-declared column.

Source code in minall/tables/exceptions.py, lines 25–30
class NoURLColumn(Exception):
    """The CSV in-file is missing a user-declared column."""

    def __init__(self, url_col: str) -> None:
        message = f"The declared URL column '{url_col}' is not a header in the given CSV file."
        super().__init__(message)