Table tools
With SQLite, the module minall/tables
manages the data during the enrichment process, from the data input at the start to the updated version exported at the end. The process relies on the following two tables:
-
The
links
table, which is the backbone of the enrichment, stores the target URLs and their enriched metadata. -
The
shared_content
table, which is optional, stores URLs pointing to content shared via the target URLs' content.
As illustrated in the figure below, the two tables are related. The target URL (url
) in the links
table refers to the post_url
in the shared_content
table. A target URL (url
) in the links
table can share 0 or more items. Depending on the URLs dataset, it could be the case that no entities in the links
table have shared any content. All entities in the shared_content
must relate to at least one entity in the links
table. Content in the shared_content
table can have been shared by 1 or more URLs in the links
table.
erDiagram
LINKS }|--o{ SHARED_CONTENT : shares
LINKS {
text url PK
text domain
text work_type
text duration
text identifier
text date_published
text date_modified
text country_of_origin
text abstract
text keywords
text title
text text
text hashtags
text creator_type
text creator_date_created
text creator_identifier
integer creator_facebook_follow
integer creator_facebook_subscribe
integer creator_twitter_follow
integer creator_youtube_subscribe
integer creator_create_video
text creator_name
text creator_url
integer facebook_comment
integer facebook_like
integer facebook_share
integer pinterest_share
integer twitter_share
integer tiktok_share
integer tiktok_comment
integer reddit_engagement
integer youtube_watch
integer youtube_comment
integer youtube_like
integer youtube_favorite
integer youtube_subscribe
integer create_video
}
SHARED_CONTENT {
text post_url PK
text content_url PK
text media_type
integer height
integer width
}
minall.tables.links
LinksConstants
dataclass
Dataclass to manage 'links' table.
This dataclass manages the 'links' table's required column names and their data types.
Attributes:
Name | Type | Description |
---|---|---|
table_name |
str
|
Name of the table. Default = "links". |
primary_key |
str
|
Text string of primary key. Default = "url". |
pk_list |
list
|
List of primary key columns. Default = ["url"] |
dtypes |
dict
|
Key-value pairs of column names and SQLite data type descriptions. |
col_names |
list
|
List of column names. |
Source code in minall/tables/links.py
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 |
|
LinksTable
Bases: BaseTable
Class for creating, updating, and reading SQL table for target URLs.
Source code in minall/tables/links.py
76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 |
|
__init__(conn, infile, outfile, url_col=None)
In database connection, create SQL table and populate with data from target URLs dataset file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
conn |
Connection
|
SQLite connection. |
required |
infile |
Path
|
Path to URLs dataset file. |
required |
url_col |
str | None
|
Column name of target URLs. Defaults to None. |
None
|
Raises:
Type | Description |
---|---|
NoCSVHeaders
|
Dataset file does not have headers. |
KeyError
|
User did not define URL column and dataset file does not have default column 'url'. |
NoURLColumn
|
User-defined URL column not found in dataset file. |
Source code in minall/tables/links.py
83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 |
|
minall.tables.shared_content
ShareContentConstants
dataclass
Dataclass to manage 'shared_content' table.
This dataclass manages the 'shared_content' table's required column names and their data types. Being a dataclass, however, the instance of the class can also be subsequently modified to include other column names (and their data types) according to the input data. The 'shared_content' table is meant to relate to the 'links' table, wherein the former's 'post_url' column refers to the latter's 'url' column.
Contrary to the 'links' table, whose primary key column can be derived from any declared target URL column in the input data, the 'shared_content' table requires the input data has the two columns that jointly compose its primary key, 'post_url' and 'content_url.'
Attributes:
Name | Type | Description |
---|---|---|
table_name |
str
|
Name of the table. Default = "shared_content". |
primary_key |
str
|
Text string of composite primary key. Default = "post_url,content_url". |
pk_list |
list
|
List of comosite primary key columns. Default = ["post_url", "content_url] |
dtypes |
dict
|
Key-value pairs of column names and SQLite data type descriptions. |
col_names |
list
|
List of column names. |
Source code in minall/tables/shared_content.py
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
|
SharedContentTable
Bases: BaseTable
Class for creating, updating, and reading SQL table for shared media content, each of which is related to 1 or more entities in the 'links' table.
Source code in minall/tables/shared_content.py
46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 |
|
__init__(conn, infile, outfile)
In database connection, create SQL table. If the user provides an existing shared_content.csv file, populate the table with that input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
conn |
Connection
|
SQLite connection. |
required |
infile |
Path
|
Path to shared content dataset file. |
required |
Raises:
Type | Description |
---|---|
NoCSVHeaders
|
Dataset file does not have headers. |
NoPrimaryKeyColumns
|
One or more of the required columns ('post_url', 'content_url') is not in the dataset file. |
Source code in minall/tables/shared_content.py
53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 |
|
minall.tables.base
Create and execute queries on SQLite tables.
This module contains the class BaseTable
that manages the SQLite database's tables. It contains the following methods:
BaseTable
Base class for SQLite tables.
Source code in minall/tables/base.py
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 |
|
create_query: str
property
SQL statement to create table.
Examples:
>>> # Set up connection and columns / data types for table.
>>> from minall.utils.database import connect_to_database
>>> columns_n_datatypes = {"url": "TEXT", "domain": "TEXT", "work_type": "TEXT"}
>>>
>>> # Create table.
>>> table = BaseTable(name="test", pk=["url"], conn=connect_to_database(), dtypes=columns_n_datatypes, outfile=Path("test.csv"))
>>>
>>> # Compose SQL statement to create table.
>>> table.create_query
>>> 'CREATE TABLE IF NOT EXISTS test(url TEXT, domain TEXT, work_type TEXT, PRIMARY KEY (url))'
Returns:
Name | Type | Description |
---|---|---|
str |
str
|
description |
__init__(name, pk, conn, dtypes, outfile)
Create the SQL table with the given columns and data types.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name |
str
|
Table name. |
required |
pk |
List[str]
|
List of primary keys. |
required |
conn |
Connection
|
SQLite connection. |
required |
dtypes |
Dict
|
Key-value pairs of column names and data types. |
required |
outfile |
Path
|
Path to CSV file where the table will be exported. |
required |
Source code in minall/tables/base.py
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
|
coalesce_statement(cols)
Compose SQL coalesce statement from columns to be updated.
Examples:
>>> # Set up connection and columns / data types for table.
>>> from minall.utils.database import connect_to_database
>>> columns_n_datatypes = {"url": "TEXT", "domain": "TEXT", "work_type": "TEXT"}
>>>
>>> # Create table.
>>> table = BaseTable(name="test", pk=["url"], conn=connect_to_database(), dtypes=columns_n_datatypes, outfile=Path("test.csv"))
>>>
>>> # Compose SQL statement to replace table's row with new data.
>>> table.coalesce_statement(cols=columns_n_datatypes.keys())
'domain=COALESCE(excluded.domain, domain), work_type=COALESCE(excluded.work_type, work_type)'
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cols |
Iterable[str]
|
Row columns. |
required |
Returns:
Name | Type | Description |
---|---|---|
str |
str
|
SQL statement. |
Source code in minall/tables/base.py
95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 |
|
execute(query, values=None)
Function to commit a query to the database connection.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query |
str
|
SQL statement. |
required |
values |
Tuple | None
|
Values to be inserted in the query's placeholders. Defaults to None. |
None
|
Raises:
Type | Description |
---|---|
e
|
SQLite Exception. |
Source code in minall/tables/base.py
147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 |
|
export(outfile=None)
Write the SQL table to a CSV file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
outfile |
Path | None
|
Path to out-file. Defaults to None. |
None
|
Source code in minall/tables/base.py
40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 |
|
select_from(cols, filter=None)
Function to select rows from the SQL table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cols |
str
|
Target of SELECT statement. |
required |
filter |
str | None
|
Where condition to apply after FROM statement. Defaults to None. |
None
|
Raises:
Type | Description |
---|---|
e
|
SQLite Exception. |
Returns:
Name | Type | Description |
---|---|---|
List |
List
|
List of rows. |
Source code in minall/tables/base.py
167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 |
|
update_from_csv(datafile)
Reading from a CSV file, update the table rows.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datafile |
Path
|
Path to file with new data. |
required |
Source code in minall/tables/base.py
62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 |
|
minall.tables.exceptions
Exceptions for validating CSV files used to create SQLite tables.
This module contains exceptions to manage the process of validating a CSV file given as input for the enrichment process. The module contains the following exceptions:
NoCSVHeaders
- The CSV does not have headers.NoURLColumn
- When building the 'links' table, the declared URL column is not in the CSV file.NoPrimaryKeyColumns
- When building the 'shared_content' table, either the 'post_url' column or the 'content_url' column are missing from the CSV file.
When creating the 'links' table, the input CSV file must have a column for URLs; the URLs must be cleaned and/or ready to serve as the source for the data collection. The name of the URL column can vary and must be declared.
When creating the 'shared_content' table, the column names are not modifiable. The CSV must have the columns 'post_url' and 'content_url;' the former relates to a URL in the 'links' table, and the latter incidates a URL for content embedded in the Web Content of the former.
NoCSVHeaders
Bases: Exception
The CSV in-file lacks headers.
Source code in minall/tables/exceptions.py
17 18 19 20 21 22 |
|
NoPrimaryKeyColumns
Bases: Exception
The CSV in-file is missing a required column.
Source code in minall/tables/exceptions.py
33 34 35 36 37 38 |
|
NoURLColumn
Bases: Exception
The CSV in-file is missing a user-declared column.
Source code in minall/tables/exceptions.py
25 26 27 28 29 30 |
|