Table tools

Backed by SQLite, the minall/tables module manages the data throughout the enrichment process, from the input dataset at the start to the updated, exported version at the end. The process relies on the following two tables:

  1. The links table, which is the backbone of the enrichment, stores the target URLs and their enriched metadata.

  2. The shared_content table, which is optional, stores the URLs of content embedded in or shared by the target URLs' web content.

As illustrated in the figure below, the two tables are related: the post_url column in the shared_content table refers to the url column in the links table. A target URL (url) in the links table can share zero or more items; depending on the URLs dataset, it may be that no entity in the links table has shared any content. Conversely, every entity in the shared_content table must relate to at least one entity in the links table, and a piece of content in the shared_content table can have been shared by one or more URLs in the links table.

erDiagram
    LINKS }|--o{ SHARED_CONTENT : shares
    LINKS {
        text url PK
        text domain
        text work_type
        text duration
        text identifier
        text date_published
        text date_modified
        text country_of_origin
        text abstract
        text keywords
        text title
        text text
        text hashtags
        text creator_type
        text creator_date_created
        text creator_location_created
        text creator_identifier
        integer creator_facebook_follow
        integer creator_facebook_subscribe
        integer creator_twitter_follow
        integer creator_youtube_subscribe
        integer creator_create_video
        text creator_name
        text creator_url
        integer facebook_comment
        integer facebook_like
        integer facebook_share
        integer pinterest_share
        integer twitter_share
        integer twitter_like
        integer tiktok_share
        integer tiktok_comment
        integer reddit_engagement
        integer youtube_watch
        integer youtube_comment
        integer youtube_like
        integer youtube_favorite
        integer youtube_subscribe
        integer create_video
    }
    SHARED_CONTENT {
        text post_url PK
        text content_url PK
        text media_type
        integer height
        integer width
    }
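
To make the relation concrete, here is a minimal, self-contained sketch of the schema above using Python's built-in sqlite3 module. The URLs and media types are purely illustrative, and the column lists are abridged; only the table names, key columns, and the REFERENCES clause come from the schema shown above.

import sqlite3

conn = sqlite3.connect(":memory:")
# Abridged versions of the two schemas shown in the diagram.
conn.execute("CREATE TABLE links(url TEXT, domain TEXT, PRIMARY KEY (url))")
conn.execute(
    """CREATE TABLE shared_content(
        post_url TEXT REFERENCES links(url) ON UPDATE CASCADE,
        content_url TEXT,
        media_type TEXT,
        height INTEGER,
        width INTEGER,
        PRIMARY KEY (post_url, content_url))"""
)
# One target URL...
conn.execute(
    "INSERT INTO links(url, domain) VALUES (?, ?)",
    ("https://example.com/post", "example.com"),
)
# ...that shares two pieces of content.
conn.executemany(
    "INSERT INTO shared_content(post_url, content_url, media_type) VALUES (?, ?, ?)",
    [
        ("https://example.com/post", "https://example.com/image.jpg", "image"),
        ("https://example.com/post", "https://example.com/clip.mp4", "video"),
    ],
)
# Join the shared items back onto their target URL.
rows = conn.execute(
    "SELECT l.url, s.content_url FROM links l JOIN shared_content s ON s.post_url = l.url"
).fetchall()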

minall.tables.links

LinksConstants dataclass

Dataclass to manage 'links' table.

This dataclass manages the 'links' table's required column names and their data types.

Attributes:

  • table_name (str): Name of the table. Default = "links".
  • primary_key (str): Text string of the primary key. Default = "url".
  • pk_list (list): List of primary key columns. Default = ["url"].
  • dtypes (dict): Key-value pairs of column names and SQLite data type descriptions.
  • col_names (list): List of column names.

Source code in minall/tables/links.py, lines 15–73
@dataclass
class LinksConstants:
    """Dataclass to manage 'links' table.

    This dataclass manages the 'links' table's required column names and their data types.

    Attributes:
        table_name (str): Name of the table. Default = "links".
        primary_key (str): Text string of primary key. Default = "url".
        pk_list (list): List of primary key columns. Default = ["url"]
        dtypes (dict): Key-value pairs of column names and SQLite data type descriptions.
        col_names (list): List of column names.
    """

    table_name: str = "links"
    primary_key: str = "url"
    pk_list = ["url"]
    dtypes = {
        "url": "TEXT",
        "domain": "TEXT",
        "work_type": "TEXT",
        "duration": "TEXT",
        "identifier": "TEXT",
        "date_published": "TEXT",
        "date_modified": "TEXT",
        "country_of_origin": "TEXT",
        "abstract": "TEXT",
        "keywords": "TEXT",
        "title": "TEXT",
        "text": "TEXT",
        "hashtags": "TEXT",
        "creator_type": "TEXT",
        "creator_date_created": "TEXT",
        "creator_location_created": "TEXT",
        "creator_identifier": "TEXT",
        "creator_facebook_follow": "INTEGER",
        "creator_facebook_subscribe": "INTEGER",
        "creator_twitter_follow": "INTEGER",
        "creator_youtube_subscribe": "INTEGER",
        "creator_create_video": "INTEGER",
        "creator_name": "TEXT",
        "creator_url": "TEXT",
        "facebook_comment": "INTEGER",
        "facebook_like": "INTEGER",
        "facebook_share": "INTEGER",
        "pinterest_share": "INTEGER",
        "twitter_share": "INTEGER",
        "twitter_like": "INTEGER",
        "tiktok_share": "INTEGER",
        "tiktok_comment": "INTEGER",
        "reddit_engagement": "INTEGER",
        "youtube_watch": "INTEGER",
        "youtube_comment": "INTEGER",
        "youtube_like": "INTEGER",
        "youtube_favorite": "INTEGER",
        "youtube_subscribe": "INTEGER",
        "create_video": "INTEGER",
    }
    col_names = dtypes.keys()
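
As a quick usage sketch (a hedged illustration, not part of the documented API), the constants can be read off the class or an instance when composing queries against the 'links' table:

from minall.tables.links import LinksConstants

constants = LinksConstants()
assert constants.table_name == "links"
assert constants.primary_key == "url"
# The standard column names and their SQLite types:
print(len(LinksConstants.dtypes), "standard columns")
print(list(LinksConstants.col_names)[:5])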

LinksTable

Bases: BaseTable

Class for creating, updating, and reading SQL table for target URLs.

Source code in minall/tables/links.py, lines 76–163
class LinksTable(BaseTable):
    """Class for creating, updating, and reading SQL table for target URLs."""

    dtypes = LinksConstants.dtypes
    name = LinksConstants.table_name
    pk_list = LinksConstants.pk_list

    def __init__(
        self, conn: Connection, infile: Path, outfile: Path, url_col: str | None = None
    ):
        """In database connection, create SQL table and populate with data from target URLs dataset file.

        Args:
            conn (Connection): SQLite connection.
            infile (Path): Path to URLs dataset file.
            outfile (Path): Path to CSV file where the table will be exported.
            url_col (str | None, optional): Column name of target URLs. Defaults to None.

        Raises:
            NoCSVHeaders: Dataset file does not have headers.
            KeyError: User did not define URL column and dataset file does not have default column 'url'.
            NoURLColumn: User-defined URL column not found in dataset file.
        """
        # Update the table's columns to include all in-file columns
        self.dtypes = self._parse_infile_columns(
            infile=infile, url_col=url_col, constant_cols=self.dtypes
        )
        # Inherit the parent base class
        super().__init__(
            name=self.name,
            pk=self.pk_list,
            dtypes=self.dtypes,
            conn=conn,
            outfile=outfile,
        )

        # Insert in-file data
        with open(infile) as f:
            reader = csv.DictReader(f)

            # Confirm the in-file is compatible with the table
            headers = reader.fieldnames
            if not headers:
                raise NoCSVHeaders()
            elif len(set(headers).difference(self.dtype_dict.keys())) > 0:
                raise KeyError()

            # Insert the in-file data
            for row in reader:
                if url_col:
                    row.update({"url": row[url_col]})
                placeholder = ", ".join(["?" for _ in range(len(row.items()))])
                cols, values = ", ".join(row.keys()), tuple(list(row.values()))
                query = """
                INSERT OR IGNORE INTO {table}({cols})
                VALUES ({placeholder})
                """.format(
                    table=self.name, cols=cols, placeholder=placeholder
                )
                self.execute(query=query, values=values)

    def _parse_infile_columns(
        self, infile: Path, constant_cols: Dict, url_col: str | None
    ) -> Dict:
        """During init method, modify table columns to include in-file's columns.

        Args:
            infile (Path): Path to URLs dataset.
            constant_cols (Dict): Key-value pairs of table's standard columns and data types.
            url_col (str | None, optional): Column name of target URLs.

        Raises:
            NoCSVHeaders: The infile does not have headers.
            KeyError: The infile does not have a recognizable URL column.
            NoURLColumn: The infile does not have the user-declared URL column.

        Returns:
            Dict: Key-value pairs of table's column names and data types.
        """
        with casanova.reader(infile) as reader:
            headers = reader.headers
        if not headers:
            raise NoCSVHeaders()
        elif not url_col and not "url" in headers:
            raise KeyError()
        elif url_col and not url_col in headers:
            raise NoURLColumn(url_col=url_col)
        dtypes = {col: "TEXT" for col in headers}
        return dtypes | constant_cols
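
A minimal usage sketch, assuming an input file input_urls.csv whose URL column is named "link" (both names are hypothetical); connect_to_database comes from minall.utils.database, as in the examples further below:

from pathlib import Path

from minall.tables.links import LinksTable
from minall.utils.database import connect_to_database

conn = connect_to_database()
table = LinksTable(
    conn=conn,
    infile=Path("input_urls.csv"),       # hypothetical input dataset
    outfile=Path("enriched_links.csv"),  # where export() will write the table
    url_col="link",                      # column holding the target URLs
)
# ... enrichment happens here ...
table.export()  # write the table's current state to the out-file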

__init__(conn, infile, outfile, url_col=None)

In database connection, create SQL table and populate with data from target URLs dataset file.

Parameters:

  • conn (Connection): SQLite connection. Required.
  • infile (Path): Path to URLs dataset file. Required.
  • outfile (Path): Path to CSV file where the table will be exported. Required.
  • url_col (str | None, optional): Column name of target URLs. Defaults to None.

Raises:

  • NoCSVHeaders: Dataset file does not have headers.
  • KeyError: User did not define a URL column and the dataset file does not have the default column 'url'.
  • NoURLColumn: User-defined URL column not found in the dataset file.

Source code in minall/tables/links.py, lines 83–134
def __init__(
    self, conn: Connection, infile: Path, outfile: Path, url_col: str | None = None
):
    """In database connection, create SQL table and populate with data from target URLs dataset file.

    Args:
        conn (Connection): SQLite connection.
        infile (Path): Path to URLs dataset file.
        outfile (Path): Path to CSV file where the table will be exported.
        url_col (str | None, optional): Column name of target URLs. Defaults to None.

    Raises:
        NoCSVHeaders: Dataset file does not have headers.
        KeyError: User did not define URL column and dataset file does not have default column 'url'.
        NoURLColumn: User-defined URL column not found in dataset file.
    """
    # Update the table's columns to include all in-file columns
    self.dtypes = self._parse_infile_columns(
        infile=infile, url_col=url_col, constant_cols=self.dtypes
    )
    # Inherit the parent base class
    super().__init__(
        name=self.name,
        pk=self.pk_list,
        dtypes=self.dtypes,
        conn=conn,
        outfile=outfile,
    )

    # Insert in-file data
    with open(infile) as f:
        reader = csv.DictReader(f)

        # Confirm the in-file is compatible with the table
        headers = reader.fieldnames
        if not headers:
            raise NoCSVHeaders()
        elif len(set(headers).difference(self.dtype_dict.keys())) > 0:
            raise KeyError()

        # Insert the in-file data
        for row in reader:
            if url_col:
                row.update({"url": row[url_col]})
            placeholder = ", ".join(["?" for _ in range(len(row.items()))])
            cols, values = ", ".join(row.keys()), tuple(list(row.values()))
            query = """
            INSERT OR IGNORE INTO {table}({cols})
            VALUES ({placeholder})
            """.format(
                table=self.name, cols=cols, placeholder=placeholder
            )
            self.execute(query=query, values=values)

minall.tables.shared_content

ShareContentConstants dataclass

Dataclass to manage 'shared_content' table.

This dataclass manages the 'shared_content' table's required column names and their data types. Being a dataclass, however, the instance of the class can also be subsequently modified to include other column names (and their data types) according to the input data. The 'shared_content' table is meant to relate to the 'links' table, wherein the former's 'post_url' column refers to the latter's 'url' column.

Contrary to the 'links' table, whose primary key column can be derived from any declared target URL column in the input data, the 'shared_content' table requires that the input data have the two columns that jointly compose its primary key: 'post_url' and 'content_url'.

Attributes:

  • table_name (str): Name of the table. Default = "shared_content".
  • primary_key (str): Text string of the composite primary key. Default = "post_url,content_url".
  • pk_list (list): List of composite primary key columns. Default = ["post_url", "content_url"].
  • dtypes (dict): Key-value pairs of column names and SQLite data type descriptions.
  • col_names (list): List of column names.

Source code in minall/tables/shared_content.py, lines 17–43
@dataclass
class ShareContentConstants:
    """Dataclass to manage 'shared_content' table.

    This dataclass manages the 'shared_content' table's required column names and their data types. Being a dataclass, however, the instance of the class can also be subsequently modified to include other column names (and their data types) according to the input data. The 'shared_content' table is meant to relate to the 'links' table, wherein the former's 'post_url' column refers to the latter's 'url' column.

    Contrary to the 'links' table, whose primary key column can be derived from any declared target URL column in the input data, the 'shared_content' table requires that the input data have the two columns that jointly compose its primary key: 'post_url' and 'content_url'.

    Attributes:
        table_name (str): Name of the table. Default = "shared_content".
        primary_key (str): Text string of composite primary key. Default = "post_url,content_url".
        pk_list (list): List of composite primary key columns. Default = ["post_url", "content_url"]
        dtypes (dict): Key-value pairs of column names and SQLite data type descriptions.
        col_names (list): List of column names.
    """

    table_name = "shared_content"
    primary_key = "post_url,content_url"
    pk_list = ["post_url", "content_url"]
    dtypes = {
        "post_url": f"TEXT REFERENCES {LinksConstants.table_name}(url) ON UPDATE CASCADE",
        "media_type": "TEXT",
        "content_url": "TEXT",
        "height": "INTEGER",
        "width": "INTEGER",
    }
    col_names = dtypes.keys()
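
For example, the foreign-key relation to the 'links' table is baked into the dtypes mapping (a small illustrative check, not part of the documented API):

from minall.tables.shared_content import ShareContentConstants

print(ShareContentConstants.pk_list)             # ['post_url', 'content_url']
print(ShareContentConstants.dtypes["post_url"])  # TEXT REFERENCES links(url) ON UPDATE CASCADE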

SharedContentTable

Bases: BaseTable

Class for creating, updating, and reading SQL table for shared media content, each of which is related to 1 or more entities in the 'links' table.

Source code in minall/tables/shared_content.py, lines 46–126
class SharedContentTable(BaseTable):
    """Class for creating, updating, and reading SQL table for shared media content, each of which is related to 1 or more entities in the 'links' table."""

    dtypes = ShareContentConstants.dtypes
    name = ShareContentConstants.table_name
    pk_list = ShareContentConstants.pk_list

    def __init__(self, conn: Connection, infile: Path | None, outfile: Path):
        """In database connection, create SQL table. If the user provides an existing shared_content.csv file, populate the table with that input.

        Args:
            conn (Connection): SQLite connection.
            infile (Path | None): Path to shared content dataset file, if any.
            outfile (Path): Path to CSV file where the table will be exported.

        Raises:
            NoCSVHeaders: Dataset file does not have headers.
            NoPrimaryKeyColumns: One or more of the required columns ('post_url', 'content_url') is not in the dataset file.
        """
        # Update the table's columns to include all in-file columns
        if infile:
            self.dtypes = self._parse_infile_columns(
                infile=infile, constant_cols=self.dtypes
            )
        # Inherit the parent base class
        super().__init__(
            name=self.name,
            pk=self.pk_list,
            dtypes=self.dtypes,
            conn=conn,
            outfile=outfile,
        )

        # Insert in-file data
        if infile:
            with open(infile) as f:
                reader = csv.DictReader(f)

                # Confirm the in-file is compatible with the table
                headers = reader.fieldnames
                if not headers:
                    raise NoCSVHeaders()
                elif len(set(headers).difference(self.dtype_dict.keys())) > 0:
                    raise KeyError()

                # Insert the in-file data
                for row in reader:
                    placeholder = ", ".join(["?" for _ in range(len(row.items()))])
                    cols, values = ", ".join(row.keys()), tuple(list(row.values()))
                    query = """
                    INSERT OR IGNORE INTO {table}({cols})
                    VALUES ({placeholder})
                    """.format(
                        table=self.name, cols=cols, placeholder=placeholder
                    )
                    self.execute(query=query, values=values)

    def _parse_infile_columns(self, infile: Path, constant_cols: Dict) -> Dict:
        """During init method, modify table columns to include in-file's columns.

        Args:
            infile (Path): Path to shared content dataset.
            constant_cols (Dict): Key-value pairs of table's standard columns and data types.

        Raises:
            NoCSVHeaders: The infile does not have headers.
            NoPrimaryKeyColumns: The infile lacks one of the required primary key columns ('post_url', 'content_url').

        Returns:
            Dict: Key-value pairs of table's column names and data types.
        """
        with casanova.reader(infile) as reader:
            headers = reader.headers
        if not headers:
            raise NoCSVHeaders()
        diff = set(self.pk_list).difference(headers)
        if len(diff) > 0:
            raise NoPrimaryKeyColumns(col=list(diff)[0])
        dtypes = {col: "TEXT" for col in headers}
        return dtypes | constant_cols
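
A hedged usage sketch: since the in-file is optional, the table can be created empty and filled later during enrichment (the file path below is hypothetical):

from pathlib import Path

from minall.tables.shared_content import SharedContentTable
from minall.utils.database import connect_to_database

conn = connect_to_database()
# With infile=None the table is created empty and populated during enrichment;
# to seed it instead, pass a CSV that has the 'post_url' and 'content_url' columns.
table = SharedContentTable(conn=conn, infile=None, outfile=Path("shared_content.csv"))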

__init__(conn, infile, outfile)

In database connection, create SQL table. If the user provides an existing shared_content.csv file, populate the table with that input.

Parameters:

  • conn (Connection): SQLite connection. Required.
  • infile (Path | None): Path to shared content dataset file, if any. Required.
  • outfile (Path): Path to CSV file where the table will be exported. Required.

Raises:

  • NoCSVHeaders: Dataset file does not have headers.
  • NoPrimaryKeyColumns: One or more of the required columns ('post_url', 'content_url') is not in the dataset file.

Source code in minall/tables/shared_content.py, lines 53–100
def __init__(self, conn: Connection, infile: Path | None, outfile: Path):
    """In database connection, create SQL table. If the user provides an existing shared_content.csv file, populate the table with that input.

    Args:
        conn (Connection): SQLite connection.
        infile (Path | None): Path to shared content dataset file, if any.
        outfile (Path): Path to CSV file where the table will be exported.

    Raises:
        NoCSVHeaders: Dataset file does not have headers.
        NoPrimaryKeyColumns: One or more of the required columns ('post_url', 'content_url') is not in the dataset file.
    """
    # Update the table's columns to include all in-file columns
    if infile:
        self.dtypes = self._parse_infile_columns(
            infile=infile, constant_cols=self.dtypes
        )
    # Inherit the parent base class
    super().__init__(
        name=self.name,
        pk=self.pk_list,
        dtypes=self.dtypes,
        conn=conn,
        outfile=outfile,
    )

    # Insert in-file data
    if infile:
        with open(infile) as f:
            reader = csv.DictReader(f)

            # Confirm the in-file is compatible with the table
            headers = reader.fieldnames
            if not headers:
                raise NoCSVHeaders()
            elif len(set(headers).difference(self.dtype_dict.keys())) > 0:
                raise KeyError()

            # Insert the in-file data
            for row in reader:
                placeholder = ", ".join(["?" for _ in range(len(row.items()))])
                cols, values = ", ".join(row.keys()), tuple(list(row.values()))
                query = """
                INSERT OR IGNORE INTO {table}({cols})
                VALUES ({placeholder})
                """.format(
                    table=self.name, cols=cols, placeholder=placeholder
                )
                self.execute(query=query, values=values)

minall.tables.base

Create and execute queries on SQLite tables.

This module contains the BaseTable class, which manages the SQLite database's tables through the following methods: export, update_from_csv, coalesce_statement, create_query, execute, and select_from.

BaseTable

Base class for SQLite tables.

Source code in minall/tables/base.py, lines 14–193
class BaseTable:
    """Base class for SQLite tables."""

    def __init__(
        self, name: str, pk: List[str], conn: Connection, dtypes: Dict, outfile: Path
    ) -> None:
        """Create the SQL table with the given columns and data types.

        Args:
            name (str): Table name.
            pk (List[str]): List of primary keys.
            conn (Connection): SQLite connection.
            dtypes (Dict): Key-value pairs of column names and data types.
            outfile (Path): Path to CSV file where the table will be exported.
        """
        self.conn = conn
        self.name = name
        self.pk_list = pk
        self.pk_str = ",".join(pk)
        self.dtype_dict = dtypes
        self.outfile = outfile

        # Create the table
        self.execute(query=f"DROP TABLE IF EXISTS {self.name}")
        self.execute(query=self.create_query)

    def export(self, outfile: Path | None = None):
        """Write the SQL table to a CSV file.

        Args:
            outfile (Path | None, optional): Path to out-file. Defaults to None.
        """
        if not outfile:
            outfile = self.outfile
        cursor = self.conn.cursor()
        headers = [
            t[1]
            for t in cursor.execute(
                "SELECT * FROM pragma_table_info('{}');".format(self.name)
            ).fetchall()
        ]
        rows = cursor.execute(f"SELECT * FROM {self.name}").fetchall()
        with open(outfile, "w") as f:
            writer = csv.writer(f)
            writer.writerow(headers)
            for row in rows:
                writer.writerow(row)

    def update_from_csv(self, datafile: Path):
        """Reading from a CSV file, update the table rows.

        Args:
            datafile (Path): Path to file with new data.
        """
        with open(datafile) as f:
            reader = csv.DictReader(f)
            for row in reader:
                placeholder = ", ".join(["?" for _ in range(len(row.items()))])
                cols_in_csv = ", ".join(row.keys())
                # Replace empty strings in values set with None
                values = []
                for v in row.values():
                    if v == "":
                        values.append(None)
                    else:
                        values.append(v)
                coalesce_stmt = self.coalesce_statement(row.keys())
                query = """
                INSERT INTO {table}({cols_in_csv})
                VALUES ({placeholder})
                ON CONFLICT ({pk})
                DO UPDATE SET {coalesce_stmt}
                """.format(
                    table=self.name,
                    cols_in_csv=cols_in_csv,
                    placeholder=placeholder,
                    pk=self.pk_str,
                    coalesce_stmt=coalesce_stmt,
                )
                self.execute(query=query, values=tuple(values))

    def coalesce_statement(self, cols: Iterable[str]) -> str:
        """Compose SQL coalesce statement from columns to be updated.

        Examples:
            >>> # Set up connection and columns / data types for table.
            >>> from minall.utils.database import connect_to_database
            >>> columns_n_datatypes = {"url": "TEXT", "domain": "TEXT", "work_type": "TEXT"}
            >>>
            >>> # Create table.
            >>> table = BaseTable(name="test", pk=["url"], conn=connect_to_database(), dtypes=columns_n_datatypes, outfile=Path("test.csv"))
            >>>
            >>> # Compose SQL statement to replace table's row with new data.
            >>> table.coalesce_statement(cols=columns_n_datatypes.keys())
            'domain=COALESCE(excluded.domain, domain), work_type=COALESCE(excluded.work_type, work_type)'

        Args:
            cols (Iterable[str]): Row columns.

        Returns:
            str: SQL statement.
        """
        return ", ".join(
            [f"{k}=COALESCE(excluded.{k}, {k})" for k in cols if k not in self.pk_list]
        )

    @property
    def create_query(self) -> str:
        """SQL statement to create table.

        Examples:
            >>> # Set up connection and columns / data types for table.
            >>> from minall.utils.database import connect_to_database
            >>> columns_n_datatypes = {"url": "TEXT", "domain": "TEXT", "work_type": "TEXT"}
            >>>
            >>> # Create table.
            >>> table = BaseTable(name="test", pk=["url"], conn=connect_to_database(), dtypes=columns_n_datatypes, outfile=Path("test.csv"))
            >>>
            >>> # Compose SQL statement to create table.
            >>> table.create_query
            'CREATE TABLE IF NOT EXISTS test(url TEXT, domain TEXT, work_type TEXT, PRIMARY KEY (url))'


        Returns:
            str: SQL statement to create the table.
        """
        cols = ", ".join([f"{k} {v}" for k, v in self.dtype_dict.items()])
        return (
            """CREATE TABLE IF NOT EXISTS {table}({cols}, PRIMARY KEY ({pk}))""".format(
                table=self.name, cols=cols, pk=self.pk_str
            )
        )

    def execute(self, query: str, values: Tuple | None = None):
        """Function to commit a query to the database connection.

        Args:
            query (str): SQL statement.
            values (Tuple | None, optional): Values to be inserted in the query's placeholders. Defaults to None.

        Raises:
            e: SQLite Exception.
        """
        cursor = self.conn.cursor()
        try:
            if values:
                cursor.execute(query, values)
            else:
                cursor.execute(query)
        except Exception as e:
            print("\n\n", query, "\n\n")
            raise e

    def select_from(self, cols: str, filter: str | None = None) -> List:
        """Function to select rows from the SQL table.

        Args:
            cols (str): Target of SELECT statement.
            filter (str | None, optional): Where condition to apply after FROM statement. Defaults to None.

        Raises:
            e: SQLite Exception.

        Returns:
            List: List of rows.
        """
        if not filter:
            filter = ""
        else:
            filter = " " + filter
        query = f"select {cols} from {self.name}{filter}"
        cursor = self.conn.cursor()
        try:
            response = cursor.execute(query)
            self.conn.commit()
        except Exception as e:
            print("\n\n", query, "\n\n")
            raise e
        else:
            return response.fetchall()
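
The interplay of update_from_csv and coalesce_statement is worth a sketch: because each non-key column is updated with COALESCE(excluded.col, col), an empty CSV cell (converted to None) never overwrites a value already in the table. A minimal example, with a hypothetical updates.csv:

from pathlib import Path

from minall.tables.base import BaseTable
from minall.utils.database import connect_to_database

table = BaseTable(
    name="test",
    pk=["url"],
    conn=connect_to_database(),
    dtypes={"url": "TEXT", "domain": "TEXT", "work_type": "TEXT"},
    outfile=Path("test.csv"),
)
table.execute(
    "INSERT INTO test(url, domain) VALUES (?, ?)",
    ("https://example.com", "example.com"),
)
# Suppose updates.csv contains:
#   url,domain,work_type
#   https://example.com,,Article
table.update_from_csv(Path("updates.csv"))
# The empty 'domain' cell was coalesced away, so the existing value survives:
# [('https://example.com', 'example.com', 'Article')]
print(table.select_from(cols="*"))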

create_query: str property

SQL statement to create table.

Examples:

>>> # Set up connection and columns / data types for table.
>>> from minall.utils.database import connect_to_database
>>> columns_n_datatypes = {"url": "TEXT", "domain": "TEXT", "work_type": "TEXT"}
>>>
>>> # Create table.
>>> table = BaseTable(name="test", pk=["url"], conn=connect_to_database(), dtypes=columns_n_datatypes, outfile=Path("test.csv"))
>>>
>>> # Compose SQL statement to create table.
>>> table.create_query
'CREATE TABLE IF NOT EXISTS test(url TEXT, domain TEXT, work_type TEXT, PRIMARY KEY (url))'

Returns:

  • str: SQL statement to create the table.

__init__(name, pk, conn, dtypes, outfile)

Create the SQL table with the given columns and data types.

Parameters:

  • name (str): Table name. Required.
  • pk (List[str]): List of primary keys. Required.
  • conn (Connection): SQLite connection. Required.
  • dtypes (Dict): Key-value pairs of column names and data types. Required.
  • outfile (Path): Path to CSV file where the table will be exported. Required.
Source code in minall/tables/base.py, lines 17–38
def __init__(
    self, name: str, pk: List[str], conn: Connection, dtypes: Dict, outfile: Path
) -> None:
    """Create the SQL table with the given columns and data types.

    Args:
        name (str): Table name.
        pk (List[str]): List of primary keys.
        conn (Connection): SQLite connection.
        dtypes (Dict): Key-value pairs of column names and data types.
        outfile (Path): Path to CSV file where the table will be exported.
    """
    self.conn = conn
    self.name = name
    self.pk_list = pk
    self.pk_str = ",".join(pk)
    self.dtype_dict = dtypes
    self.outfile = outfile

    # Create the table
    self.execute(query=f"DROP TABLE IF EXISTS {self.name}")
    self.execute(query=self.create_query)

coalesce_statement(cols)

Compose SQL coalesce statement from columns to be updated.

Examples:

>>> # Set up connection and columns / data types for table.
>>> from minall.utils.database import connect_to_database
>>> columns_n_datatypes = {"url": "TEXT", "domain": "TEXT", "work_type": "TEXT"}
>>>
>>> # Create table.
>>> table = BaseTable(name="test", pk=["url"], conn=connect_to_database(), dtypes=columns_n_datatypes, outfile=Path("test.csv"))
>>>
>>> # Compose SQL statement to replace table's row with new data.
>>> table.coalesce_statement(cols=columns_n_datatypes.keys())
'domain=COALESCE(excluded.domain, domain), work_type=COALESCE(excluded.work_type, work_type)'

Parameters:

  • cols (Iterable[str]): Row columns. Required.

Returns:

  • str: SQL statement.

Source code in minall/tables/base.py, lines 95–118
def coalesce_statement(self, cols: Iterable[str]) -> str:
    """Compose SQL coalesce statement from columns to be updated.

    Examples:
        >>> # Set up connection and columns / data types for table.
        >>> from minall.utils.database import connect_to_database
        >>> columns_n_datatypes = {"url": "TEXT", "domain": "TEXT", "work_type": "TEXT"}
        >>>
        >>> # Create table.
        >>> table = BaseTable(name="test", pk=["url"], conn=connect_to_database(), dtypes=columns_n_datatypes, outfile=Path("test.csv"))
        >>>
        >>> # Compose SQL statement to replace table's row with new data.
        >>> table.coalesce_statement(cols=columns_n_datatypes.keys())
        'domain=COALESCE(excluded.domain, domain), work_type=COALESCE(excluded.work_type, work_type)'

    Args:
        cols (Iterable[str]): Row columns.

    Returns:
        str: SQL statement.
    """
    return ", ".join(
        [f"{k}=COALESCE(excluded.{k}, {k})" for k in cols if k not in self.pk_list]
    )

execute(query, values=None)

Function to commit a query to the database connection.

Parameters:

  • query (str): SQL statement. Required.
  • values (Tuple | None, optional): Values to be inserted in the query's placeholders. Defaults to None.

Raises:

  • e: SQLite Exception.

Source code in minall/tables/base.py, lines 147–165
def execute(self, query: str, values: Tuple | None = None):
    """Function to commit a query to the database connection.

    Args:
        query (str): SQL statement.
        values (Tuple | None, optional): Values to be inserted in the query's placeholders. Defaults to None.

    Raises:
        e: SQLite Exception.
    """
    cursor = self.conn.cursor()
    try:
        if values:
            cursor.execute(query, values)
        else:
            cursor.execute(query)
    except Exception as e:
        print("\n\n", query, "\n\n")
        raise e

export(outfile=None)

Write the SQL table to a CSV file.

Parameters:

  • outfile (Path | None, optional): Path to out-file. Defaults to None.
Source code in minall/tables/base.py, lines 40–60
def export(self, outfile: Path | None = None):
    """Write the SQL table to a CSV file.

    Args:
        outfile (Path | None, optional): Path to out-file. Defaults to None.
    """
    if not outfile:
        outfile = self.outfile
    cursor = self.conn.cursor()
    headers = [
        t[1]
        for t in cursor.execute(
            "SELECT * FROM pragma_table_info('{}');".format(self.name)
        ).fetchall()
    ]
    rows = cursor.execute(f"SELECT * FROM {self.name}").fetchall()
    with open(outfile, "w") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        for row in rows:
            writer.writerow(row)

select_from(cols, filter=None)

Function to select rows from the SQL table.

Parameters:

  • cols (str): Target of SELECT statement. Required.
  • filter (str | None, optional): WHERE condition to apply after the FROM statement. Defaults to None.

Raises:

  • e: SQLite Exception.

Returns:

  • List: List of rows.

Source code in minall/tables/base.py, lines 167–193
def select_from(self, cols: str, filter: str | None = None) -> List:
    """Function to select rows from the SQL table.

    Args:
        cols (str): Target of SELECT statement.
        filter (str | None, optional): Where condition to apply after FROM statement. Defaults to None.

    Raises:
        e: SQLite Exception.

    Returns:
        List: List of rows.
    """
    if not filter:
        filter = ""
    else:
        filter = " " + filter
    query = f"select {cols} from {self.name}{filter}"
    cursor = self.conn.cursor()
    try:
        response = cursor.execute(query)
        self.conn.commit()
    except Exception as e:
        print("\n\n", query, "\n\n")
        raise e
    else:
        return response.fetchall()

update_from_csv(datafile)

Reading from a CSV file, update the table rows.

Parameters:

  • datafile (Path): Path to file with new data. Required.
Source code in minall/tables/base.py, lines 62–93
def update_from_csv(self, datafile: Path):
    """Reading from a CSV file, update the table rows.

    Args:
        datafile (Path): Path to file with new data.
    """
    with open(datafile) as f:
        reader = csv.DictReader(f)
        for row in reader:
            placeholder = ", ".join(["?" for _ in range(len(row.items()))])
            cols_in_csv = ", ".join(row.keys())
            # Replace empty strings in values set with None
            values = []
            for v in row.values():
                if v == "":
                    values.append(None)
                else:
                    values.append(v)
            coalesce_stmt = self.coalesce_statement(row.keys())
            query = """
            INSERT INTO {table}({cols_in_csv})
            VALUES ({placeholder})
            ON CONFLICT ({pk})
            DO UPDATE SET {coalesce_stmt}
            """.format(
                table=self.name,
                cols_in_csv=cols_in_csv,
                placeholder=placeholder,
                pk=self.pk_str,
                coalesce_stmt=coalesce_stmt,
            )
            self.execute(query=query, values=tuple(values))

minall.tables.exceptions

Exceptions for validating CSV files used to create SQLite tables.

This module contains exceptions to manage the process of validating a CSV file given as input for the enrichment process. The module contains the following exceptions:

  • NoCSVHeaders - The CSV does not have headers.
  • NoURLColumn - When building the 'links' table, the declared URL column is not in the CSV file.
  • NoPrimaryKeyColumns - When building the 'shared_content' table, either the 'post_url' column or the 'content_url' column is missing from the CSV file.

When creating the 'links' table, the input CSV file must have a column for URLs; the URLs must be cleaned and/or ready to serve as the source for the data collection. The name of the URL column can vary and must be declared.

When creating the 'shared_content' table, the column names are not modifiable. The CSV must have the columns 'post_url' and 'content_url'; the former relates to a URL in the 'links' table, and the latter indicates a URL for content embedded in the web content of the former.
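
A hedged sketch of how these exceptions might be handled when building the 'links' table from a malformed CSV (the file path and column name below are hypothetical):

from pathlib import Path

from minall.tables.exceptions import NoCSVHeaders, NoURLColumn
from minall.tables.links import LinksTable
from minall.utils.database import connect_to_database

try:
    LinksTable(
        conn=connect_to_database(),
        infile=Path("input_urls.csv"),
        outfile=Path("links.csv"),
        url_col="target_url",
    )
except NoCSVHeaders:
    print("The CSV file has no headers.")
except NoURLColumn as e:
    print(e)  # "The declared URL column 'target_url' is not a header in the given CSV file."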

NoCSVHeaders

Bases: Exception

The CSV in-file lacks headers.

Source code in minall/tables/exceptions.py, lines 17–22
class NoCSVHeaders(Exception):
    """The CSV in-file lacks headers."""

    def __init__(self) -> None:
        message = "No headers detected in CSV file."
        super().__init__(message)

NoPrimaryKeyColumns

Bases: Exception

The CSV in-file is missing a required column.

Source code in minall/tables/exceptions.py, lines 33–38
class NoPrimaryKeyColumns(Exception):
    """The CSV in-file is missing a required column."""

    def __init__(self, col: str) -> None:
        message = f"Required primary key column '{col}' is not a header in the given CSV file."
        super().__init__(message)

NoURLColumn

Bases: Exception

The CSV in-file is missing a user-declared column.

Source code in minall/tables/exceptions.py, lines 25–30
class NoURLColumn(Exception):
    """The CSV in-file is missing a user-declared column."""

    def __init__(self, url_col: str) -> None:
        message = f"The declared URL column '{url_col}' is not a header in the given CSV file."
        super().__init__(message)