Datasets
FSDataset
from tempora.datasets import FSDataset
File system-backed dataset.
FSDataset(
source: str | Path | list[str | Path],
file_format: Literal['parquet', 'csv', 'json', 'feather', 'orc'] | None = None,
file_schema: pa.Schema | None = None,
columns: list[str] | None = None,
read_options: ReadOptions | None = None,
parse_options: ParseOptions | None = None,
convert_options: ConvertOptions | None = None,
schema_inf_depth: float | None = None,
partitioning: Literal['hive'] | Partitioning | None = None,
fs_config: dict[str, Any] | None = None,
dataset_name: str | None = None,
time_column: str | None = None,
entity_keys: list[str] | None = None,
pivot: Pivot | None = None,
materialize: bool = False,
targets: bool = False
)
Parameters
| Name | Description |
|---|---|
| source | File path(s). Use s3://, gcs://, hdfs:// for remote storage. Use the server: prefix for datasets stored on the Tempora server. |
| file_format | File format, or None to infer from file extensions. |
| file_schema | PyArrow Schema for the dataset. |
| columns | Column subset to load. |
| read_options | Read Options object. |
| parse_options | Parse Options object. |
| convert_options | Convert Options object. |
| schema_inf_depth | MiB of data to use for schema inference (default 8 MiB). |
| partitioning | Partitioning configuration ('hive' or Partitioning Options). |
| fs_config | File system configuration passed to PyArrow LocalFileSystem, GcsFileSystem, or HadoopFileSystem. |
| dataset_name | Optional dataset name. |
| time_column | Name of the time/sequence column (optional). |
| entity_keys | Primary key column(s) identifying entities. |
| pivot | Optional Pivot settings. |
| materialize | Whether to materialize on the server. |
| targets | Whether this dataset contains target data. |
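A minimal construction sketch. The file paths, column names, and dataset name below are placeholders, not part of the API:

```python
from tempora.datasets import FSDataset

# Local Parquet files; format is inferred from the extensions.
sales = FSDataset(
    ["data/sales_2023.parquet", "data/sales_2024.parquet"],
    time_column="event_time",
    entity_keys=["store_id"],
    dataset_name="sales",
)

# Remote storage uses URI prefixes; here the format and partitioning
# are given explicitly.
remote = FSDataset(
    "s3://my-bucket/sales/",
    file_format="parquet",
    partitioning="hive",
)
```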
Properties
| Name | Description |
|---|---|
| schema | PyArrow Schema for the dataset (fetched from server). |
| num_rows | Number of rows in the dataset. |
| fs_type | File system type for the dataset ('server', 's3', 'gcs', 'hdfs', 'local'). |
Methods
| Name | Description |
|---|---|
| df | Alias for to_pandas. |
| drop | Drop the dataset from the server. |
| filter | Filter the dataset. |
| head | Return the first n rows. |
| join | Join with another dataset. |
| np | Alias for to_numpy. |
| to_arrow | Return the dataset as a PyArrow Table (materialized) or RecordBatchReader. |
| to_numpy | Return the dataset as a NumPy array (if possible). |
| to_pandas | Return the dataset as a Pandas DataFrame. |
| write_dataset | Write the dataset to a file system using PyArrow write_dataset. |
df(
use_nullable_dtypes: bool = False
) -> pandas.DataFrame
Alias for to_pandas.
| Parameter Name | Description |
|---|---|
| use_nullable_dtypes | Use pandas nullable dtypes if True. |
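Illustrative usage, assuming the `sales` dataset from the construction sketch above:

```python
# Pandas nullable dtypes (e.g. Int64, boolean) preserve nulls without
# promoting integer columns to float.
pdf = sales.df(use_nullable_dtypes=True)
print(pdf.dtypes)
```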
drop() -> None
Drop the dataset from the server.
filter(
ts_filter: str | None = None,
/,
*,
columns: list[str] | None = None,
materialize: bool = False
) -> FilteredFSDataset
Filter the dataset.
| Parameter Name | Description |
|---|---|
| ts_filter | SQL WHERE-style filter. |
| columns | Column subset. |
| materialize | Materialize on server if True. |
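A filtering sketch, again assuming the `sales` dataset from above; the filter text and column names are placeholders:

```python
# SQL WHERE-style filter combined with a column subset.
recent = sales.filter(
    "event_time >= '2024-01-01' AND store_id IN (1, 2, 3)",
    columns=["event_time", "store_id", "revenue"],
)
```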
head(
n: int = 10,
*,
as_arrow: bool = False
) -> pandas.DataFrame | pyarrow.Table
Return the first n rows.
| Parameter Name | Description |
|---|---|
| n | Rows to return. |
| as_arrow | Return PyArrow Table if True. |
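For example, assuming the `sales` dataset from above:

```python
# Peek at the data.
preview = sales.head()                 # pandas.DataFrame, 10 rows by default
first5 = sales.head(5, as_arrow=True)  # pyarrow.Table with 5 rows
```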
join(
dataset: Dataset,
join_condition: str | list[str],
*,
asof_join: bool = False,
direction: str = 'forward',
allow_exact_matches: bool = True
) -> Dataset
Join with another dataset.
| Parameter Name | Description |
|---|---|
| dataset | Dataset to join. |
| join_condition | Column list or SQL-style condition. |
| asof_join | Perform an ASOF join on the time columns if True. |
| direction | ASOF match direction: 'forward' or 'backward'. |
| allow_exact_matches | If False, exclude exact timestamp matches from the ASOF join. |
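A join sketch. `stores` and `weather` are hypothetical datasets (the latter with its own time column configured), and the key names are placeholders:

```python
# Equi-join on a shared key column.
joined = sales.join(stores, ["store_id"])

# ASOF join: match each sales row to the nearest earlier weather
# observation per store.
enriched = sales.join(
    weather,
    ["store_id"],
    asof_join=True,
    direction="backward",
    allow_exact_matches=True,
)
```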
np() -> numpy.ndarray
Alias for to_numpy.
to_arrow(
materialize: bool = True
) -> pyarrow.Table | pyarrow.RecordBatchReader
Return the dataset as a PyArrow Table (materialized) or RecordBatchReader.
| Parameter Name | Description |
|---|---|
| materialize | Return Table if True, otherwise RecordBatchReader. |
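For example, assuming the `sales` dataset from above:

```python
import pyarrow as pa

# Fully materialized Arrow Table.
table: pa.Table = sales.to_arrow()

# Streaming access: iterate record batches instead of materializing
# the whole dataset at once.
total_rows = 0
for batch in sales.to_arrow(materialize=False):
    total_rows += batch.num_rows
```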
to_numpy() -> numpy.ndarray
Return the dataset as a NumPy array (if possible).
to_pandas(
use_nullable_dtypes: bool = False
) -> pandas.DataFrame
Return the dataset as a Pandas DataFrame.
| Parameter Name | Description |
|---|---|
| use_nullable_dtypes | Use pandas nullable dtypes if True. |
write_dataset(*args, **kwargs) -> None
Write the dataset to a file system using PyArrow write_dataset.
| Parameter Name | Description |
|---|---|
| *args | Passed to pa.ds.write_dataset. |
| **kwargs | Passed to pa.ds.write_dataset. |
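A sketch assuming the dataset supplies the data argument itself and forwards the remaining arguments to pyarrow.dataset.write_dataset; the options below are standard PyArrow keywords and the output path is a placeholder:

```python
sales.write_dataset(
    base_dir="out/sales",
    format="parquet",
    partitioning=["store_id"],
    partitioning_flavor="hive",
    existing_data_behavior="overwrite_or_ignore",
)
```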
SnowflakeDataset
from tempora.datasets import SnowflakeDataset
Snowflake-backed dataset.
SnowflakeDataset(
database: str,
db_schema: str,
table: str,
time_column: str | None = None,
entity_keys: list[str] | None = None,
pivot: Pivot | None = None,
materialize: bool = False,
targets: bool = False
)
Parameters
| Name | Description |
|---|---|
| database | Snowflake database name. |
| db_schema | Snowflake schema name. |
| table | Table name. |
| time_column | Name of the time/sequence column (optional). |
| entity_keys | Primary key column(s) identifying entities. |
| pivot | Optional Pivot settings. |
| materialize | Whether to materialize on the server. |
| targets | Whether this dataset contains target data. |
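A minimal construction sketch; the database, schema, table, and column names are placeholders:

```python
from tempora.datasets import SnowflakeDataset

orders = SnowflakeDataset(
    database="ANALYTICS",
    db_schema="PUBLIC",
    table="ORDERS",
    time_column="ORDER_TS",
    entity_keys=["CUSTOMER_ID"],
)
```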
Properties
| Name | Description |
|---|---|
| schema | PyArrow Schema for the dataset (fetched from server). |
| num_rows | Number of rows in the dataset. |
Methods
| Name | Description |
|---|---|
| df | Alias for to_pandas. |
| drop | Drop the dataset from the server. |
| filter | Filter the dataset. |
| head | Return the first n rows. |
| join | Join with another dataset. |
| np | Alias for to_numpy. |
| to_arrow | Return the dataset as a PyArrow Table (materialized) or RecordBatchReader. |
| to_numpy | Return the dataset as a NumPy array (if possible). |
| to_pandas | Return the dataset as a Pandas DataFrame. |
| write_dataset | Write the dataset to a file system using PyArrow write_dataset. |
df(
use_nullable_dtypes: bool = False
) -> pandas.DataFrame
Alias for to_pandas.
| Parameter Name | Description |
|---|---|
| use_nullable_dtypes | Use pandas nullable dtypes if True. |
drop() -> None
Drop the dataset from the server.
filter(
ts_filter: str | None = None,
/,
*,
columns: list[str] | None = None,
materialize: bool = False
) -> FilteredSnowflakeDataset
Filter the dataset.
| Parameter Name | Description |
|---|---|
| ts_filter | SQL WHERE-style filter. |
| columns | Column subset. |
| materialize | Materialize as a temporary table in Snowflake if True. |
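A filtering sketch, assuming the `orders` dataset from the construction example above; the filter text and column names are placeholders:

```python
# Push the filter down to Snowflake and keep the result as a temporary
# table there, so later operations reuse it instead of re-filtering.
recent_orders = orders.filter(
    "ORDER_TS >= '2024-01-01'",
    columns=["ORDER_TS", "CUSTOMER_ID", "AMOUNT"],
    materialize=True,
)
```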
head(
n: int = 10,
*,
as_arrow: bool = False
) -> pandas.DataFrame | pyarrow.Table
Return the first n rows.
| Parameter Name | Description |
|---|---|
| n | Rows to return. |
| as_arrow | Return PyArrow Table if True. |
join(
dataset: Dataset,
join_condition: str | list[str],
*,
asof_join: bool = False,
direction: str = 'forward',
allow_exact_matches: bool = True
) -> Dataset
Join with another dataset.
| Parameter Name | Description |
|---|---|
| dataset | Dataset to join. |
| join_condition | Column list or SQL-style condition. |
| asof_join | Perform an ASOF join on the time columns if True. |
| direction | ASOF match direction: 'forward' or 'backward'. |
| allow_exact_matches | If False, exclude exact timestamp matches from the ASOF join. |
np() -> numpy.ndarray
Alias for to_numpy.
to_arrow(
materialize: bool = True
) -> pyarrow.Table | pyarrow.RecordBatchReader
Return the dataset as a PyArrow Table (materialized) or RecordBatchReader.
| Parameter Name | Description |
|---|---|
| materialize | Return Table if True, otherwise RecordBatchReader. |
to_numpy() -> numpy.ndarray
Return the dataset as a NumPy array (if possible).
to_pandas(
use_nullable_dtypes: bool = False
) -> pandas.DataFrame
Return the dataset as a Pandas DataFrame.
| Parameter Name | Description |
|---|---|
| use_nullable_dtypes | Use pandas nullable dtypes if True. |
write_dataset(*args, **kwargs) -> None
Write the dataset to a file system using PyArrow write_dataset.
| Parameter Name | Description |
|---|---|
| *args | Passed to pa.ds.write_dataset. |
| **kwargs | Passed to pa.ds.write_dataset. |
connect_to_snowflake
from tempora.datasets.snowflake import connect_to_snowflake
connect_to_snowflake(
user: str | None = None,
password: str | None = None,
account: str | None = None,
warehouse: str | None = None,
session_parameters: dict[str, str] | None = None,
connection_name: str | None = None
) -> None
Open a connection to Snowflake.
| Parameter Name | Description |
|---|---|
| user | Login name. If not provided, uses SNOWFLAKE_USER. |
| password | Password. If not provided, uses SNOWFLAKE_PASSWORD. |
| account | Snowflake account identifier. If not provided, uses SNOWFLAKE_ACCOUNT. |
| warehouse | Default warehouse name. |
| session_parameters | Session-level parameters. |
| connection_name | Name of a connection profile in connections.toml to load defaults from. |
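A connection sketch; the account and warehouse names are placeholders:

```python
from tempora.datasets.snowflake import (
    connect_to_snowflake,
    close_snowflake_connection,
)

# Credentials are read from SNOWFLAKE_USER / SNOWFLAKE_PASSWORD /
# SNOWFLAKE_ACCOUNT when not passed explicitly.
connect_to_snowflake(
    account="myorg-myaccount",
    warehouse="COMPUTE_WH",
)

# ... define and query SnowflakeDataset objects here ...

close_snowflake_connection()
```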
close_snowflake_connection
from tempora.datasets.snowflake import close_snowflake_connection
close_snowflake_connection() -> None
Close the currently open Snowflake connection.
Pivot
from tempora.datasets import Pivot
Pivot settings for datasets in EAV/long format.
Pivot(
on: str,
using: str,
agg_function: str | dict[str, str],
in_values: list[str] | None = None,
dtype: str | np.dtype | dict[str, str | np.dtype] | None = None,
errors: Literal['raise', 'coerce'] = 'coerce'
)
Parameters
| Name | Description |
|---|---|
| on | Column whose values become new column names. |
| using | Column providing values for the new columns. |
| agg_function | Aggregate function (single function or per-column map). |
| in_values | Restrict which values of on become columns. |
| dtype | Target dtype(s) for new columns. |
| errors | Error handling for dtype conversion. |
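A sketch of widening a long-format dataset; the column names and metric values are placeholders:

```python
from tempora.datasets import FSDataset, Pivot

# Long-format readings like (device_id, ts, metric, value) are widened
# so each selected `metric` value becomes its own column.
pivot = Pivot(
    on="metric",
    using="value",
    agg_function="mean",
    in_values=["temperature", "humidity"],
    dtype="float64",
    errors="coerce",
)

readings = FSDataset(
    "data/readings.parquet",
    time_column="ts",
    entity_keys=["device_id"],
    pivot=pivot,
)
```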