Datasets

FSDataset

from tempora.datasets import FSDataset

File system-backed dataset.

FSDataset(
    source: str | Path | list[str | Path],
    file_format: Literal['parquet', 'csv', 'json', 'feather', 'orc'] | None = None,
    file_schema: pa.Schema | None = None,
    columns: list[str] | None = None,
    read_options: ReadOptions | None = None,
    parse_options: ParseOptions | None = None,
    convert_options: ConvertOptions | None = None,
    schema_inf_depth: float | None = None,
    partitioning: Literal['hive'] | Partitioning | None = None,
    fs_config: dict[str, Any] | None = None,
    dataset_name: str | None = None,
    time_column: str | None = None,
    entity_keys: list[str] | None = None,
    pivot: Pivot | None = None,
    materialize: bool = False,
    targets: bool = False
)

Parameters

Name Description
source File path(s). Use s3://, gcs://, hdfs:// for remote storage. Use server: prefix for datasets stored on the Tempora server.
file_format File format or None to infer from extensions.
file_schema PyArrow Schema for the dataset.
columns Column subset to load.
read_options Read Options object.
parse_options Parse Options object.
convert_options Convert Options object.
schema_inf_depth MiB of data to use for schema inference (default 8 MiB).
partitioning Partitioning configuration ('hive' or Partitioning Options).
fs_config File system configuration passed to the underlying PyArrow file system (e.g. LocalFileSystem, S3FileSystem, GcsFileSystem, or HadoopFileSystem, matching the source prefix).
dataset_name Optional dataset name.
time_column Name of the time/sequence column (optional).
entity_keys Primary key column(s) identifying entities.
pivot Optional Pivot settings.
materialize Whether to materialize on the server.
targets Whether this dataset contains target data.

Properties

Name Description
schema PyArrow Schema for the dataset (fetched from server).
num_rows Number of rows in the dataset.
fs_type File system type for the dataset ('server', 's3', 'gcs', 'hdfs', 'local').

Methods

Name Description
df Alias for to_pandas.
drop Drop the dataset from the server.
filter Filter the dataset.
head Return the first n rows.
join Join with another dataset.
np Alias for to_numpy.
to_arrow Return the dataset as a PyArrow Table (materialized) or RecordBatchReader.
to_numpy Return the dataset as a NumPy array (if possible).
to_pandas Return the dataset as a Pandas DataFrame.
write_dataset Write the dataset to a file system using PyArrow write_dataset.

df

df(
    use_nullable_dtypes: bool = False
) -> pandas.DataFrame

Alias for to_pandas.

Parameter Name Description
use_nullable_dtypes Use pandas nullable dtypes if True.
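
The use_nullable_dtypes flag matters mostly for integer columns with missing values. This pandas-only sketch (independent of tempora) shows the difference the nullable extension dtypes make:

```python
import pandas as pd

# With default dtypes, a missing value forces the column to float64,
# because numpy integers cannot represent NaN.
df = pd.DataFrame({"x": [1, None, 3]})

# Nullable extension dtypes keep the column integral, using pd.NA.
nullable = df.convert_dtypes()
```

With default dtypes the column is float64; with nullable dtypes it stays the integer extension type Int64.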

drop

drop() -> None

Drop the dataset from the server.

filter

filter(
    ts_filter: str | None = None,
    /,
    *,
    columns: list[str] | None = None,
    materialize: bool = False
) -> FilteredFSDataset

Filter the dataset.

Parameter Name Description
ts_filter SQL WHERE-style filter.
columns Column subset.
materialize Materialize on server if True.
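
The semantics of a WHERE-style predicate combined with a column subset can be previewed with plain pandas; this is an analogy, not tempora's implementation, and the filter string is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"ts": [1, 2, 3], "entity": ["a", "a", "b"], "value": [1.0, 2.5, 3.0]})

# Roughly what filter("value > 1.5", columns=["ts", "value"]) selects:
# rows passing the predicate, restricted to the named columns.
filtered = df.query("value > 1.5")[["ts", "value"]]
```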

head

head(
    n: int = 10,
    *,
    as_arrow: bool = False
) -> pandas.DataFrame | pyarrow.Table

Return the first n rows.

Parameter Name Description
n Rows to return.
as_arrow Return PyArrow Table if True.

join

join(
    dataset: Dataset,
    join_condition: str | list[str],
    *,
    asof_join: bool = False,
    direction: str = 'forward',
    allow_exact_matches: bool = True
) -> Dataset

Join with another dataset.

Parameter Name Description
dataset Dataset to join.
join_condition Column list or SQL-style condition.
asof_join Perform an ASOF join on the time columns if True.
direction ASOF matching direction: 'forward' or 'backward'.
allow_exact_matches If False, rows with exactly equal timestamps are not matched.
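
The direction and allow_exact_matches options behave like their pandas merge_asof counterparts. A standalone pandas sketch of the matching rules (sample data invented for illustration):

```python
import pandas as pd

trades = pd.DataFrame({"ts": [1, 5], "qty": [10, 20]})
quotes = pd.DataFrame({"ts": [2, 5, 8], "px": [100.0, 101.0, 102.0]})

# direction='forward': each left row matches the first right row with an
# equal-or-later timestamp.
fwd = pd.merge_asof(trades, quotes, on="ts", direction="forward")

# allow_exact_matches=False: a right row at the identical timestamp is
# skipped, so ts=5 matches ts=8 instead of ts=5.
strict = pd.merge_asof(
    trades, quotes, on="ts", direction="forward", allow_exact_matches=False
)
```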

np

np() -> numpy.ndarray

Alias for to_numpy.

to_arrow

to_arrow(
    materialize: bool = True
) -> pyarrow.Table | pyarrow.RecordBatchReader

Return the dataset as a PyArrow Table (materialized) or RecordBatchReader.

Parameter Name Description
materialize Return Table if True, otherwise RecordBatchReader.

to_numpy

to_numpy() -> numpy.ndarray

Return the dataset as a NumPy array (if possible).

to_pandas

to_pandas(
    use_nullable_dtypes: bool = False
) -> pandas.DataFrame

Return the dataset as a Pandas DataFrame.

Parameter Name Description
use_nullable_dtypes Use pandas nullable dtypes if True.

write_dataset

write_dataset(*args, **kwargs) -> None

Write the dataset to a file system using PyArrow write_dataset.

Parameter Name Description
*args Passed to pa.ds.write_dataset.
**kwargs Passed to pa.ds.write_dataset.

SnowflakeDataset

from tempora.datasets import SnowflakeDataset

Snowflake-backed dataset.

SnowflakeDataset(
    database: str,
    db_schema: str,
    table: str,
    time_column: str | None = None,
    entity_keys: list[str] | None = None,
    pivot: Pivot | None = None,
    materialize: bool = False,
    targets: bool = False
)

Parameters

Name Description
database Snowflake database name.
db_schema Snowflake schema name.
table Table name.
time_column Name of the time/sequence column (optional).
entity_keys Primary key column(s) identifying entities.
pivot Optional Pivot settings.
materialize Whether to materialize on the server.
targets Whether this dataset contains target data.

Properties

Name Description
schema PyArrow Schema for the dataset (fetched from server).
num_rows Number of rows in the dataset.

Methods

Name Description
df Alias for to_pandas.
drop Drop the dataset from the server.
filter Filter the dataset.
head Return the first n rows.
join Join with another dataset.
np Alias for to_numpy.
to_arrow Return the dataset as a PyArrow Table (materialized) or RecordBatchReader.
to_numpy Return the dataset as a NumPy array (if possible).
to_pandas Return the dataset as a Pandas DataFrame.
write_dataset Write the dataset to a file system using PyArrow write_dataset.

df

df(
    use_nullable_dtypes: bool = False
) -> pandas.DataFrame

Alias for to_pandas.

Parameter Name Description
use_nullable_dtypes Use pandas nullable dtypes if True.

drop

drop() -> None

Drop the dataset from the server.

filter

filter(
    ts_filter: str | None = None,
    /,
    *,
    columns: list[str] | None = None,
    materialize: bool = False
) -> FilteredSnowflakeDataset

Filter the dataset.

Parameter Name Description
ts_filter SQL WHERE-style filter.
columns Column subset.
materialize Materialize as a temporary table in Snowflake if True.

head

head(
    n: int = 10,
    *,
    as_arrow: bool = False
) -> pandas.DataFrame | pyarrow.Table

Return the first n rows.

Parameter Name Description
n Rows to return.
as_arrow Return PyArrow Table if True.

join

join(
    dataset: Dataset,
    join_condition: str | list[str],
    *,
    asof_join: bool = False,
    direction: str = 'forward',
    allow_exact_matches: bool = True
) -> Dataset

Join with another dataset.

Parameter Name Description
dataset Dataset to join.
join_condition Column list or SQL-style condition.
asof_join Perform an ASOF join on the time columns if True.
direction ASOF matching direction: 'forward' or 'backward'.
allow_exact_matches If False, rows with exactly equal timestamps are not matched.

np

np() -> numpy.ndarray

Alias for to_numpy.

to_arrow

to_arrow(
    materialize: bool = True
) -> pyarrow.Table | pyarrow.RecordBatchReader

Return the dataset as a PyArrow Table (materialized) or RecordBatchReader.

Parameter Name Description
materialize Return Table if True, otherwise RecordBatchReader.

to_numpy

to_numpy() -> numpy.ndarray

Return the dataset as a NumPy array (if possible).

to_pandas

to_pandas(
    use_nullable_dtypes: bool = False
) -> pandas.DataFrame

Return the dataset as a Pandas DataFrame.

Parameter Name Description
use_nullable_dtypes Use pandas nullable dtypes if True.

write_dataset

write_dataset(*args, **kwargs) -> None

Write the dataset to a file system using PyArrow write_dataset.

Parameter Name Description
*args Passed to pa.ds.write_dataset.
**kwargs Passed to pa.ds.write_dataset.

connect_to_snowflake

from tempora.datasets.snowflake import connect_to_snowflake

connect_to_snowflake(
    user: str | None = None,
    password: str | None = None,
    account: str | None = None,
    warehouse: str | None = None,
    session_parameters: dict[str, str] | None = None,
    connection_name: str | None = None
) -> None

Open a connection to Snowflake.

Parameter Name Description
user Login name. If not provided, uses SNOWFLAKE_USER.
password Password. If not provided, uses SNOWFLAKE_PASSWORD.
account Snowflake account identifier. If not provided, uses SNOWFLAKE_ACCOUNT.
warehouse Default warehouse name.
session_parameters Session-level parameters.
connection_name Name of a connection profile in connections.toml to load defaults from.

close_snowflake_connection

from tempora.datasets.snowflake import close_snowflake_connection

close_snowflake_connection() -> None

Close the currently open Snowflake connection.


Pivot

from tempora.datasets import Pivot

Pivot settings for datasets in EAV/long format.

Pivot(
    on: str,
    using: str,
    agg_function: str | dict[str, str],
    in_values: list[str] | None = None,
    dtype: str | np.dtype | dict[str, str | np.dtype] | None = None,
    errors: Literal['raise', 'coerce'] = 'coerce'
)

Parameters

Name Description
on Column whose values become new column names.
using Column providing values for the new columns.
agg_function Aggregate function (single function or per-column map).
in_values Restrict which values of on become columns.
dtype Target dtype(s) for new columns.
errors How to handle dtype conversion failures: 'raise' to raise an error, 'coerce' to replace unconvertible values with nulls.
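
The effect of a Pivot on EAV/long data can be previewed with pandas pivot_table; the column names below are illustrative, not tempora defaults.

```python
import pandas as pd

# Long/EAV input: one row per (timestamp, attribute) pair.
long_df = pd.DataFrame({
    "ts": [1, 1, 2, 2],
    "field": ["price", "volume", "price", "volume"],
    "reading": [10.0, 100.0, 11.0, 120.0],
})

# Roughly Pivot(on="field", using="reading", agg_function="mean"):
# distinct values of "field" become columns, filled from "reading".
wide = long_df.pivot_table(index="ts", columns="field", values="reading", aggfunc="mean")
```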