Datasets
FSDataset
from tempora.datasets import FSDataset
File system-backed dataset.
FSDataset(
source: str | Path | list[str | Path],
file_format: Literal['parquet', 'csv', 'json', 'feather', 'orc'] | None = None,
file_schema: pa.Schema | None = None,
columns: list[str] | None = None,
read_options: ReadOptions | None = None,
parse_options: ParseOptions | None = None,
convert_options: ConvertOptions | None = None,
schema_inf_depth: float | None = None,
partitioning: Literal['hive'] | Partitioning | None = None,
fs_config: dict[str, Any] | None = None,
dataset_name: str | None = None,
time_column: str | None = None,
entity_keys: list[str] | None = None,
pivot: Pivot | None = None,
materialize: bool = False,
targets: bool = False
)
Parameters
| Name | Description |
|---|---|
| source | File path(s). Use s3://, gcs://, hdfs:// for remote storage. Use the server: prefix for datasets stored on the Tempora server. |
| file_format | File format, or None to infer from file extensions. |
| file_schema | PyArrow Schema for the dataset. |
| columns | Column subset to load. |
| read_options | Read Options object. |
| parse_options | Parse Options object. |
| convert_options | Convert Options object. |
| schema_inf_depth | MiB of data to use for schema inference (default 8 MiB). |
| partitioning | Partitioning configuration ('hive' or Partitioning Options). |
| fs_config | File system configuration passed to PyArrow LocalFileSystem, GcsFileSystem, or HadoopFileSystem. |
| dataset_name | Optional dataset name. |
| time_column | Name of the time/sequence column (optional). |
| entity_keys | Primary key column(s) identifying entities. |
| pivot | Optional Pivot settings. |
| materialize | Whether to materialize on the server. |
| targets | Whether this dataset contains target data. |
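A minimal construction sketch. The file paths, column names, and dataset name below are placeholders, not part of the API:

```python
from tempora.datasets import FSDataset

# Local Parquet files; format is inferred from the extensions.
sales = FSDataset(
    ["data/sales_2023.parquet", "data/sales_2024.parquet"],
    time_column="event_time",
    entity_keys=["store_id"],
    dataset_name="sales",
)

# Remote storage uses URI prefixes; here the format and partitioning
# are given explicitly.
remote = FSDataset(
    "s3://my-bucket/sales/",
    file_format="parquet",
    partitioning="hive",
)
```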
Properties
| Name | Description |
|---|---|
| schema | PyArrow Schema for the dataset (fetched from server). |
| num_rows | Number of rows in the dataset. |
| fs_type | File system type for the dataset ('server', 's3', 'gcs', 'hdfs', 'local'). |
Methods
| Name | Description |
|---|---|
| df | Alias for to_pandas. |
| drop | Drop the dataset from the server. |
| filter | Filter the dataset. |
| head | Return the first n rows. |
| join | Join with another dataset. |
| np | Alias for to_numpy. |
| to_arrow | Return the dataset as a PyArrow Table (materialized) or RecordBatchReader. |
| to_numpy | Return the dataset as a NumPy array (if possible). |
| to_pandas | Return the dataset as a Pandas DataFrame. |
| write_dataset | Write the dataset to a file system using PyArrow write_dataset. |
df(
use_nullable_dtypes: bool = False
) -> pandas.DataFrame
Alias for to_pandas.
| Parameter Name | Description |
|---|---|
| use_nullable_dtypes | Use pandas nullable dtypes if True. |
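Illustrative usage, assuming the `sales` dataset from the construction sketch above:

```python
# Pandas nullable dtypes (e.g. Int64, boolean) preserve nulls without
# promoting integer columns to float.
pdf = sales.df(use_nullable_dtypes=True)
print(pdf.dtypes)
```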
drop() -> None
Drop the dataset from the server.
filter(
ts_filter: str | None = None,
/,
*,
columns: list[str] | None = None,
materialize: bool = False
) -> FilteredFSDataset
Filter the dataset.
| Parameter Name | Description |
|---|---|
| ts_filter | SQL WHERE-style filter. |
| columns | Column subset. |
| materialize | Materialize on server if True. |
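A filtering sketch, again assuming the `sales` dataset from above; the filter text and column names are placeholders:

```python
# SQL WHERE-style filter combined with a column subset.
recent = sales.filter(
    "event_time >= '2024-01-01' AND store_id IN (1, 2, 3)",
    columns=["event_time", "store_id", "revenue"],
)
```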
head(
n: int = 10,
*,
as_arrow: bool = False
) -> pandas.DataFrame | pyarrow.Table
Return the first n rows.
| Parameter Name | Description |
|---|---|
| n | Rows to return. |
| as_arrow | Return PyArrow Table if True. |
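For example, assuming the `sales` dataset from above:

```python
# Peek at the data.
preview = sales.head()                 # pandas.DataFrame, 10 rows by default
first5 = sales.head(5, as_arrow=True)  # pyarrow.Table with 5 rows
```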
join(
dataset: Dataset,
join_condition: str | list[str],
*,
asof_join: bool = False,
direction: str = 'forward',
allow_exact_matches: bool = True
) -> Dataset
Join with another dataset.
| Parameter Name | Description |
|---|---|
| dataset | Dataset to join. |
| join_condition | Column list or SQL-style condition. |
| asof_join | Perform an ASOF join on the time columns if True. |
| direction | ASOF match direction: 'forward' or 'backward'. |
| allow_exact_matches | If False, exclude exact timestamp matches from the ASOF join. |
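A join sketch. `stores` and `weather` are hypothetical datasets (the latter with its own time column configured), and the key names are placeholders:

```python
# Equi-join on a shared key column.
joined = sales.join(stores, ["store_id"])

# ASOF join: match each sales row to the nearest earlier weather
# observation per store.
enriched = sales.join(
    weather,
    ["store_id"],
    asof_join=True,
    direction="backward",
    allow_exact_matches=True,
)
```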
np() -> numpy.ndarray
Alias for to_numpy.
to_arrow(
materialize: bool = True
) -> pyarrow.Table | pyarrow.RecordBatchReader
Return the dataset as a PyArrow Table (materialized) or RecordBatchReader.
| Parameter Name | Description |
|---|---|
| materialize | Return Table if True, otherwise RecordBatchReader. |
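For example, assuming the `sales` dataset from above:

```python
import pyarrow as pa

# Fully materialized Arrow Table.
table: pa.Table = sales.to_arrow()

# Streaming access: iterate record batches instead of materializing
# the whole dataset at once.
total_rows = 0
for batch in sales.to_arrow(materialize=False):
    total_rows += batch.num_rows
```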
to_numpy() -> numpy.ndarray
Return the dataset as a NumPy array (if possible).
to_pandas(
use_nullable_dtypes: bool = False
) -> pandas.DataFrame
Return the dataset as a Pandas DataFrame.
| Parameter Name | Description |
|---|---|
| use_nullable_dtypes | Use pandas nullable dtypes if True. |
write_dataset(*args, **kwargs) -> None
Write the dataset to a file system using PyArrow write_dataset.
| Parameter Name | Description |
|---|---|
| *args | Passed to pa.ds.write_dataset. |
| **kwargs | Passed to pa.ds.write_dataset. |
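A sketch assuming the dataset supplies the data argument itself and forwards the remaining arguments to pyarrow.dataset.write_dataset; the options below are standard PyArrow keywords and the output path is a placeholder:

```python
sales.write_dataset(
    base_dir="out/sales",
    format="parquet",
    partitioning=["store_id"],
    partitioning_flavor="hive",
    existing_data_behavior="overwrite_or_ignore",
)
```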
SnowflakeDataset
from tempora.datasets import SnowflakeDataset
Snowflake-backed dataset.
SnowflakeDataset(
database: str,
db_schema: str,
table: str,
time_column: str | None = None,
entity_keys: list[str] | None = None,
pivot: Pivot | None = None,
materialize: bool = False,
targets: bool = False
)
Parameters
| Name | Description |
|---|---|
| database | Snowflake database name. |
| db_schema | Snowflake schema name. |
| table | Table name. |
| time_column | Name of the time/sequence column (optional). |
| entity_keys | Primary key column(s) identifying entities. |
| pivot | Optional Pivot settings. |
| materialize | Whether to materialize on the server. |
| targets | Whether this dataset contains target data. |
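A minimal construction sketch; the database, schema, table, and column names are placeholders:

```python
from tempora.datasets import SnowflakeDataset

orders = SnowflakeDataset(
    database="ANALYTICS",
    db_schema="PUBLIC",
    table="ORDERS",
    time_column="ORDER_TS",
    entity_keys=["CUSTOMER_ID"],
)
```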
Properties
| Name | Description |
|---|---|
| schema | PyArrow Schema for the dataset (fetched from server). |
| num_rows | Number of rows in the dataset. |
Methods
| Name | Description |
|---|---|
| df | Alias for to_pandas. |
| drop | Drop the dataset from the server. |
| filter | Filter the dataset. |
| head | Return the first n rows. |
| join | Join with another dataset. |
| np | Alias for to_numpy. |
| to_arrow | Return the dataset as a PyArrow Table (materialized) or RecordBatchReader. |
| to_numpy | Return the dataset as a NumPy array (if possible). |
| to_pandas | Return the dataset as a Pandas DataFrame. |
| write_dataset | Write the dataset to a file system using PyArrow write_dataset. |
df(
use_nullable_dtypes: bool = False
) -> pandas.DataFrame
Alias for to_pandas.
| Parameter Name | Description |
|---|---|
| use_nullable_dtypes | Use pandas nullable dtypes if True. |
drop() -> None
Drop the dataset from the server.
filter(
ts_filter: str | None = None,
/,
*,
columns: list[str] | None = None,
materialize: bool = False
) -> FilteredSnowflakeDataset
Filter the dataset.
| Parameter Name | Description |
|---|---|
| ts_filter | SQL WHERE-style filter. |
| columns | Column subset. |
| materialize | Materialize as a temporary table in Snowflake if True. |
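A filtering sketch, assuming the `orders` dataset from the construction example above; the filter text and column names are placeholders:

```python
# Push the filter down to Snowflake and keep the result as a temporary
# table there, so later operations reuse it instead of re-filtering.
recent_orders = orders.filter(
    "ORDER_TS >= '2024-01-01'",
    columns=["ORDER_TS", "CUSTOMER_ID", "AMOUNT"],
    materialize=True,
)
```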
head(
n: int = 10,
*,
as_arrow: bool = False
) -> pandas.DataFrame | pyarrow.Table
Return the first n rows.
| Parameter Name | Description |
|---|---|
| n | Rows to return. |
| as_arrow | Return PyArrow Table if True. |
join(
dataset: Dataset,
join_condition: str | list[str],
*,
asof_join: bool = False,
direction: str = 'forward',
allow_exact_matches: bool = True
) -> Dataset
Join with another dataset.
| Parameter Name | Description |
|---|---|
| dataset | Dataset to join. |
| join_condition | Column list or SQL-style condition. |
| asof_join | Perform an ASOF join on the time columns if True. |
| direction | ASOF match direction: 'forward' or 'backward'. |
| allow_exact_matches | If False, exclude exact timestamp matches from the ASOF join. |
np() -> numpy.ndarray
Alias for to_numpy.
to_arrow(
materialize: bool = True
) -> pyarrow.Table | pyarrow.RecordBatchReader
Return the dataset as a PyArrow Table (materialized) or RecordBatchReader.
| Parameter Name | Description |
|---|---|
| materialize | Return Table if True, otherwise RecordBatchReader. |
to_numpy() -> numpy.ndarray
Return the dataset as a NumPy array (if possible).
to_pandas(
use_nullable_dtypes: bool = False
) -> pandas.DataFrame
Return the dataset as a Pandas DataFrame.
| Parameter Name | Description |
|---|---|
| use_nullable_dtypes | Use pandas nullable dtypes if True. |
write_dataset(*args, **kwargs) -> None
Write the dataset to a file system using PyArrow write_dataset.
| Parameter Name | Description |
|---|---|
| *args | Passed to pa.ds.write_dataset. |
| **kwargs | Passed to pa.ds.write_dataset. |
connect_to_snowflake
from tempora.datasets.snowflake import connect_to_snowflake
connect_to_snowflake(
user: str | None = None,
password: str | None = None,
account: str | None = None,
warehouse: str | None = None,
session_parameters: dict[str, str] | None = None,
connection_name: str | None = None
) -> None
Open a connection to Snowflake.
| Parameter Name | Description |
|---|---|
| user | Login name. If not provided, uses SNOWFLAKE_USER. |
| password | Password. If not provided, uses SNOWFLAKE_PASSWORD. |
| account | Snowflake account identifier. If not provided, uses SNOWFLAKE_ACCOUNT. |
| warehouse | Default warehouse name. |
| session_parameters | Session-level parameters. |
| connection_name | Name of a connection profile in connections.toml to load defaults from. |
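A connection sketch; the account and warehouse names are placeholders:

```python
from tempora.datasets.snowflake import (
    connect_to_snowflake,
    close_snowflake_connection,
)

# Credentials are read from SNOWFLAKE_USER / SNOWFLAKE_PASSWORD /
# SNOWFLAKE_ACCOUNT when not passed explicitly.
connect_to_snowflake(
    account="myorg-myaccount",
    warehouse="COMPUTE_WH",
)

# ... define and query SnowflakeDataset objects here ...

close_snowflake_connection()
```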
close_snowflake_connection
from tempora.datasets.snowflake import close_snowflake_connection
close_snowflake_connection() -> None
Close the currently open Snowflake connection.
Pivot
from tempora.datasets import Pivot
Pivot settings for datasets in EAV/long format.
Pivot(
on: str,
using: str,
agg_function: str | dict[str, str],
in_values: list[str] | None = None,
dtype: str | np.dtype | dict[str, str | np.dtype] | None = None,
errors: Literal['raise', 'coerce'] = 'coerce'
)
Parameters
| Name | Description |
|---|---|
| on | Column whose values become new column names. |
| using | Column providing values for the new columns. |
| agg_function | Aggregate function (single function or per-column map). |
| in_values | Restrict which values of on become columns. |
| dtype | Target dtype(s) for new columns. |
| errors | Error handling for dtype conversion. |
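A sketch of widening a long-format dataset; the column names and metric values are placeholders:

```python
from tempora.datasets import FSDataset, Pivot

# Long-format readings like (device_id, ts, metric, value) are widened
# so each selected `metric` value becomes its own column.
pivot = Pivot(
    on="metric",
    using="value",
    agg_function="mean",
    in_values=["temperature", "humidity"],
    dtype="float64",
    errors="coerce",
)

readings = FSDataset(
    "data/readings.parquet",
    time_column="ts",
    entity_keys=["device_id"],
    pivot=pivot,
)
```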