Samplers
RandomSampler
from tempora.samplers import RandomSampler
Randomly samples batches from a dataset.
RandomSampler(
context_len: int | float | dt.timedelta | np.timedelta64 | pd.Timedelta | Length,
batch_size: int = 32,
columns: list[str] | None = None,
transform_spec: TransformSpec | None = None,
target_spec: TargetSpec | None = None,
class_sampling: list[ClassSamplingSpec] | None = None,
output_format: OutputFormat = 'ndarray',
as_tensor: bool = False,
pad_value: int | float = np.nan,
options: SamplerOptions = SamplerOptions()
)
Parameters
| Name | Description |
|---|---|
| context_len | Length of each sampled context window (Length or time delta). |
| batch_size | Number of samples per batch. |
| columns | Optional feature columns to include. |
| transform_spec | Optional TransformSpec for context-window transforms. |
| target_spec | Optional TargetSpec object. |
| class_sampling | Optional list of ClassSamplingSpec entries for weighted sampling of class-based targets. Each class spec defines a SQL expression that matches rows containing the target/label, together with its sampling weight. If the class weights sum to less than 1, the remainder is treated as an implicit background class. |
| output_format | Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa'). |
| as_tensor | Convert output to tensor format where applicable. |
| pad_value | Value used to pad variable-length sequences. |
| options | Sampler options (buffering, randomness, etc.). |
Properties
| Name | Description |
|---|---|
| schema | Data schema for the sampler output (available after sampling a dataset). |
| targets_schema | Target schema if a target_spec is configured. |
Methods
| Name | Description |
|---|---|
| __call__ | Return an iterator over sampled Batch objects from a dataset. |
| write_batches | Write sampled batches to disk (local directory or cloud storage). |
__call__(
dataset: Dataset,
num_batches: int | None = None,
*,
reset: bool = False
) -> Iterator[Batch]
Return an iterator over sampled Batch objects from a dataset.
| Parameter Name | Description |
|---|---|
| dataset | Dataset to sample from. |
| num_batches | Optional maximum number of batches to yield, or None for all available. |
| reset | If True, reset the sampler's internal state before sampling. |
write_batches(
dataset: Dataset,
num_batches: int,
path: str | Path,
*,
prefix: str = 'batch_',
offset: int = 0,
filesystem: fs.FileSystem | None = None,
fs_config: dict[str, Any] | None = None,
overwrite: bool = False,
reset: bool = False
) -> None
Write sampled batches to disk (local directory or cloud storage).
| Parameter Name | Description |
|---|---|
| dataset | Dataset to sample from. |
| num_batches | Number of batches to write. |
| path | Output directory path. |
| prefix | Filename prefix for each batch. |
| offset | Starting index for batch numbering. |
| filesystem | Optional PyArrow filesystem instance to write to. |
| fs_config | Optional filesystem configuration if filesystem is not provided. |
| overwrite | Overwrite existing files if True. |
| reset | If True, reset the sampler's internal state before sampling. |
SequentialSampler
from tempora.samplers import SequentialSampler
Samples sequential batches from each entity/series.
SequentialSampler(
context_len: int | float | dt.timedelta | np.timedelta64 | pd.Timedelta | Length,
batch_size: int = 32,
columns: list[str] | None = None,
transform_spec: TransformSpec | None = None,
target_spec: TargetSpec | None = None,
output_format: OutputFormat = 'ndarray',
as_tensor: bool = False,
pad_value: int | float = np.nan,
options: SamplerOptions = SamplerOptions(),
random_start: bool = False,
random_end: bool = False
)
Parameters
| Name | Description |
|---|---|
| context_len | Length of each sampled context window (Length or time delta). |
| batch_size | Number of samples per batch. |
| columns | Optional feature columns to include. |
| transform_spec | Optional TransformSpec for context-window transforms. |
| target_spec | Optional TargetSpec object. |
| output_format | Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa'). |
| as_tensor | Convert output to tensor format where applicable. |
| pad_value | Value used to pad variable-length sequences. |
| options | Sampler options (buffering, randomness, etc.). |
| random_start | Randomize the start offset for each series. |
| random_end | Randomize the end offset for each series. |
Properties
| Name | Description |
|---|---|
| schema | Data schema for the sampler output (available after sampling a dataset). |
| targets_schema | Target schema if a target_spec is configured. |
Methods
| Name | Description |
|---|---|
| __call__ | Return an iterator over sampled Batch objects from a dataset. |
| iter_series | Iterate sampled Batch objects per series. |
| write_batches | Write sampled batches to disk (local directory or cloud storage). |
__call__(
dataset: Dataset,
num_batches: int | None = None,
*,
reset: bool = False
) -> Iterator[Batch]
Return an iterator over sampled Batch objects from a dataset.
| Parameter Name | Description |
|---|---|
| dataset | Dataset to sample from. |
| num_batches | Optional maximum number of batches to yield, or None for all available. |
| reset | If True, reset the sampler's internal state before sampling. |
iter_series(
dataset: Dataset
) -> Iterator[Iterator[Batch]]
Iterate sampled Batch objects per series.
| Parameter Name | Description |
|---|---|
| dataset | Dataset to sample from. |
write_batches(
dataset: Dataset,
num_batches: int,
path: str | Path,
*,
prefix: str = 'batch_',
offset: int = 0,
filesystem: fs.FileSystem | None = None,
fs_config: dict[str, Any] | None = None,
overwrite: bool = False,
reset: bool = False
) -> None
Write sampled batches to disk (local directory or cloud storage).
| Parameter Name | Description |
|---|---|
| dataset | Dataset to sample from. |
| num_batches | Number of batches to write. |
| path | Output directory path. |
| prefix | Filename prefix for each batch. |
| offset | Starting index for batch numbering. |
| filesystem | Optional PyArrow filesystem instance to write to. |
| fs_config | Optional filesystem configuration if filesystem is not provided. |
| overwrite | Overwrite existing files if True. |
| reset | If True, reset the sampler's internal state before sampling. |
SeriesSampler
from tempora.samplers import SeriesSampler
Samples batches from a dataset without requiring a context_len argument.
SeriesSampler(
batch_size: int = 32,
columns: list[str] | None = None,
transform_spec: TransformSpec | None = None,
target_spec: SeriesTargetSpec | None = None,
output_format: OutputFormat = 'ndarray',
as_tensor: bool = False,
pad_value: int | float = np.nan,
options: SamplerOptions = SamplerOptions()
)
Properties
| Name | Description |
|---|---|
| schema | Data schema for the sampler output (available after sampling a dataset). |
| targets_schema | Target schema if a target_spec is configured. |
Methods
| Name | Description |
|---|---|
| __call__ | Return an iterator over sampled Batch objects from a dataset. |
| write_batches | Write sampled batches to disk (local directory or cloud storage). |
__call__(
dataset: Dataset,
num_batches: int | None = None,
*,
reset: bool = False
) -> Iterator[Batch]
Return an iterator over sampled Batch objects from a dataset.
| Parameter Name | Description |
|---|---|
| dataset | Dataset to sample from. |
| num_batches | Optional maximum number of batches to yield, or None for all available. |
| reset | If True, reset the sampler's internal state before sampling. |
write_batches(
dataset: Dataset,
num_batches: int,
path: str | Path,
*,
prefix: str = 'batch_',
offset: int = 0,
filesystem: fs.FileSystem | None = None,
fs_config: dict[str, Any] | None = None,
overwrite: bool = False,
reset: bool = False
) -> None
Write sampled batches to disk (local directory or cloud storage).
| Parameter Name | Description |
|---|---|
| dataset | Dataset to sample from. |
| num_batches | Number of batches to write. |
| path | Output directory path. |
| prefix | Filename prefix for each batch. |
| offset | Starting index for batch numbering. |
| filesystem | Optional PyArrow filesystem instance to write to. |
| fs_config | Optional filesystem configuration if filesystem is not provided. |
| overwrite | Overwrite existing files if True. |
| reset | If True, reset the sampler's internal state before sampling. |
SamplerOptions
from tempora.samplers import SamplerOptions
SamplerOptions(
use_table_cache: bool = True,
incremental_table_update: bool = True,
allow_partial_segments: bool = True,
allow_null_entity_keys: bool = False,
weight_series: Literal['duration', 'inverse_duration', 'num_rows', 'inverse_num_rows'] | None = None,
left_censor_len: int | float | dt.timedelta | np.timedelta64 | pd.Timedelta | Length | None = None,
right_censor_len: int | float | dt.timedelta | np.timedelta64 | pd.Timedelta | Length | None = None,
align_on_data: bool = False,
segments_per_query: int | None = None,
segments_buffer_len: int | None = None,
max_attempts: int = 1000,
rng_seed: int = 201174
)
Options for batch samplers.
| Parameter Name | Description |
|---|---|
| use_table_cache | Use the cached sampling table for the dataset; otherwise compute and cache a new table before sampling. |
| incremental_table_update | Build the sampling table incrementally during sampling instead of computing it upfront (only for datasets partitioned on entity_keys). |
| allow_partial_segments | Allow sampling segments shorter than the sampler context length (for example, at series boundaries). |
| allow_null_entity_keys | Allow sampling segments where one or more entity key columns are null. |
| weight_series | Optional series-level sample weighting. Use 'duration' / 'num_rows' to sample in proportion to series duration (longer series yield more segments), 'inverse_duration' / 'inverse_num_rows' to favor shorter series, or None (default) for approximately equal contribution per series. |
| left_censor_len | Optionally exclude the first left_censor_len of each time series from sampling. |
| right_censor_len | Optionally exclude the last right_censor_len of each time series from sampling. |
| align_on_data | Align the start of each sampled segment to the nearest time point (useful for unevenly sampled data). |
| segments_per_query | Optional number of sampled segments to generate per SQL query. |
| segments_buffer_len | Optional number of sampled segments to buffer from the server. |
| max_attempts | Maximum number of attempts to sample a segment before raising an exception. |
| rng_seed | RNG seed for the sampler. |
ClassSamplingSpec
from tempora.samplers import ClassSamplingSpec
ClassSamplingSpec(
name: str,
expr: str,
weight: float
)
Class-level sampling specification for classification targets.
| Parameter Name | Description |
|---|---|
| name | Class identifier used for logging/debugging. Must be unique within class_sampling. |
| expr | SQL-compatible expression that matches all rows in the dataset containing the desired target/label. |
| weight | Normalized class sampling weight in (0, 1]. |