Samplers
RandomSampler
from tempora.samplers import RandomSampler
Randomly samples batches from a dataset.
RandomSampler(
context_len: int | float | dt.timedelta | np.timedelta64 | pd.Timedelta | Length,
batch_size: int = 32,
columns: list[str] | None = None,
transform_spec: TransformSpec | None = None,
target_spec: TargetSpec | None = None,
class_sampling: list[ClassSamplingSpec] | None = None,
output_format: OutputFormat = 'ndarray',
as_tensor: bool = False,
pad_value: int | float = np.nan,
options: SamplerOptions = SamplerOptions()
)
Parameters
| Name | Description |
|---|---|
| context_len | Length of each sampled context window (Length or time delta). |
| batch_size | Number of samples per batch. |
| columns | Optional feature columns to include. |
| transform_spec | Optional TransformSpec for context-window transforms. |
| target_spec | Optional TargetSpec object. |
| class_sampling | Optional list of ClassSamplingSpec entries for weighted sampling of class-based targets. Each class spec defines a SQL expression that matches rows containing the target/label, together with its sampling weight. If the class weights sum to less than 1, the remainder is treated as an implicit background class. |
| output_format | Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa'). |
| as_tensor | Convert output to tensor format where applicable. |
| pad_value | Value used to pad variable-length sequences. |
| options | Sampler options (buffering, randomness, etc.). |
Properties
| Name | Description |
|---|---|
| schema | Data schema for the sampler output (available after sampling a dataset). |
| targets_schema | Target schema if a target_spec is configured. |
Methods
| Name | Description |
|---|---|
| __call__ | Return an iterator over sampled Batch objects from a dataset. |
| write_batches | Write sampled batches to disk (local directory or cloud storage). |
__call__(
dataset: Dataset,
num_batches: int | None = None,
*,
reset: bool = False
) -> Iterator[Batch]
Return an iterator over sampled Batch objects from a dataset.
| Parameter Name | Description |
|---|---|
| dataset | Dataset to sample from. |
| num_batches | Optional maximum number of batches to yield, or None for all available. |
| reset | If True, reset the sampler's internal state before sampling. |
write_batches(
dataset: Dataset,
num_batches: int,
path: str | Path,
*,
prefix: str = 'batch_',
offset: int = 0,
filesystem: fs.FileSystem | None = None,
fs_config: dict[str, Any] | None = None,
overwrite: bool = False,
reset: bool = False
) -> None
Write sampled batches to disk (local directory or cloud storage).
| Parameter Name | Description |
|---|---|
| dataset | Dataset to sample from. |
| num_batches | Number of batches to write. |
| path | Output directory path. |
| prefix | Filename prefix for each batch. |
| offset | Starting index for batch numbering. |
| filesystem | Optional PyArrow filesystem instance to write to. |
| fs_config | Optional filesystem configuration if filesystem is not provided. |
| overwrite | Overwrite existing files if True. |
| reset | If True, reset the sampler's internal state before sampling. |
SequentialSampler
from tempora.samplers import SequentialSampler
Samples sequential batches from each entity/series.
SequentialSampler(
context_len: int | float | dt.timedelta | np.timedelta64 | pd.Timedelta | Length,
batch_size: int = 32,
columns: list[str] | None = None,
transform_spec: TransformSpec | None = None,
target_spec: TargetSpec | None = None,
output_format: OutputFormat = 'ndarray',
as_tensor: bool = False,
pad_value: int | float = np.nan,
options: SamplerOptions = SamplerOptions(),
random_start: bool = False,
random_end: bool = False
)
Parameters
| Name | Description |
|---|---|
| context_len | Length of each sampled context window (Length or time delta). |
| batch_size | Number of samples per batch. |
| columns | Optional feature columns to include. |
| transform_spec | Optional TransformSpec for context-window transforms. |
| target_spec | Optional TargetSpec object. |
| output_format | Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa'). |
| as_tensor | Convert output to tensor format where applicable. |
| pad_value | Value used to pad variable-length sequences. |
| options | Sampler options (buffering, randomness, etc.). |
| random_start | Randomize the start offset for each series. |
| random_end | Randomize the end offset for each series. |
Properties
| Name | Description |
|---|---|
| schema | Data schema for the sampler output (available after sampling a dataset). |
| targets_schema | Target schema if a target_spec is configured. |
Methods
| Name | Description |
|---|---|
| __call__ | Return an iterator over sampled Batch objects from a dataset. |
| iter_series | Iterate sampled Batch objects per series. |
| write_batches | Write sampled batches to disk (local directory or cloud storage). |
__call__(
dataset: Dataset,
num_batches: int | None = None,
*,
reset: bool = False
) -> Iterator[Batch]
Return an iterator over sampled Batch objects from a dataset.
| Parameter Name | Description |
|---|---|
| dataset | Dataset to sample from. |
| num_batches | Optional maximum number of batches to yield, or None for all available. |
| reset | If True, reset the sampler's internal state before sampling. |
iter_series(
dataset: Dataset
) -> Iterator[Iterator[Batch]]
Iterate sampled Batch objects per series.
| Parameter Name | Description |
|---|---|
| dataset | Dataset to sample from. |
write_batches(
dataset: Dataset,
num_batches: int,
path: str | Path,
*,
prefix: str = 'batch_',
offset: int = 0,
filesystem: fs.FileSystem | None = None,
fs_config: dict[str, Any] | None = None,
overwrite: bool = False,
reset: bool = False
) -> None
Write sampled batches to disk (local directory or cloud storage).
| Parameter Name | Description |
|---|---|
| dataset | Dataset to sample from. |
| num_batches | Number of batches to write. |
| path | Output directory path. |
| prefix | Filename prefix for each batch. |
| offset | Starting index for batch numbering. |
| filesystem | Optional PyArrow filesystem instance to write to. |
| fs_config | Optional filesystem configuration if filesystem is not provided. |
| overwrite | Overwrite existing files if True. |
| reset | If True, reset the sampler's internal state before sampling. |
SeriesSampler
from tempora.samplers import SeriesSampler
Samples batches from a dataset without requiring a context_len argument.
SeriesSampler(
batch_size: int = 32,
columns: list[str] | None = None,
transform_spec: TransformSpec | None = None,
target_spec: SeriesTargetSpec | None = None,
output_format: OutputFormat = 'ndarray',
as_tensor: bool = False,
pad_value: int | float = np.nan,
options: SamplerOptions = SamplerOptions()
)
Properties
| Name | Description |
|---|---|
| schema | Data schema for the sampler output (available after sampling a dataset). |
| targets_schema | Target schema if a target_spec is configured. |
Methods
| Name | Description |
|---|---|
| __call__ | Return an iterator over sampled Batch objects from a dataset. |
| write_batches | Write sampled batches to disk (local directory or cloud storage). |
__call__(
dataset: Dataset,
num_batches: int | None = None,
*,
reset: bool = False
) -> Iterator[Batch]
Return an iterator over sampled Batch objects from a dataset.
| Parameter Name | Description |
|---|---|
| dataset | Dataset to sample from. |
| num_batches | Optional maximum number of batches to yield, or None for all available. |
| reset | If True, reset the sampler's internal state before sampling. |
write_batches(
dataset: Dataset,
num_batches: int,
path: str | Path,
*,
prefix: str = 'batch_',
offset: int = 0,
filesystem: fs.FileSystem | None = None,
fs_config: dict[str, Any] | None = None,
overwrite: bool = False,
reset: bool = False
) -> None
Write sampled batches to disk (local directory or cloud storage).
| Parameter Name | Description |
|---|---|
| dataset | Dataset to sample from. |
| num_batches | Number of batches to write. |
| path | Output directory path. |
| prefix | Filename prefix for each batch. |
| offset | Starting index for batch numbering. |
| filesystem | Optional PyArrow filesystem instance to write to. |
| fs_config | Optional filesystem configuration if filesystem is not provided. |
| overwrite | Overwrite existing files if True. |
| reset | If True, reset the sampler's internal state before sampling. |
SamplerOptions
from tempora.samplers import SamplerOptions
SamplerOptions(
use_table_cache: bool = True,
incremental_table_update: bool = True,
allow_partial_segments: bool = True,
allow_null_entity_keys: bool = False,
weight_series: Literal['duration', 'inverse_duration', 'num_rows', 'inverse_num_rows'] | None = None,
left_censor_len: int | float | dt.timedelta | np.timedelta64 | pd.Timedelta | Length | None = None,
right_censor_len: int | float | dt.timedelta | np.timedelta64 | pd.Timedelta | Length | None = None,
align_on_data: bool = False,
segments_per_query: int | None = None,
segments_buffer_len: int | None = None,
max_attempts: int = 1000,
rng_seed: int = 201174
)
Options for batch samplers.
| Parameter Name | Description |
|---|---|
| use_table_cache | Use the cached sampling table for the dataset; otherwise compute and cache a new table before sampling. |
| incremental_table_update | Build the sampling table incrementally during sampling instead of computing it upfront (only for datasets partitioned on entity_keys). |
| allow_partial_segments | Allow sampling segments shorter than the sampler context length (for example, at series boundaries). |
| allow_null_entity_keys | Allow sampling segments where one or more entity key columns are null. |
| weight_series | Optional series-level sample weighting. Use 'duration' / 'num_rows' to sample in proportion to series duration (longer series yield more segments), 'inverse_duration' / 'inverse_num_rows' to favor shorter series, or None (default) for approximately equal contribution per series. |
| left_censor_len | Optionally exclude the first left_censor_len of each time series from sampling. |
| right_censor_len | Optionally exclude the last right_censor_len of each time series from sampling. |
| align_on_data | Align the start of each sampled segment to the nearest time point (useful for unevenly sampled data). |
| segments_per_query | Optional number of sampled segments to generate per SQL query. |
| segments_buffer_len | Optional number of sampled segments to buffer from the server. |
| max_attempts | Maximum number of attempts to sample a segment before raising an exception. |
| rng_seed | RNG seed for the sampler. |
ClassSamplingSpec
from tempora.samplers import ClassSamplingSpec
ClassSamplingSpec(
name: str,
expr: str,
weight: float
)
Class-level sampling specification for classification targets.
| Parameter Name | Description |
|---|---|
| name | Class identifier used for logging/debugging. Must be unique within class_sampling. |
| expr | SQL-compatible expression that matches all rows in the dataset containing the desired target/label. |
| weight | Normalized class sampling weight in (0, 1]. |