Samplers

RandomSampler

from tempora.samplers import RandomSampler

Randomly samples batches from a dataset.

RandomSampler(
    context_len: int | float | dt.timedelta | np.timedelta64 | pd.Timedelta | Length,
    batch_size: int = 32,
    columns: list[str] | None = None,
    transform_spec: TransformSpec | None = None,
    target_spec: TargetSpec | None = None,
    class_sampling: list[ClassSamplingSpec] | None = None,
    output_format: OutputFormat = 'ndarray',
    as_tensor: bool = False,
    pad_value: int | float = np.nan,
    options: SamplerOptions = SamplerOptions()
)

Parameters

Name Description
context_len Length of each sampled context window (Length or time delta).
batch_size Number of samples per batch.
columns Optional feature columns to include.
transform_spec Optional TransformSpec for context-window transforms.
target_spec Optional TargetSpec object describing targets to sample alongside each context window.
class_sampling Optional list of ClassSamplingSpec entries for weighted sampling of class-based targets. Each class spec defines a SQL expression that matches rows containing the target/label together with its sampling weight. If total class weights are less than 1, the remainder is treated as an implicit background class.
output_format Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa').
as_tensor Convert output to tensor format where applicable.
pad_value Value used to pad variable-length sequences.
options Sampler options (buffering, randomness, etc.).

Properties

Name Description
schema Data schema for the sampler output (available after sampling a dataset).
targets_schema Target schema if a target_spec is configured.

Methods

Name Description
__call__ Return an iterator over sampled Batch objects from a dataset.
write_batches Write sampled batches to disk (local directory or cloud storage).

__call__

__call__(
    dataset: Dataset,
    num_batches: int | None = None,
    *,
    reset: bool = False
) -> Iterator[Batch]

Return an iterator over sampled Batch objects from a dataset.

Name Description
dataset Dataset to sample from.
num_batches Optional maximum batches to yield, or None for all available.
reset If True, reset the sampler's internal state before sampling.

write_batches

write_batches(
    dataset: Dataset,
    num_batches: int,
    path: str | Path,
    *,
    prefix: str = 'batch_',
    offset: int = 0,
    filesystem: fs.FileSystem | None = None,
    fs_config: dict[str, Any] | None = None,
    overwrite: bool = False,
    reset: bool = False
) -> None

Write sampled batches to disk (local directory or cloud storage).

Name Description
dataset Dataset to sample from.
num_batches Number of batches to write.
path Output directory path.
prefix Filename prefix for each batch.
offset Starting index for batch numbering.
filesystem Optional PyArrow filesystem instance to write to.
fs_config Optional filesystem configuration if filesystem is not provided.
overwrite Overwrite existing files if True.
reset If True, reset the sampler's internal state before sampling.

SequentialSampler

from tempora.samplers import SequentialSampler

Samples sequential batches from each entity/series.

SequentialSampler(
    context_len: int | float | dt.timedelta | np.timedelta64 | pd.Timedelta | Length,
    batch_size: int = 32,
    columns: list[str] | None = None,
    transform_spec: TransformSpec | None = None,
    target_spec: TargetSpec | None = None,
    output_format: OutputFormat = 'ndarray',
    as_tensor: bool = False,
    pad_value: int | float = np.nan,
    options: SamplerOptions = SamplerOptions(),
    random_start: bool = False,
    random_end: bool = False
)

Parameters

Name Description
context_len Length of each sampled context window (Length or time delta).
batch_size Number of samples per batch.
columns Optional feature columns to include.
transform_spec Optional TransformSpec for context-window transforms.
target_spec Optional TargetSpec object describing targets to sample alongside each context window.
output_format Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa').
as_tensor Convert output to tensor format where applicable.
pad_value Value used to pad variable-length sequences.
options Sampler options (buffering, randomness, etc.).
random_start Randomize the start offset for each series.
random_end Randomize the end offset for each series.

Properties

Name Description
schema Data schema for the sampler output (available after sampling a dataset).
targets_schema Target schema if a target_spec is configured.

Methods

Name Description
__call__ Return an iterator over sampled Batch objects from a dataset.
iter_series Iterate sampled Batch objects per series.
write_batches Write sampled batches to disk (local directory or cloud storage).

__call__

__call__(
    dataset: Dataset,
    num_batches: int | None = None,
    *,
    reset: bool = False
) -> Iterator[Batch]

Return an iterator over sampled Batch objects from a dataset.

Name Description
dataset Dataset to sample from.
num_batches Optional maximum batches to yield, or None for all available.
reset If True, reset the sampler's internal state before sampling.

iter_series

iter_series(
    dataset: Dataset
) -> Iterator[Iterator[Batch]]

Iterate sampled Batch objects per series.

Name Description
dataset Dataset to sample from.

write_batches

write_batches(
    dataset: Dataset,
    num_batches: int,
    path: str | Path,
    *,
    prefix: str = 'batch_',
    offset: int = 0,
    filesystem: fs.FileSystem | None = None,
    fs_config: dict[str, Any] | None = None,
    overwrite: bool = False,
    reset: bool = False
) -> None

Write sampled batches to disk (local directory or cloud storage).

Name Description
dataset Dataset to sample from.
num_batches Number of batches to write.
path Output directory path.
prefix Filename prefix for each batch.
offset Starting index for batch numbering.
filesystem Optional PyArrow filesystem instance to write to.
fs_config Optional filesystem configuration if filesystem is not provided.
overwrite Overwrite existing files if True.
reset If True, reset the sampler's internal state before sampling.

SeriesSampler

from tempora.samplers import SeriesSampler

Samples batches from a dataset; unlike RandomSampler and SequentialSampler, it takes no context_len argument.

SeriesSampler(
    batch_size: int = 32,
    columns: list[str] | None = None,
    transform_spec: TransformSpec | None = None,
    target_spec: SeriesTargetSpec | None = None,
    output_format: OutputFormat = 'ndarray',
    as_tensor: bool = False,
    pad_value: int | float = np.nan,
    options: SamplerOptions = SamplerOptions()
)

Parameters

Name Description
batch_size Number of samples per batch.
columns Optional feature columns to include.
transform_spec Optional TransformSpec for context-window transforms.
target_spec Optional SeriesTargetSpec object describing targets to sample per series.
output_format Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa').
as_tensor Convert output to tensor format where applicable.
pad_value Value used to pad variable-length sequences.
options Sampler options (buffering, randomness, etc.).

Properties

Name Description
schema Data schema for the sampler output (available after sampling a dataset).
targets_schema Target schema if a target_spec is configured.

Methods

Name Description
__call__ Return an iterator over sampled Batch objects from a dataset.
write_batches Write sampled batches to disk (local directory or cloud storage).

__call__

__call__(
    dataset: Dataset,
    num_batches: int | None = None,
    *,
    reset: bool = False
) -> Iterator[Batch]

Return an iterator over sampled Batch objects from a dataset.

Name Description
dataset Dataset to sample from.
num_batches Optional maximum batches to yield, or None for all available.
reset If True, reset the sampler's internal state before sampling.

write_batches

write_batches(
    dataset: Dataset,
    num_batches: int,
    path: str | Path,
    *,
    prefix: str = 'batch_',
    offset: int = 0,
    filesystem: fs.FileSystem | None = None,
    fs_config: dict[str, Any] | None = None,
    overwrite: bool = False,
    reset: bool = False
) -> None

Write sampled batches to disk (local directory or cloud storage).

Name Description
dataset Dataset to sample from.
num_batches Number of batches to write.
path Output directory path.
prefix Filename prefix for each batch.
offset Starting index for batch numbering.
filesystem Optional PyArrow filesystem instance to write to.
fs_config Optional filesystem configuration if filesystem is not provided.
overwrite Overwrite existing files if True.
reset If True, reset the sampler's internal state before sampling.

SamplerOptions

from tempora.samplers import SamplerOptions

SamplerOptions(
    use_table_cache: bool = True,
    incremental_table_update: bool = True,
    allow_partial_segments: bool = True,
    allow_null_entity_keys: bool = False,
    weight_series: Literal['duration', 'inverse_duration', 'num_rows', 'inverse_num_rows'] | None = None,
    left_censor_len: int | float | dt.timedelta | np.timedelta64 | pd.Timedelta | Length | None = None,
    right_censor_len: int | float | dt.timedelta | np.timedelta64 | pd.Timedelta | Length | None = None,
    align_on_data: bool = False,
    segments_per_query: int | None = None,
    segments_buffer_len: int | None = None,
    max_attempts: int = 1000,
    rng_seed: int = 201174
)

Options for batch samplers.

Name Description
use_table_cache Use the cached sampling table for the dataset, otherwise compute and cache a new table before sampling.
incremental_table_update Build the sampling table incrementally during sampling instead of computing it upfront (only for datasets partitioned on entity_keys).
allow_partial_segments Allow sampling segments shorter than the sampler context length (for example at series boundaries).
allow_null_entity_keys Allow sampling segments where one or more entity key columns are null.
weight_series Optional series-level sample weighting. Use 'duration' / 'num_rows' to sample in proportion to series duration (longer series yield more segments), 'inverse_duration' / 'inverse_num_rows' to favor shorter series, or None (default) for approximately equal contribution per series.
left_censor_len Optionally exclude the first left_censor_len of each time series from sampling.
right_censor_len Optionally exclude the last right_censor_len of each time series from sampling.
align_on_data Align the start of each sampled segment to the nearest time point (useful for unevenly sampled data).
segments_per_query Optional number of sampled segments to generate per SQL query.
segments_buffer_len Optional number of sampled segments to buffer from the server.
max_attempts Maximum number of attempts to sample a segment before an exception is raised.
rng_seed RNG seed for the sampler.

ClassSamplingSpec

from tempora.samplers import ClassSamplingSpec

ClassSamplingSpec(
    name: str,
    expr: str,
    weight: float
)

Class-level sampling specification for classification targets.

Name Description
name Class identifier used for logging/debugging. Must be unique within class_sampling.
expr SQL-compatible expression that matches all rows in the dataset containing the desired target/label.
weight Normalized class sampling weight in (0, 1].