Batch Objects

Batch

from tempora.utils.batch import Batch

Class for batches generated by a Sampler.

Batch(
    data: BatchData,
    seq_lens: SeqLens,
    feature_names: list[str],
    dtypes: list[Dtype],
    targets: BatchData | None = None,
    target_seq_lens: SeqLens | None = None,
    target_names: list[str] | None = None,
    target_dtypes: list[Dtype] | None = None,
    metadata: BatchMetadata | None = None
)

Parameters

Name Description
data Batch data, in either tensor form or packed matrix form.
seq_lens Length of each data sequence in the batch.
feature_names Names of the batch features/columns.
dtypes Batch data dtypes.
targets Optional batch targets.
target_seq_lens Length of each target sequence.
target_names Names of the target features/columns.
target_dtypes Target dtypes.
metadata Batch metadata.

Properties

Name Description
columns Alias of feature_names.
feature_size Number of batch features/columns.
is_numeric True if all batch features/columns are numeric.
is_packed True if the batch is in packed form, otherwise False.
pad_mask Boolean mask of padded values for tensor (unpacked) batches.
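
The relationship between seq_lens and pad_mask can be illustrated with a small dependency-free sketch. It assumes that True marks a padded time step and that sequences are right-padded to the longest length in the batch; the source states neither convention, and pad_mask_from_lens is a hypothetical helper, not part of the library.

```python
def pad_mask_from_lens(seq_lens: list[int]) -> list[list[bool]]:
    """Build a (batch, max_len) mask where True marks a padded time step."""
    max_len = max(seq_lens, default=0)
    # Position t of a sequence is padding once t reaches that sequence's length.
    return [[t >= length for t in range(max_len)] for length in seq_lens]


# Two sequences of lengths 3 and 1, padded to a common length of 3.
mask = pad_mask_from_lens([3, 1])
```

Here the first row contains no padding and the second row is padded in its last two positions.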

Methods

Name Description
as_format Convert the batch to a different format.
equals Test whether the batch is equal to another batch.
from_parquet Deserialize a batch object from a parquet file.
from_segments Create a batch from a list of segments.
iter_packed Iterate over data examples/sequences for a packed batch.
read_parquet Alias of from_parquet.
select Create a batch with only selected features/columns.
select_numeric Create a batch with numeric features only.
to_parquet Serialize the batch object to a parquet file.
to_tensor Alias for unpack().
unpack Convert the batch to tensor form.

as_format

as_format(
    output_format: OutputFormat,
    pin_memory: bool = False
) -> Batch

Convert the batch to a different format.

Parameter Name Description
output_format Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa').
pin_memory For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer.
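
The alias pairs listed for output_format can be written as a lookup table. The sketch below is only an illustration of that mapping: normalize_format is a hypothetical helper, and the case-insensitive matching and the ValueError for unknown names are assumptions rather than documented library behavior.

```python
# Short alias -> canonical format name, as listed in the parameter description.
FORMAT_ALIASES = {
    "pt": "pytorch",
    "tf": "tensorflow",
    "jx": "jax",
    "np": "numpy",
    "df": "pandas",
    "pa": "pyarrow",
}


def normalize_format(name: str) -> str:
    """Return the canonical format name for a full name or a short alias."""
    key = name.strip().lower()
    canonical = FORMAT_ALIASES.get(key, key)
    if canonical not in FORMAT_ALIASES.values():
        raise ValueError(f"Unsupported output format: {name!r}")
    return canonical
```

With this table, normalize_format("pt") and normalize_format("pytorch") both resolve to the same canonical name.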

equals

equals(
    other: Batch,
    *,
    check_dtype_precision: bool = True,
    check_metadata: bool = True,
    verbose: bool = True
) -> bool

Test whether the batch is equal to another batch.

Parameter Name Description
other Batch to compare with.
check_dtype_precision Include dtype precision in the comparison.
check_metadata Include metadata in the comparison.
verbose Print details when batches differ.
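
One plausible reading of check_dtype_precision is that, when it is False, dtypes of the same family but different bit width compare as equal. The sketch below illustrates that reading only. It assumes dtypes can be written as NumPy-style names such as "float32"; the Dtype type is not described here, and dtypes_match is a hypothetical helper.

```python
import re


def _dtype_family(name: str) -> str:
    """Strip the trailing bit width, so "float32" and "float64" become "float"."""
    return re.sub(r"\d+$", "", name)


def dtypes_match(a: str, b: str, check_dtype_precision: bool = True) -> bool:
    """Compare two dtype names, optionally ignoring their precision."""
    if check_dtype_precision:
        return a == b
    return _dtype_family(a) == _dtype_family(b)
```

Under this assumption, "float32" and "float64" differ in a strict comparison but match once precision is ignored, while "int64" and "float64" differ either way.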

from_parquet

from_parquet(
    filepath: str | Path,
    filesystem: fs.FileSystem | None = None,
    fs_config: dict[str, Any] | None = None,
    as_tensor: bool = False,
    pad_value: int | float = np.nan,
    output_format: OutputFormat | None = None,
    pin_memory: bool = False
) -> Batch

Deserialize a batch object from a parquet file.

Parameter Name Description
filepath Path to the serialized batch parquet file.
filesystem Optional PyArrow filesystem to read from.
fs_config Filesystem configuration if filesystem is not provided.
as_tensor Return tensor form if True, otherwise packed matrix form.
pad_value Tensor padding value for uneven sequence lengths.
output_format Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa').
pin_memory For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer.

from_segments

from_segments(
    segments: list[TSSegment],
    as_tensor: bool = False,
    pad_value: int | float = np.nan,
    output_format: OutputFormat | None = None,
    pin_memory: bool = False
) -> Batch

Create a batch from a list of segments.

Parameter Name Description
segments List of segments to combine into a batch.
as_tensor Return tensor form if True, otherwise packed matrix form.
pad_value Tensor padding value for uneven sequence lengths.
output_format Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa').
pin_memory For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer.

iter_packed

iter_packed(
    seq_lens: SeqLens | None = None
) -> Iterator[slice]

Iterate over the examples/sequences of a batch in packed form, yielding one slice per sequence.

Parameter Name Description
seq_lens Optional sequence lengths to iterate over (defaults to self.seq_lens).
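
Since the method returns slices, each sequence presumably occupies a contiguous range of rows in the packed matrix. The following sketch shows how such slices can be derived from sequence lengths. It assumes the sequences are stacked along the first axis in batch order; packed_slices is a hypothetical helper, not the library's implementation.

```python
from itertools import accumulate
from typing import Iterator


def packed_slices(seq_lens: list[int]) -> Iterator[slice]:
    """Yield one row slice per sequence of a packed (row-stacked) matrix."""
    ends = list(accumulate(seq_lens))  # exclusive end row of each sequence
    starts = [0] + ends[:-1]           # each sequence starts where the previous ended
    for start, end in zip(starts, ends):
        yield slice(start, end)


# Three sequences of lengths 2, 3 and 1 stacked into six rows of one feature.
packed_rows = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
sequences = [packed_rows[s] for s in packed_slices([2, 3, 1])]
```

Indexing the packed rows with each slice recovers the individual sequences without copying the lengths into the data itself.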

read_parquet

read_parquet(
    filepath: str | Path,
    filesystem: fs.FileSystem | None = None,
    fs_config: dict[str, Any] | None = None,
    as_tensor: bool = False,
    pad_value: int | float = np.nan,
    output_format: OutputFormat | None = None,
    pin_memory: bool = False
) -> Batch

Alias of from_parquet (for compatibility with the pandas DataFrame API).

Parameter Name Description
filepath Path to the serialized batch parquet file.
filesystem Optional PyArrow filesystem to read from.
fs_config Filesystem configuration if filesystem is not provided.
as_tensor Return tensor form if True, otherwise packed matrix form.
pad_value Tensor padding value for uneven sequence lengths.
output_format Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa').
pin_memory For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer.

select

select(
    features: list[str]
) -> Batch

Create a new batch object consisting of only the desired features/columns.

Parameter Name Description
features Features/columns to keep.
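
Column selection by name can be sketched on plain row lists as follows. The sketch assumes that the output keeps the requested order and that unknown names raise an error; neither behavior is stated in the source, and select_columns is a hypothetical helper.

```python
def select_columns(rows, feature_names, features):
    """Return (rows, names) restricted to `features`, in the requested order."""
    missing = [name for name in features if name not in feature_names]
    if missing:
        raise KeyError(f"Unknown features: {missing}")
    idx = [feature_names.index(name) for name in features]
    return [[row[i] for i in idx] for row in rows], list(features)


rows = [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0]]
subset, names = select_columns(rows, ["a", "b", "c"], ["c", "a"])
```

The same index list would also be applied to the dtypes, so that names, dtypes and data columns stay aligned.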

select_numeric

select_numeric() -> Batch

Create a new batch object consisting of numeric features only.

to_parquet

to_parquet(
    filepath: str | Path,
    filesystem: fs.FileSystem | None = None,
    fs_config: dict[str, Any] | None = None,
    overwrite: bool = False
) -> None

Serialize the batch object to a parquet file.

Parameter Name Description
filepath Path to the parquet file to create.
filesystem Optional PyArrow filesystem instance to write to.
fs_config Filesystem configuration if filesystem is not provided.
overwrite If True, overwrite an existing file with the same name.

to_tensor

to_tensor(
    pad_value: int | float = np.nan
) -> Batch

Alias for unpack().

Parameter Name Description
pad_value Tensor padding value for uneven sequence lengths.

unpack

unpack(
    pad_value: int | float = np.nan
) -> Batch

Convert the batch to tensor form.

Parameter Name Description
pad_value Tensor padding value for uneven sequence lengths.
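
Converting from packed form to tensor form amounts to splitting the rows by sequence length and padding each sequence to the longest one. The sketch below uses nested Python lists in place of arrays and assumes right-padding; unpack_rows is a hypothetical helper, not the library's implementation.

```python
import math


def unpack_rows(packed, seq_lens, pad_value=math.nan):
    """Split a packed matrix into sequences and pad each to the longest length."""
    n_features = len(packed[0]) if packed else 0
    max_len = max(seq_lens, default=0)
    tensor, start = [], 0
    for length in seq_lens:
        seq = [list(row) for row in packed[start:start + length]]
        # Append padding rows so every sequence has max_len time steps.
        seq += [[pad_value] * n_features for _ in range(max_len - length)]
        tensor.append(seq)
        start += length
    return tensor


# Two sequences (lengths 2 and 1) with two features each, padded with 0.0.
packed = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
tensor = unpack_rows(packed, seq_lens=[2, 1], pad_value=0.0)
```

The result has shape (batch, max_len, features). With the default NaN padding, padded positions should be located through a mask rather than an equality test, because NaN does not compare equal to itself.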

read_batches

from tempora.utils.batch import read_batches

read_batches(
    path: str | Path,
    *,
    filesystem: fs.FileSystem | None = None,
    fs_config: dict[str, Any] | None = None,
    as_tensor: bool = False,
    pad_value: int | float = np.nan,
    output_format: OutputFormat | None = None,
    pin_memory: bool = False
) -> list[Batch]

Deserialize batches from a directory of parquet files.

Parameter Name Description
path Path to the directory containing the serialized batch parquet files.
filesystem Optional PyArrow filesystem to read from.
fs_config Filesystem configuration if filesystem is not provided.
as_tensor Return tensor form if True, otherwise packed matrix form.
pad_value Tensor padding value for uneven sequence lengths.
output_format Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa').
pin_memory For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer.