Batch Objects

`Batch`

from tempora.utils.batch import Batch

Class for batches generated by a Sampler.

Batch(
    data: BatchData,
    seq_lens: SeqLens,
    feature_names: list[str],
    dtypes: list[Dtype],
    targets: BatchData | None = None,
    target_seq_lens: SeqLens | None = None,
    target_names: list[str] | None = None,
    target_dtypes: list[Dtype] | None = None,
    metadata: BatchMetadata | None = None
)

Parameters

Name	Description
`data`	Batch data, in either tensor or (packed) matrix form.
`seq_lens`	Length of each data sequence in the batch.
`feature_names`	Names of the batch features/columns.
`dtypes`	Batch data dtypes.
`targets`	Optional batch targets.
`target_seq_lens`	Length of each target sequence.
`target_names`	Names of the target features/columns.
`target_dtypes`	Target dtypes.
`metadata`	Batch metadata.

Properties

Name	Description
`columns`	Alias of `feature_names`.
`feature_size`	Number of batch features/columns.
`is_numeric`	`True` if all batch features/columns are numeric.
`is_packed`	`True` if the batch is in packed form, otherwise `False`.
`pad_mask`	Boolean mask of padded values for tensor (unpacked) batches.
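
To illustrate what a padding mask encodes, here is a minimal pure-Python sketch that derives one from sequence lengths. It is not the library's implementation: the helper name `build_pad_mask` is hypothetical, and the convention that `True` marks a padded position (rather than a valid one) is an assumption to confirm against the actual `pad_mask` output.

```python
def build_pad_mask(seq_lens, max_len=None):
    """Hypothetical helper: one row of booleans per sequence.

    Assumed convention: True marks a padded position, False a real value.
    """
    if max_len is None:
        max_len = max(seq_lens, default=0)
    # Position t is padding for a sequence of length n whenever t >= n.
    return [[t >= n for t in range(max_len)] for n in seq_lens]


mask = build_pad_mask([3, 1, 2])
for row in mask:
    print(row)
```

The longest sequence gets no padded positions; every shorter sequence is marked from its own length up to the maximum length.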

Methods

Name	Description
`as_format`	Convert the batch to a different format.
`equals`	Test whether the batch is equal to another batch.
`from_parquet`	Deserialize a batch object from a parquet file.
`from_segments`	Create a batch from a list of segments.
`iter_packed`	Iterate over data examples/sequences for a packed batch.
`read_parquet`	Alias of `from_parquet`.
`select`	Create a batch with only selected features/columns.
`select_numeric`	Create a batch with numeric features only.
`to_parquet`	Serialize the batch object to a parquet file.
`to_tensor`	Alias of `unpack`.
`unpack`	Convert the batch to tensor form.

as_format

as_format(
    output_format: OutputFormat,
    pin_memory: bool = False
) -> Batch

Convert the batch to a different format.

Parameter Name	Description
`output_format`	Batch data format. Supported aliases: `'pytorch'` (`'pt'`), `'tensorflow'` (`'tf'`), `'jax'` (`'jx'`), `'numpy'` (`'np'`), `'pandas'` (`'df'`), `'pyarrow'` (`'pa'`).
`pin_memory`	For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer.
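
The alias pairs listed for `output_format` can be read as a lookup from short name to full name. The sketch below shows that reading only; `FORMAT_ALIASES` and `resolve_format` are hypothetical names, and the library may instead accept enum members or handle case differently.

```python
# Alias pairs copied from the parameter table; the lookup itself is an assumption.
FORMAT_ALIASES = {
    "pytorch": "pytorch", "pt": "pytorch",
    "tensorflow": "tensorflow", "tf": "tensorflow",
    "jax": "jax", "jx": "jax",
    "numpy": "numpy", "np": "numpy",
    "pandas": "pandas", "df": "pandas",
    "pyarrow": "pyarrow", "pa": "pyarrow",
}


def resolve_format(name: str) -> str:
    """Hypothetical helper: map a full name or short alias to the full name."""
    try:
        return FORMAT_ALIASES[name]
    except KeyError:
        raise ValueError(f"unsupported output format: {name!r}") from None


print(resolve_format("pt"))
print(resolve_format("pandas"))
```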

equals

equals(
    other: Batch,
    *,
    check_dtype_precision: bool = True,
    check_metadata: bool = True,
    verbose: bool = True
) -> bool

Test whether the batch is equal to another batch.

Parameter Name	Description
`other`	Batch to compare with.
`check_dtype_precision`	Include dtype precision in the comparison.
`check_metadata`	Include metadata in the comparison.
`verbose`	Print details when batches differ.

from_parquet

from_parquet(
    filepath: str | Path,
    filesystem: fs.FileSystem | None = None,
    fs_config: dict[str, Any] | None = None,
    as_tensor: bool = False,
    pad_value: int | float = np.nan,
    output_format: OutputFormat | None = None,
    pin_memory: bool = False
) -> Batch

Deserialize a batch object from a parquet file.

Parameter Name	Description
`filepath`	Path to the serialized batch parquet file.
`filesystem`	Optional PyArrow filesystem to read from.
`fs_config`	Filesystem configuration if `filesystem` is not provided.
`as_tensor`	Return tensor form if `True`, otherwise packed matrix form.
`pad_value`	Tensor padding value for uneven sequence lengths.
`output_format`	Batch data format. Supported aliases: `'pytorch'` (`'pt'`), `'tensorflow'` (`'tf'`), `'jax'` (`'jx'`), `'numpy'` (`'np'`), `'pandas'` (`'df'`), `'pyarrow'` (`'pa'`).
`pin_memory`	For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer.

from_segments

from_segments(
    segments: list[TSSegment],
    as_tensor: bool = False,
    pad_value: int | float = np.nan,
    output_format: OutputFormat | None = None,
    pin_memory: bool = False
) -> Batch

Create a batch from a list of segments.

Parameter Name	Description
`segments`	List of segments to combine into a batch.
`as_tensor`	Return tensor form if `True`, otherwise packed matrix form.
`pad_value`	Tensor padding value for uneven sequence lengths.
`output_format`	Batch data format. Supported aliases: `'pytorch'` (`'pt'`), `'tensorflow'` (`'tf'`), `'jax'` (`'jx'`), `'numpy'` (`'np'`), `'pandas'` (`'df'`), `'pyarrow'` (`'pa'`).
`pin_memory`	For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer.

iter_packed

iter_packed(
    seq_lens: SeqLens | None = None
) -> Iterator[slice]

Iterate over the examples/sequences of a batch in packed form, yielding one `slice` per sequence.

Parameter Name	Description
`seq_lens`	Optional sequence lengths to iterate over (defaults to `self.seq_lens`).
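
As a rough picture of how sequence lengths translate into slices, the following sketch walks a packed matrix stored as a plain list of rows. It assumes that sequences are laid out contiguously, one row per time step, in the same order as the sequence lengths; `iter_packed_slices` is a hypothetical stand-in, not the library method.

```python
from collections.abc import Iterator, Sequence


def iter_packed_slices(seq_lens: Sequence[int]) -> Iterator[slice]:
    """Hypothetical helper: yield one row slice per sequence."""
    start = 0
    for n in seq_lens:
        # Each sequence occupies the next n consecutive rows.
        yield slice(start, start + n)
        start += n


# Six rows holding three sequences of lengths 2, 3 and 1 (one feature each).
packed = [[0.0], [0.1], [1.0], [1.1], [1.2], [2.0]]
seq_lens = [2, 3, 1]
sequences = [packed[s] for s in iter_packed_slices(seq_lens)]
print(sequences)
```

Yielding slices instead of copies lets the caller index the packed data directly, whatever container it is held in.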

read_parquet

read_parquet(
    filepath: str | Path,
    filesystem: fs.FileSystem | None = None,
    fs_config: dict[str, Any] | None = None,
    as_tensor: bool = False,
    pad_value: int | float = np.nan,
    output_format: OutputFormat | None = None,
    pin_memory: bool = False
) -> Batch

Alias of `from_parquet` (for compatibility with the pandas `DataFrame` API).

Parameter Name	Description
`filepath`	Path to the serialized batch parquet file.
`filesystem`	Optional PyArrow filesystem to read from.
`fs_config`	Filesystem configuration if `filesystem` is not provided.
`as_tensor`	Return tensor form if `True`, otherwise packed matrix form.
`pad_value`	Tensor padding value for uneven sequence lengths.
`output_format`	Batch data format. Supported aliases: `'pytorch'` (`'pt'`), `'tensorflow'` (`'tf'`), `'jax'` (`'jx'`), `'numpy'` (`'np'`), `'pandas'` (`'df'`), `'pyarrow'` (`'pa'`).
`pin_memory`	For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer.

select

select(
    features: list[str]
) -> Batch

Create a new batch object consisting of only the desired features/columns.

Parameter Name	Description
`features`	Features/columns to keep.
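
Column selection by name amounts to mapping the requested names to column positions and keeping only those positions. A minimal sketch on a list-of-rows matrix, assuming that the output columns follow the order of the requested names and that unknown names are an error (both are assumptions; `select_columns` is a hypothetical helper):

```python
def select_columns(rows, feature_names, features):
    """Hypothetical helper: keep only the named columns of a row-major matrix."""
    missing = [f for f in features if f not in feature_names]
    if missing:
        raise KeyError(f"unknown features: {missing}")
    # Column positions in the order the caller asked for them.
    idx = [feature_names.index(f) for f in features]
    return [[row[i] for i in idx] for row in rows], list(features)


rows = [[1.0, 10, "a"], [2.0, 20, "b"]]
names = ["price", "volume", "symbol"]
selected, selected_names = select_columns(rows, names, ["volume", "price"])
print(selected_names, selected)
```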

select_numeric

select_numeric() -> Batch

Create a new batch object consisting of numeric features only.

to_parquet

to_parquet(
    filepath: str | Path,
    filesystem: fs.FileSystem | None = None,
    fs_config: dict[str, Any] | None = None,
    overwrite: bool = False
) -> None

Serialize the batch object to a parquet file.

Parameter Name	Description
`filepath`	Path to the parquet file to create.
`filesystem`	Optional PyArrow filesystem instance to write to.
`fs_config`	Filesystem configuration if `filesystem` is not provided.
`overwrite`	If `True`, overwrite an existing file with the same name.

to_tensor

to_tensor(
    pad_value: int | float = np.nan
) -> Batch

Alias of `unpack`.

Parameter Name	Description
`pad_value`	Tensor padding value for uneven sequence lengths.

unpack

unpack(
    pad_value: int | float = np.nan
) -> Batch

Convert the batch to tensor form.

Parameter Name	Description
`pad_value`	Tensor padding value for uneven sequence lengths.
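
The role of `pad_value` is easiest to see on a small example. The sketch below converts a packed list of rows into a padded nested list. It assumes a `(batch, max_len, features)` layout with padding appended at the end of each sequence; `unpack_rows` is a hypothetical helper and does not reflect how the library stores or pads its tensors.

```python
import math


def unpack_rows(packed, seq_lens, pad_value=math.nan):
    """Hypothetical helper: pad each sequence to the longest one in the batch."""
    if sum(seq_lens) != len(packed):
        raise ValueError("seq_lens must sum to the number of packed rows")
    max_len = max(seq_lens, default=0)
    n_features = len(packed[0]) if packed else 0
    out, start = [], 0
    for n in seq_lens:
        seq = [list(row) for row in packed[start:start + n]]
        # Append full rows of pad_value until the sequence reaches max_len.
        seq += [[pad_value] * n_features for _ in range(max_len - n)]
        out.append(seq)
        start += n
    return out


# Three packed rows, two features: sequences of length 2 and 1.
tensor = unpack_rows([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [2, 1], pad_value=0.0)
print(tensor)
```

With the default `pad_value` of NaN, padded entries are distinguishable from real zeros, but an integer column cannot represent NaN, so an explicit integer pad value may be needed for integer data.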

`read_batches`

from tempora.utils.batch import read_batches

read_batches(
    path: str | Path,
    *,
    filesystem: fs.FileSystem | None = None,
    fs_config: dict[str, Any] | None = None,
    as_tensor: bool = False,
    pad_value: int | float = np.nan,
    output_format: OutputFormat | None = None,
    pin_memory: bool = False
) -> list[Batch]

Deserialize batches from a directory of parquet files.

Parameter Name	Description
`path`	Directory path to serialized batch files.
`filesystem`	Optional PyArrow filesystem to read from.
`fs_config`	Filesystem configuration if `filesystem` is not provided.
`as_tensor`	Return tensor form if `True`, otherwise packed matrix form.
`pad_value`	Tensor padding value for uneven sequence lengths.
`output_format`	Batch data format. Supported aliases: `'pytorch'` (`'pt'`), `'tensorflow'` (`'tf'`), `'jax'` (`'jx'`), `'numpy'` (`'np'`), `'pandas'` (`'df'`), `'pyarrow'` (`'pa'`).
`pin_memory`	For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer.