Batch Objects
Batch
from tempora.utils.batch import Batch
Class for batches generated by a Sampler.
Batch(
data: BatchData,
seq_lens: SeqLens,
feature_names: list[str],
dtypes: list[Dtype],
targets: BatchData | None = None,
target_seq_lens: SeqLens | None = None,
target_names: list[str] | None = None,
target_dtypes: list[Dtype] | None = None,
metadata: BatchMetadata | None = None
)
Parameters
| Name | Description |
|---|---|
| data | Batch data either in tensor or (packed) matrix form. |
| seq_lens | Length of each data sequence in the batch. |
| feature_names | Names of the batch features/columns. |
| dtypes | Batch data dtypes. |
| targets | Optional batch targets. |
| target_seq_lens | Length of each target sequence. |
| target_names | Names of the target features/columns. |
| target_dtypes | Target dtypes. |
| metadata | Batch metadata. |
Properties
| Name | Description |
|---|---|
| columns | Alias of feature_names. |
| feature_size | Number of batch features/columns. |
| is_numeric | True if all batch features/columns are numeric. |
| is_packed | True if the batch is in packed form, otherwise False. |
| pad_mask | Boolean mask of padded values for tensor (unpacked) batches. |
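To make pad_mask concrete, here is a minimal pure-Python sketch of how such a mask relates to seq_lens. It is not the library's implementation: the helper name `pad_mask_from_seq_lens` is hypothetical, and the sketch assumes that True marks padded positions and that shorter sequences are padded at the end, up to the longest sequence in the batch.

```python
def pad_mask_from_seq_lens(seq_lens: list[int]) -> list[list[bool]]:
    """Hypothetical helper: True where a time step is padding (assumed convention)."""
    max_len = max(seq_lens, default=0)
    # Position t in a sequence of length n is padding when t >= n.
    return [[t >= n for t in range(max_len)] for n in seq_lens]


# Two sequences of lengths 2 and 3, padded to a common length of 3.
mask = pad_mask_from_seq_lens([2, 3])
print(mask)  # [[False, False, True], [False, False, False]]
```

If the library uses the opposite convention (True for valid values), the mask would simply be inverted.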
Methods
| Name | Description |
|---|---|
| as_format | Convert the batch to a different format. |
| equals | Test whether the batch is equal to another batch. |
| from_parquet | Deserialize a batch object from a parquet file. |
| from_segments | Create a batch from a list of segments. |
| iter_packed | Iterate over data examples/sequences for a packed batch. |
| read_parquet | Alias of from_parquet. |
| select | Create a batch with only selected features/columns. |
| select_numeric | Create a batch with numeric features only. |
| to_parquet | Serialize the batch object to a parquet file. |
| to_tensor | Alias of unpack. |
| unpack | Convert the batch to tensor form. |
as_format(
output_format: OutputFormat,
pin_memory: bool = False
) -> Batch
Convert the batch to a different format.
| Parameter Name | Description |
|---|---|
| output_format | Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa'). |
| pin_memory | For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer. |
equals(
other: Batch,
*,
check_dtype_precision: bool = True,
check_metadata: bool = True,
verbose: bool = True
) -> bool
Test whether the batch is equal to another batch.
| Parameter Name | Description |
|---|---|
| other | Batch to compare with. |
| check_dtype_precision | Include dtype precision in the comparison. |
| check_metadata | Include metadata in the comparison. |
| verbose | Print details when batches differ. |
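The effect of check_dtype_precision can be pictured with a small sketch. This is an illustration under stated assumptions, not the library's comparison logic: the helper `dtypes_match` is hypothetical, dtypes are represented as plain strings such as 'float32', and "precision" is taken to mean the trailing bit width of the dtype name.

```python
import re


def dtypes_match(a: str, b: str, check_precision: bool = True) -> bool:
    """Hypothetical helper: compare two dtype names, optionally ignoring bit width."""
    if check_precision:
        return a == b
    # Drop the trailing bit width so that 'float32' and 'float64' both become 'float'.
    kind_a = re.sub(r"\d+$", "", a)
    kind_b = re.sub(r"\d+$", "", b)
    return kind_a == kind_b


print(dtypes_match("float32", "float64"))                         # False
print(dtypes_match("float32", "float64", check_precision=False))  # True
print(dtypes_match("int64", "float64", check_precision=False))    # False
```

Under this reading, passing check_dtype_precision=False would let two batches compare equal when they differ only in numeric width, for example after a round trip that widens float32 to float64.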
from_parquet(
filepath: str | Path,
filesystem: fs.FileSystem | None = None,
fs_config: dict[str, Any] | None = None,
as_tensor: bool = False,
pad_value: int | float = np.nan,
output_format: OutputFormat | None = None,
pin_memory: bool = False
) -> Batch
Deserialize a batch object from a parquet file.
| Parameter Name | Description |
|---|---|
| filepath | Path to the serialized batch parquet file. |
| filesystem | Optional PyArrow filesystem to read from. |
| fs_config | Filesystem configuration if filesystem is not provided. |
| as_tensor | Return tensor form if True, otherwise packed matrix form. |
| pad_value | Tensor padding value for uneven sequence lengths. |
| output_format | Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa'). |
| pin_memory | For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer. |
from_segments(
segments: list[TSSegment],
as_tensor: bool = False,
pad_value: int | float = np.nan,
output_format: OutputFormat | None = None,
pin_memory: bool = False
) -> Batch
Create a batch from a list of segments.
| Parameter Name | Description |
|---|---|
| segments | List of segments to combine into a batch. |
| as_tensor | Return tensor form if True, otherwise packed matrix form. |
| pad_value | Tensor padding value for uneven sequence lengths. |
| output_format | Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa'). |
| pin_memory | For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer. |
iter_packed(
seq_lens: SeqLens | None = None
) -> Iterator[slice]
Iterate over the individual examples/sequences of a packed batch, yielding one slice per sequence.
| Parameter Name | Description |
|---|---|
| seq_lens | Optional sequence lengths to iterate over (defaults to self.seq_lens). |
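Because iter_packed returns an iterator of slice objects, it may help to see how such slices can be derived from sequence lengths. The sketch below is a stand-in rather than the library's code: `iter_packed_slices` is a hypothetical name, and it assumes the packed form stores sequences back to back along the row axis, in the same order as seq_lens.

```python
from typing import Iterator


def iter_packed_slices(seq_lens: list[int]) -> Iterator[slice]:
    """Hypothetical helper: yield one row slice per sequence in a packed matrix."""
    start = 0
    for length in seq_lens:
        yield slice(start, start + length)
        start += length


# A packed matrix with 5 rows and 2 features, holding sequences of lengths 2 and 3.
packed_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
seq_lens = [2, 3]

sequences = [packed_rows[s] for s in iter_packed_slices(seq_lens)]
print(len(sequences[0]), len(sequences[1]))  # 2 3
```

Yielding slices instead of copies lets the caller index the packed data lazily, one sequence at a time.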
read_parquet(
filepath: str | Path,
filesystem: fs.FileSystem | None = None,
fs_config: dict[str, Any] | None = None,
as_tensor: bool = False,
pad_value: int | float = np.nan,
output_format: OutputFormat | None = None,
pin_memory: bool = False
) -> Batch
Alias of from_parquet, provided for compatibility with the pandas DataFrame API.
| Parameter Name | Description |
|---|---|
| filepath | Path to the serialized batch parquet file. |
| filesystem | Optional PyArrow filesystem to read from. |
| fs_config | Filesystem configuration if filesystem is not provided. |
| as_tensor | Return tensor form if True, otherwise packed matrix form. |
| pad_value | Tensor padding value for uneven sequence lengths. |
| output_format | Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa'). |
| pin_memory | For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer. |
select(
features: list[str]
) -> Batch
Create a new batch object consisting of only the desired features/columns.
| Parameter Name | Description |
|---|---|
| features | Features/columns to keep. |
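Conceptually, selecting features amounts to keeping a subset of columns by name. The following sketch shows that idea on plain Python lists; `select_columns` is a hypothetical helper, and the assumption that columns are returned in the order requested (rather than in their original order) is mine, not something the reference states.

```python
def select_columns(
    rows: list[list[float]],
    feature_names: list[str],
    features: list[str],
) -> list[list[float]]:
    """Hypothetical helper: keep only the named columns, in the requested order."""
    # list.index raises ValueError if a requested feature does not exist.
    indices = [feature_names.index(name) for name in features]
    return [[row[i] for i in indices] for row in rows]


rows = [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0]]
feature_names = ["temp", "pressure", "humidity"]

print(select_columns(rows, feature_names, ["humidity", "temp"]))
# [[100.0, 1.0], [200.0, 2.0]]
```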
select_numeric() -> Batch
Create a new batch object consisting of numeric features only.
to_parquet(
filepath: str | Path,
filesystem: fs.FileSystem | None = None,
fs_config: dict[str, Any] | None = None,
overwrite: bool = False
) -> None
Serialize the batch object to a parquet file.
| Parameter Name | Description |
|---|---|
| filepath | Path to the parquet file to create. |
| filesystem | Optional PyArrow filesystem instance to write to. |
| fs_config | Filesystem configuration if filesystem is not provided. |
| overwrite | If True, overwrite an existing file with the same name. |
to_tensor(
pad_value: int | float = np.nan
) -> Batch
Alias for unpack().
| Parameter Name | Description |
|---|---|
| pad_value | Tensor padding value for uneven sequence lengths. |
unpack(
pad_value: int | float = np.nan
) -> Batch
Convert the batch to tensor form.
| Parameter Name | Description |
|---|---|
| pad_value | Tensor padding value for uneven sequence lengths. |
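The relationship between the packed and tensor forms, and the role of pad_value, can be sketched in pure Python. This is an illustration under assumptions rather than the library's implementation: `unpack_rows` is a hypothetical helper, the packed form is assumed to be a 2-D matrix of shape (sum of seq_lens, features), and the tensor form is assumed to have shape (batch, max sequence length, features) with padding appended at the end of shorter sequences.

```python
import math


def unpack_rows(
    packed_rows: list[list[float]],
    seq_lens: list[int],
    pad_value: float = math.nan,
) -> list[list[list[float]]]:
    """Hypothetical helper: convert packed rows to a padded 3-D nested list."""
    n_features = len(packed_rows[0]) if packed_rows else 0
    max_len = max(seq_lens, default=0)
    tensor = []
    start = 0
    for length in seq_lens:
        # Copy this sequence's rows out of the packed matrix.
        seq = [list(row) for row in packed_rows[start:start + length]]
        # Append padding rows until the sequence reaches max_len.
        seq += [[pad_value] * n_features for _ in range(max_len - length)]
        tensor.append(seq)
        start += length
    return tensor


packed_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
tensor = unpack_rows(packed_rows, seq_lens=[2, 3], pad_value=0.0)
print(tensor[0])  # [[1.0, 10.0], [2.0, 20.0], [0.0, 0.0]]
```

With the default pad_value of NaN, padded entries can be detected with `math.isnan`, which is one reason NaN is a convenient default for floating-point data; integer data would need a different sentinel.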
read_batches
from tempora.utils.batch import read_batches
read_batches(
path: str | Path,
*,
filesystem: fs.FileSystem | None = None,
fs_config: dict[str, Any] | None = None,
as_tensor: bool = False,
pad_value: int | float = np.nan,
output_format: OutputFormat | None = None,
pin_memory: bool = False
) -> list[Batch]
Deserialize batches from a directory of parquet files.
| Parameter Name | Description |
|---|---|
| path | Directory path to serialized batch files. |
| filesystem | Optional PyArrow filesystem to read from. |
| fs_config | Filesystem configuration if filesystem is not provided. |
| as_tensor | Return tensor form if True, otherwise packed matrix form. |
| pad_value | Tensor padding value for uneven sequence lengths. |
| output_format | Batch data format. Supported aliases: 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa'). |
| pin_memory | For PyTorch tensors, use page-locked CPU memory to speed up GPU transfer. |