Batch Methods & Properties

Full reference for batch instance methods and shared properties.

Methods

Name            Description
as_format       Convert the batch to a different format.
equals          Test whether the batch is equal to another batch.
from_parquet    Deserialize a batch object from a Parquet file.
iter_packed     Iterate over data examples/sequences for a packed batch.
read_parquet    Alias of from_parquet.
select          Create a batch with only selected features/columns.
select_numeric  Create a batch with numeric features only.
to_parquet      Serialize the batch object to a Parquet file.
to_tensor       Alias for unpack().
unpack          Convert the batch to tensor form.

Properties

Name          Description
columns       Alias of feature_names.
feature_size  Number of batch features/columns.
is_numeric    True if all batch features/columns are numeric.
is_packed     True if the batch is in packed form, otherwise False.
pad_mask      Boolean mask of padded values for tensor (unpacked) batches.

as_format

as_format(
    output_format: OutputFormat,
    pin_memory: bool = False
) -> Batch

Convert the batch to a different format.

Parameter Name  Description
output_format   Target batch data format. Supported values (alias in parentheses): 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa').
pin_memory      For PyTorch tensors, use page-locked (pinned) CPU memory to speed up transfer to the GPU.
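The aliases listed for output_format amount to a small lookup table. As a minimal sketch, the helper below resolves either spelling to the full format name. FORMAT_ALIASES and normalize_format are hypothetical names chosen for illustration, not part of the documented API, and how the library resolves aliases internally is not stated here.

```python
# Hypothetical helper (not part of the documented API): maps the short
# aliases listed for output_format onto the full format names.
FORMAT_ALIASES = {
    "pt": "pytorch",
    "tf": "tensorflow",
    "jx": "jax",
    "np": "numpy",
    "df": "pandas",
    "pa": "pyarrow",
}


def normalize_format(name: str) -> str:
    """Return the full format name for either a full name or an alias."""
    if name in FORMAT_ALIASES:
        return FORMAT_ALIASES[name]
    if name in FORMAT_ALIASES.values():
        return name
    raise ValueError(f"Unsupported output format: {name!r}")


print(normalize_format("pt"))     # pytorch
print(normalize_format("numpy"))  # numpy
```

Under this reading, as_format('pt') and as_format('pytorch') would request the same target format.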

equals

equals(
    other: Batch,
    *,
    check_dtype_precision: bool = True,
    check_metadata: bool = True,
    verbose: bool = True
) -> bool

Test whether the batch is equal to another batch.

Parameter Name         Description
other                  Batch to compare with.
check_dtype_precision  Include dtype precision in the comparison.
check_metadata         Include metadata in the comparison.
verbose                Print details when batches differ.
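One plausible reading of check_dtype_precision is that, when it is False, dtypes of the same family but different bit widths (for example float32 and float64) are treated as matching. That reading is an assumption and is not confirmed by this reference. The sketch below shows it on plain dtype name strings, with dtype_family and dtypes_match as hypothetical helpers.

```python
import re


def dtype_family(dtype_name: str) -> str:
    """Strip the trailing bit width, e.g. 'float32' -> 'float'."""
    return re.sub(r"\d+$", "", dtype_name)


def dtypes_match(a: str, b: str, check_dtype_precision: bool = True) -> bool:
    """Compare two dtype names, optionally ignoring their precision.

    Assumed semantics only: the library's actual comparison may differ.
    """
    if check_dtype_precision:
        return a == b
    return dtype_family(a) == dtype_family(b)


print(dtypes_match("float32", "float64"))                               # False
print(dtypes_match("float32", "float64", check_dtype_precision=False))  # True
print(dtypes_match("int64", "float64", check_dtype_precision=False))    # False
```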

from_parquet

from_parquet(
    filepath: str | Path,
    filesystem: fs.FileSystem | None = None,
    fs_config: dict[str, Any] | None = None,
    as_tensor: bool = False,
    pad_value: int | float = np.nan,
    output_format: OutputFormat | None = None,
    pin_memory: bool = False
) -> Batch

Deserialize a batch object from a Parquet file.

Parameter Name  Description
filepath        Path to the serialized batch Parquet file.
filesystem      Optional PyArrow filesystem to read from.
fs_config       Filesystem configuration, used when filesystem is not provided.
as_tensor       If True, return the batch in tensor (unpacked) form; otherwise return it in packed matrix form.
pad_value       Value used to pad sequences of uneven length in tensor form.
output_format   Batch data format. Supported values (alias in parentheses): 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa').
pin_memory      For PyTorch tensors, use page-locked (pinned) CPU memory to speed up transfer to the GPU.

iter_packed

iter_packed(
    seq_lens: SeqLens | None = None
) -> Iterator[slice]

Iterate over the data examples/sequences of a batch in packed form, yielding a slice for each one.

Parameter Name  Description
seq_lens        Optional sequence lengths to iterate over (defaults to self.seq_lens).
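To make the Iterator[slice] return type concrete, here is a minimal pure-Python sketch. It assumes a packed batch stores all sequences back to back along the first axis, so each sequence is recovered by a contiguous slice derived from seq_lens. iter_packed_slices is a hypothetical stand-in, not the library's implementation.

```python
from typing import Iterator, Sequence


def iter_packed_slices(seq_lens: Sequence[int]) -> Iterator[slice]:
    """Yield one slice per sequence, in order (hypothetical stand-in)."""
    start = 0
    for length in seq_lens:
        yield slice(start, start + length)
        start += length


# Six packed rows holding three sequences of lengths 3, 1 and 2.
packed_rows = ["a0", "a1", "a2", "b0", "c0", "c1"]
seq_lens = [3, 1, 2]

sequences = [packed_rows[s] for s in iter_packed_slices(seq_lens)]
print(sequences)  # [['a0', 'a1', 'a2'], ['b0'], ['c0', 'c1']]
```

Yielding slices rather than copies lets the caller index whichever packed array or table it holds without duplicating data.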

read_parquet

read_parquet(
    filepath: str | Path,
    filesystem: fs.FileSystem | None = None,
    fs_config: dict[str, Any] | None = None,
    as_tensor: bool = False,
    pad_value: int | float = np.nan,
    output_format: OutputFormat | None = None,
    pin_memory: bool = False
) -> Batch

Alias of from_parquet, provided for compatibility with the pandas DataFrame API.

Parameter Name  Description
filepath        Path to the serialized batch Parquet file.
filesystem      Optional PyArrow filesystem to read from.
fs_config       Filesystem configuration, used when filesystem is not provided.
as_tensor       If True, return the batch in tensor (unpacked) form; otherwise return it in packed matrix form.
pad_value       Value used to pad sequences of uneven length in tensor form.
output_format   Batch data format. Supported values (alias in parentheses): 'pytorch' ('pt'), 'tensorflow' ('tf'), 'jax' ('jx'), 'numpy' ('np'), 'pandas' ('df'), 'pyarrow' ('pa').
pin_memory      For PyTorch tensors, use page-locked (pinned) CPU memory to speed up transfer to the GPU.

select

select(
    features: list[str]
) -> Batch

Create a new batch object containing only the selected features/columns.

Parameter Name  Description
features        Features/columns to keep.

select_numeric

select_numeric() -> Batch

Create a new batch object consisting of numeric features only.
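The behaviour of select and select_numeric can be pictured on a plain dictionary of columns. This is a sketch under stated assumptions: select_sketch and select_numeric_sketch are hypothetical functions, the real methods operate on Batch objects, and treating bool as non-numeric is a choice made here rather than something this reference specifies.

```python
from numbers import Real


def select_sketch(columns: dict, features: list) -> dict:
    """Keep only the named columns, in the order requested."""
    missing = [name for name in features if name not in columns]
    if missing:
        raise KeyError(f"Unknown features: {missing}")
    return {name: columns[name] for name in features}


def is_numeric_column(values: list) -> bool:
    """True if every value is a real number (bool excluded by assumption)."""
    return all(isinstance(v, Real) and not isinstance(v, bool) for v in values)


def select_numeric_sketch(columns: dict) -> dict:
    """Keep only the columns whose values are all numeric."""
    return {name: vals for name, vals in columns.items() if is_numeric_column(vals)}


columns = {
    "price": [9.5, 12.0, 7.25],
    "label": ["a", "b", "a"],
    "qty": [1, 4, 2],
}

print(list(select_sketch(columns, ["qty", "label"])))  # ['qty', 'label']
print(list(select_numeric_sketch(columns)))            # ['price', 'qty']
```

Both helpers return a new dictionary and leave the input untouched, mirroring the "create a new batch object" wording above.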

to_parquet

to_parquet(
    filepath: str | Path,
    filesystem: fs.FileSystem | None = None,
    fs_config: dict[str, Any] | None = None,
    overwrite: bool = False
) -> None

Serialize the batch object to a Parquet file.

Parameter Name  Description
filepath        Path to the Parquet file to create.
filesystem      Optional PyArrow filesystem instance to write to.
fs_config       Filesystem configuration, used when filesystem is not provided.
overwrite       If True, overwrite an existing file with the same name.
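The overwrite flag describes a guard against replacing an existing file. The sketch below illustrates such a guard on the local filesystem using only the standard library. check_target is a hypothetical helper, and raising FileExistsError when overwrite is False is an assumption: this reference does not say how to_parquet reacts in that case.

```python
import tempfile
from pathlib import Path


def check_target(filepath, overwrite: bool = False) -> Path:
    """Return the target path, refusing to replace a file unless asked to.

    Assumed behaviour only: the library's actual error type is not documented.
    """
    path = Path(filepath)
    if path.exists() and not overwrite:
        raise FileExistsError(f"{path} already exists; pass overwrite=True to replace it")
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "batch.parquet"
    check_target(target)     # passes: nothing to overwrite yet
    target.write_bytes(b"")  # stand-in for a real Parquet write
    try:
        check_target(target)
        refused = False
    except FileExistsError:
        refused = True
    allowed = check_target(target, overwrite=True) == target

print(refused, allowed)  # True True
```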

to_tensor

to_tensor(
    pad_value: int | float = np.nan
) -> Batch

Alias for unpack().

Parameter Name  Description
pad_value       Value used to pad sequences of uneven length in tensor form.

unpack

unpack(
    pad_value: int | float = np.nan
) -> Batch

Convert the batch from packed form to tensor (unpacked) form, padding sequences of uneven length with pad_value.

Parameter Name  Description
pad_value       Value used to pad sequences of uneven length in tensor form.
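As a rough illustration of unpacking, the pure-Python sketch below pads each sequence of a packed list up to the longest sequence length and builds a matching Boolean mask. unpack_sketch is a hypothetical function working on a single feature. The layout (one row per sequence) and the mask convention (True at padded positions) are assumptions based on the descriptions of pad_value and pad_mask, not confirmed details of the library.

```python
import math


def unpack_sketch(rows: list, seq_lens: list, pad_value: float = math.nan):
    """Split packed rows into sequences and pad them to a common length.

    Returns (tensor, pad_mask) as nested lists; pad_mask is True wherever
    a value was added as padding (assumed convention).
    """
    max_len = max(seq_lens, default=0)
    tensor, pad_mask = [], []
    start = 0
    for length in seq_lens:
        n_pad = max_len - length
        tensor.append(rows[start:start + length] + [pad_value] * n_pad)
        pad_mask.append([False] * length + [True] * n_pad)
        start += length
    return tensor, pad_mask


packed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
tensor, pad_mask = unpack_sketch(packed, seq_lens=[3, 1, 2], pad_value=0.0)

print(tensor)    # [[1.0, 2.0, 3.0], [4.0, 0.0, 0.0], [5.0, 6.0, 0.0]]
print(pad_mask)  # [[False, False, False], [False, True, True], [False, False, True]]
```

With the default pad_value of NaN, padded entries cannot be told apart from genuine missing values by inspection alone, which is one reason a separate mask is useful.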