File System Options

These classes provide serializable wrappers around PyArrow file and dataset options. Each has an as_pyarrow property that returns the underlying PyArrow options object.

Read Options

CSVReadOptions

from tempora.datasets import CSVReadOptions

Serializable version of pa.csv.ReadOptions.

CSVReadOptions(
    use_threads: bool = True,
    block_size: int | None = None,
    skip_rows: int = 0,
    skip_rows_after_names: int = 0,
    column_names: list[str] | None = None,
    autogenerate_column_names: bool = False,
    encoding: str = 'utf8'
)

Parameters

Name Description
use_threads Whether to use multiple threads to accelerate reading.
block_size How many bytes to process at a time from the input stream.
skip_rows Number of rows to skip before the column names and data.
skip_rows_after_names Number of rows to skip after the column names.
column_names Column names for the target table.
autogenerate_column_names Autogenerate column names if column_names is empty.
encoding Character encoding of the CSV data.

JSONReadOptions

from tempora.datasets import JSONReadOptions

Serializable version of pa.json.ReadOptions.

JSONReadOptions(
    use_threads: bool = True,
    block_size: int | None = None
)

Parameters

Name Description
use_threads Whether to use multiple threads to accelerate reading.
block_size How many bytes to process at a time from the input stream.

ParquetReadOptions

from tempora.datasets import ParquetReadOptions

Serializable version of pa.ds.ParquetReadOptions.

ParquetReadOptions(
    dictionary_columns: list[str] | None = None,
    coerce_int96_timestamp_unit: str | None = None
)

Parameters

Name Description
dictionary_columns Column names to dictionary-encode as they are read.
coerce_int96_timestamp_unit Timestamp unit for INT96 timestamps (e.g., 'ms').

Parse Options

CSVParseOptions

from tempora.datasets import CSVParseOptions

Serializable version of pa.csv.ParseOptions.

CSVParseOptions(
    delimiter: str = ',',
    quote_char: str | bool = '"',
    double_quote: bool = True,
    escape_char: str | bool = False,
    newlines_in_values: bool = False,
    ignore_empty_lines: bool = True
)

Parameters

Name Description
delimiter Character delimiting individual cells in the CSV data.
quote_char Character used for quoting CSV values.
double_quote Whether two quotes in a quoted value denote a single quote.
escape_char Character used for escaping special characters.
newlines_in_values Whether newline characters are allowed in CSV values.
ignore_empty_lines Whether empty lines are ignored.

JSONParseOptions

from tempora.datasets import JSONParseOptions

Serializable version of pa.json.ParseOptions.

JSONParseOptions(
    explicit_schema: pa.Schema | None = None,
    newlines_in_values: bool = False,
    unexpected_field_behavior: Literal['ignore', 'error', 'infer'] = 'infer'
)

Parameters

Name Description
explicit_schema Explicit schema (no type inference, ignores other fields).
newlines_in_values Whether objects may be printed across multiple lines.
unexpected_field_behavior How unexpected fields are handled ('ignore', 'error', 'infer').

Convert Options

CSVConvertOptions

from tempora.datasets import CSVConvertOptions

Serializable version of pa.csv.ConvertOptions.

CSVConvertOptions(
    check_utf8: bool = True,
    column_types: pa.Schema | None = None,
    null_values: list[str] | None = None,
    decimal_point: str = '.',
    strings_can_be_null: bool = False,
    quoted_strings_can_be_null: bool = True,
    auto_dict_encode: bool = False,
    auto_dict_max_cardinality: int | None = None,
    timestamp_parsers: list[str] | None = None
)

Parameters

Name Description
check_utf8 Whether to check UTF-8 validity of string columns.
column_types Explicit mapping of column names to types.
null_values Strings that denote nulls in the data.
decimal_point Character used as decimal point.
strings_can_be_null Whether string/binary columns can have nulls.
quoted_strings_can_be_null Whether quoted values can be null.
auto_dict_encode Whether to auto dict-encode string/binary data.
auto_dict_max_cardinality Maximum dictionary cardinality per chunk.
timestamp_parsers Strptime-compatible timestamp formats.

Partitioning Options

DirectoryPartitioning

from tempora.datasets import DirectoryPartitioning

Serializable version of pa.ds.DirectoryPartitioning.

DirectoryPartitioning(
    schema: pa.Schema,
    dictionaries: dict[str, list[Any]] | None = None,
    segment_encoding: str = 'uri'
)

Parameters

Name Description
schema Schema describing partitions present in the file path.
dictionaries Dictionary values for dictionary-typed fields in schema.
segment_encoding How to decode path segments after splitting: 'uri' (percent-decode) or 'none'.

FilenamePartitioning

from tempora.datasets import FilenamePartitioning

Serializable version of pa.ds.FilenamePartitioning.

FilenamePartitioning(
    schema: pa.Schema,
    dictionaries: dict[str, list[Any]] | None = None,
    segment_encoding: str = 'uri'
)

Parameters

Name Description
schema Schema describing partitions present in the file path.
dictionaries Dictionary values for dictionary-typed fields in schema.
segment_encoding How to decode path segments after splitting: 'uri' (percent-decode) or 'none'.

HivePartitioning

from tempora.datasets import HivePartitioning

Serializable version of pa.ds.HivePartitioning.

HivePartitioning(
    schema: pa.Schema,
    dictionaries: dict[str, list[Any]] | None = None,
    null_fallback: str = '__HIVE_DEFAULT_PARTITION__',
    segment_encoding: str = 'uri'
)

Parameters

Name Description
schema Schema describing partitions present in the file path.
dictionaries Dictionary values for dictionary-typed fields in schema.
null_fallback Label to use when a field is null.
segment_encoding How to decode path segments after splitting: 'uri' (percent-decode) or 'none'.