File System Options

These classes provide serializable wrappers around PyArrow file and dataset options. Each has an as_pyarrow property that returns the underlying PyArrow options object.

Read Options

CSVReadOptions

from tempora.datasets import CSVReadOptions

Serializable version of pa.csv.ReadOptions.

CSVReadOptions(
    use_threads: bool = True,
    block_size: int | None = None,
    skip_rows: int = 0,
    skip_rows_after_names: int = 0,
    column_names: list[str] | None = None,
    autogenerate_column_names: bool = False,
    encoding: str = 'utf8'
)

Parameters

Name Description
use_threads Whether to use multiple threads to accelerate reading.
block_size How many bytes to process at a time from the input stream.
skip_rows Number of rows to skip before the column names and data.
skip_rows_after_names Number of rows to skip after the column names.
column_names Column names for the target table.
autogenerate_column_names Autogenerate column names if column_names is empty.
encoding Character encoding of the CSV data.

JSONReadOptions

from tempora.datasets import JSONReadOptions

Serializable version of pa.json.ReadOptions.

JSONReadOptions(
    use_threads: bool = True,
    block_size: int | None = None
)

Parameters

Name Description
use_threads Whether to use multiple threads to accelerate reading.
block_size How many bytes to process at a time from the input stream.

ParquetReadOptions

from tempora.datasets import ParquetReadOptions

Serializable version of pa.ds.ParquetReadOptions.

ParquetReadOptions(
    dictionary_columns: list[str] | None = None,
    coerce_int96_timestamp_unit: str | None = None
)

Parameters

Name Description
dictionary_columns Column names to dictionary-encode as they are read.
coerce_int96_timestamp_unit Timestamp unit for INT96 timestamps (e.g., 'ms').

Parse Options

CSVParseOptions

from tempora.datasets import CSVParseOptions

Serializable version of pa.csv.ParseOptions.

CSVParseOptions(
    delimiter: str = ',',
    quote_char: str | bool = '"',
    double_quote: bool = True,
    escape_char: str | bool = False,
    newlines_in_values: bool = False,
    ignore_empty_lines: bool = True
)

Parameters

Name Description
delimiter Character delimiting individual cells in the CSV data.
quote_char Character used for quoting CSV values.
double_quote Whether two quotes in a quoted value denote a single quote.
escape_char Character used for escaping special characters.
newlines_in_values Whether newline characters are allowed in CSV values.
ignore_empty_lines Whether empty lines are ignored.

JSONParseOptions

from tempora.datasets import JSONParseOptions

Serializable version of pa.json.ParseOptions.

JSONParseOptions(
    explicit_schema: pa.Schema | None = None,
    newlines_in_values: bool = False,
    unexpected_field_behavior: Literal['ignore', 'error', 'infer'] = 'infer'
)

Parameters

Name Description
explicit_schema Explicit schema (no type inference, ignores other fields).
newlines_in_values Whether objects may be printed across multiple lines.
unexpected_field_behavior How unexpected fields are handled ('ignore', 'error', 'infer').

Convert Options

CSVConvertOptions

from tempora.datasets import CSVConvertOptions

Serializable version of pa.csv.ConvertOptions.

CSVConvertOptions(
    check_utf8: bool = True,
    column_types: pa.Schema | None = None,
    null_values: list[str] | None = None,
    decimal_point: str = '.',
    strings_can_be_null: bool = False,
    quoted_strings_can_be_null: bool = True,
    auto_dict_encode: bool = False,
    auto_dict_max_cardinality: int | None = None,
    timestamp_parsers: list[str] | None = None
)

Parameters

Name Description
check_utf8 Whether to check UTF-8 validity of string columns.
column_types Explicit mapping of column names to types.
null_values Strings that denote nulls in the data.
decimal_point Character used as decimal point.
strings_can_be_null Whether string/binary columns can have nulls.
quoted_strings_can_be_null Whether quoted values can be null.
auto_dict_encode Whether to auto dict-encode string/binary data.
auto_dict_max_cardinality Maximum dictionary cardinality per chunk.
timestamp_parsers Strptime-compatible timestamp formats.

Partitioning Options

DirectoryPartitioning

from tempora.datasets import DirectoryPartitioning

Serializable version of pa.ds.DirectoryPartitioning.

DirectoryPartitioning(
    schema: pa.Schema,
    dictionaries: dict[str, list[Any]] | None = None,
    segment_encoding: str = 'uri'
)

Parameters

Name Description
schema Schema describing partitions present in the file path.
dictionaries Dictionary values for dictionary-typed fields in schema.
segment_encoding How to decode path segments after splitting: 'uri' (percent-decode) or 'none'.

FilenamePartitioning

from tempora.datasets import FilenamePartitioning

Serializable version of pa.ds.FilenamePartitioning.

FilenamePartitioning(
    schema: pa.Schema,
    dictionaries: dict[str, list[Any]] | None = None,
    segment_encoding: str = 'uri'
)

Parameters

Name Description
schema Schema describing partitions present in the file path.
dictionaries Dictionary values for dictionary-typed fields in schema.
segment_encoding How to decode path segments after splitting: 'uri' (percent-decode) or 'none'.

HivePartitioning

from tempora.datasets import HivePartitioning

Serializable version of pa.ds.HivePartitioning.

HivePartitioning(
    schema: pa.Schema,
    dictionaries: dict[str, list[Any]] | None = None,
    null_fallback: str = '__HIVE_DEFAULT_PARTITION__',
    segment_encoding: str = 'uri'
)

Parameters

Name Description
schema Schema describing partitions present in the file path.
dictionaries Dictionary values for dictionary-typed fields in schema.
null_fallback Label to use when a field is null.
segment_encoding How to decode path segments after splitting: 'uri' (percent-decode) or 'none'.