File System Options
These classes provide serializable wrappers around PyArrow file and dataset options.
Each has an as_pyarrow property that returns the underlying PyArrow options object.
Read Options
CSVReadOptions
from tempora.datasets import CSVReadOptions
Serializable version of pa.csv.ReadOptions.
CSVReadOptions(
use_threads: bool = True,
block_size: int | None = None,
skip_rows: int = 0,
skip_rows_after_names: int = 0,
column_names: list[str] | None = None,
autogenerate_column_names: bool = False,
encoding: str = 'utf8'
)
Parameters
| Name | Description |
|---|---|
| use_threads | Whether to use multiple threads to accelerate reading. |
| block_size | How many bytes to process at a time from the input stream. |
| skip_rows | Number of rows to skip before the column names and data. |
| skip_rows_after_names | Number of rows to skip after the column names. |
| column_names | Column names for the target table. |
| autogenerate_column_names | Autogenerate column names if column_names is empty. |
| encoding | Character encoding of the CSV data. |
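As a rough illustration of what `block_size` controls (how much of the input stream is consumed per read), here is a stdlib-only sketch; this is an analogy, not tempora or PyArrow code:

```python
import io

# A 2500-byte stream consumed in fixed-size blocks.
stream = io.BytesIO(b"x" * 2500)
block_size = 1024  # bytes to process at a time
chunks = []
while chunk := stream.read(block_size):
    chunks.append(len(chunk))
# Three reads: two full blocks and a 452-byte remainder.
```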
JSONReadOptions
from tempora.datasets import JSONReadOptions
Serializable version of pa.json.ReadOptions.
JSONReadOptions(
use_threads: bool = True,
block_size: int | None = None
)
Parameters
| Name | Description |
|---|---|
| use_threads | Whether to use multiple threads to accelerate reading. |
| block_size | How many bytes to process at a time from the input stream. |
ParquetReadOptions
from tempora.datasets import ParquetReadOptions
Serializable version of pa.ds.ParquetReadOptions.
ParquetReadOptions(
dictionary_columns: list[str] | None = None,
coerce_int96_timestamp_unit: str | None = None
)
Parameters
| Name | Description |
|---|---|
| dictionary_columns | Column names to dictionary-encode as they are read. |
| coerce_int96_timestamp_unit | Timestamp unit for INT96 timestamps (e.g., 'ms'). |
Parse Options
CSVParseOptions
from tempora.datasets import CSVParseOptions
Serializable version of pa.csv.ParseOptions.
CSVParseOptions(
delimiter: str = ',',
quote_char: str | bool = '"',
double_quote: bool = True,
escape_char: str | bool = False,
newlines_in_values: bool = False,
ignore_empty_lines: bool = True
)
Parameters
| Name | Description |
|---|---|
| delimiter | Character delimiting individual cells in the CSV data. |
| quote_char | Character used for quoting CSV values, or False to disable quoting. |
| double_quote | Whether two quotes in a quoted value denote a single quote. |
| escape_char | Character used for escaping special characters, or False to disable escaping. |
| newlines_in_values | Whether newline characters are allowed in CSV values. |
| ignore_empty_lines | Whether empty lines are ignored. |
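Python's stdlib `csv` module exposes analogous dialect options (`delimiter`, `quotechar`, `doublequote`, `escapechar`), which makes the `double_quote` behavior easy to demonstrate; this is an analogy, not PyArrow itself:

```python
import csv
import io

# With doublequote=True, a doubled quote inside a quoted field
# denotes a single literal quote character.
data = 'a;"say ""hi""";c\n'
rows = list(
    csv.reader(io.StringIO(data), delimiter=";", quotechar='"', doublequote=True)
)
```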
JSONParseOptions
from tempora.datasets import JSONParseOptions
Serializable version of pa.json.ParseOptions.
JSONParseOptions(
explicit_schema: pa.Schema | None = None,
newlines_in_values: bool = False,
unexpected_field_behavior: Literal['ignore', 'error', 'infer'] = 'infer'
)
Parameters
| Name | Description |
|---|---|
| explicit_schema | Explicit schema (no type inference, ignores other fields). |
| newlines_in_values | Whether objects may be printed across multiple lines. |
| unexpected_field_behavior | How unexpected fields are handled ('ignore', 'error', 'infer'). |
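PyArrow's JSON reader consumes newline-delimited JSON; when `newlines_in_values` is False, each line must hold a complete object. A stdlib sketch of that line-delimited shape (not tempora code):

```python
import json

# Newline-delimited JSON: one complete object per line.
ndjson = '{"a": 1}\n{"a": 2, "b": 3}\n'
records = [json.loads(line) for line in ndjson.splitlines()]
```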
Convert Options
CSVConvertOptions
from tempora.datasets import CSVConvertOptions
Serializable version of pa.csv.ConvertOptions.
CSVConvertOptions(
check_utf8: bool = True,
column_types: pa.Schema | None = None,
null_values: list[str] | None = None,
decimal_point: str = '.',
strings_can_be_null: bool = False,
quoted_strings_can_be_null: bool = True,
auto_dict_encode: bool = False,
auto_dict_max_cardinality: int | None = None,
timestamp_parsers: list[str] | None = None
)
Parameters
| Name | Description |
|---|---|
| check_utf8 | Whether to check UTF-8 validity of string columns. |
| column_types | Explicit mapping of column names to types. |
| null_values | Strings that denote nulls in the data. |
| decimal_point | Character used as the decimal point. |
| strings_can_be_null | Whether unquoted string values matching one of null_values are converted to null. |
| quoted_strings_can_be_null | Whether quoted values matching one of null_values are converted to null. |
| auto_dict_encode | Whether to try to automatically dictionary-encode string/binary data. |
| auto_dict_max_cardinality | Maximum dictionary cardinality per chunk when auto_dict_encode is enabled. |
| timestamp_parsers | Strptime-compatible format strings tried in order when parsing timestamp columns. |
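The entries in `timestamp_parsers` are strptime-style format strings, the same syntax the stdlib `datetime.strptime` accepts. For example:

```python
from datetime import datetime

# A format string of the kind passed in timestamp_parsers.
fmt = "%Y-%m-%d %H:%M:%S"
parsed = datetime.strptime("2024-01-02 03:04:05", fmt)
```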
Partitioning Options
DirectoryPartitioning
from tempora.datasets import DirectoryPartitioning
Serializable version of pa.ds.DirectoryPartitioning.
DirectoryPartitioning(
schema: pa.Schema,
dictionaries: dict[str, list[Any]] | None = None,
segment_encoding: str = 'uri'
)
Parameters
| Name | Description |
|---|---|
| schema | Schema describing partitions present in the file path. |
| dictionaries | Dictionary values for dictionary-typed fields in schema. |
| segment_encoding | How to decode path segments after splitting: 'uri' percent-decodes them, 'none' leaves them as-is. |
FilenamePartitioning
from tempora.datasets import FilenamePartitioning
Serializable version of pa.ds.FilenamePartitioning.
FilenamePartitioning(
schema: pa.Schema,
dictionaries: dict[str, list[Any]] | None = None,
segment_encoding: str = 'uri'
)
Parameters
| Name | Description |
|---|---|
| schema | Schema describing partitions present in the file path. |
| dictionaries | Dictionary values for dictionary-typed fields in schema. |
| segment_encoding | How to decode path segments after splitting: 'uri' percent-decodes them, 'none' leaves them as-is. |
HivePartitioning
from tempora.datasets import HivePartitioning
Serializable version of pa.ds.HivePartitioning.
HivePartitioning(
schema: pa.Schema,
dictionaries: dict[str, list[Any]] | None = None,
null_fallback: str = '__HIVE_DEFAULT_PARTITION__',
segment_encoding: str = 'uri'
)
Parameters
| Name | Description |
|---|---|
| schema | Schema describing partitions present in the file path. |
| dictionaries | Dictionary values for dictionary-typed fields in schema. |
| null_fallback | Label to use when a partition field is null. |
| segment_encoding | How to decode path segments after splitting: 'uri' percent-decodes them, 'none' leaves them as-is. |
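A stdlib sketch of what hive-style partitioning encodes in a path: `key=value` directory segments, with `segment_encoding='uri'` meaning percent-decoding. `parse_hive_path` is a hypothetical helper for illustration, not a tempora or PyArrow API:

```python
from urllib.parse import unquote

def parse_hive_path(path: str) -> dict[str, str]:
    """Extract key=value segments from a hive-style path, URI-decoding values."""
    fields = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            fields[key] = unquote(value)
    return fields

parsed = parse_hive_path("year=2024/city=New%20York/part-0.parquet")
```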