roboto.formats.parquet.table_transforms#
Module Contents#
- roboto.formats.parquet.table_transforms.compute_time_filter_mask(timestamps, start_time=None, end_time=None)#
Compute a boolean mask indicating which rows fall within the specified time range. Returns None if no time filtering is needed (both start_time and end_time are None).
- Parameters:
timestamps (pyarrow.Array)
start_time (Optional[int])
end_time (Optional[int])
- Return type:
Optional[pyarrow.BooleanArray]
- roboto.formats.parquet.table_transforms.extract_timestamp_field(schema, timestamp_field, unit_hint)#
Aggregate timestamp info into a helper utility for handling time-based data operations.
unit_hint carries the recorded unit of the stored values for the case where the Arrow column type does not encode one; callers resolve it from their own field record at the bounded-context boundary.
- Parameters:
schema (pyarrow.Schema)
timestamp_field (roboto.formats.fields.FieldSelection)
unit_hint (Optional[str])
- Return type:
- roboto.formats.parquet.table_transforms.extract_timestamps(table, timestamp)#
Extract timestamps in nanoseconds since Unix epoch from the table’s timestamp column.
- Parameters:
table (pyarrow.Table)
timestamp (roboto.formats.parquet.timestamp.Timestamp)
- Return type:
pyarrow.Int64Array
- roboto.formats.parquet.table_transforms.narrow_list_nested_fields(table, schema, fields)#
Prune list-of-struct columns to the projected leaves inside each element.
PyArrow’s prefix-based nested column selection cannot reach through list wrapper nodes, so resolve_columns() reads a list-nested leaf’s whole list ancestor column, and every element keeps all of its struct fields. This Arrow-native post-read pass narrows each such element down to the requested leaves, leaving every other read path byte-identical.
A top-level root is narrowed iff at least one of its projected paths has a list ancestor; otherwise the table is returned unchanged (pure struct, scalar, and scalar-list reads never enter the rebuild). Per root, a trie is built from its paths with the root component stripped, so that non-list-nested siblings the projection also keeps are preserved.
- Parameters:
table (pyarrow.Table)
schema (pyarrow.Schema)
fields (collections.abc.Iterable[roboto.formats.fields.FieldSelection])
- Return type:
pyarrow.Table
- roboto.formats.parquet.table_transforms.resolve_columns(schema, fields)#
Build a deduplicated list of column names safe for read_row_group(columns=...).
Children of list-type columns are replaced by their list ancestor’s column name because PyArrow’s prefix-based nested column selection does not work through list wrapper nodes in the physical Parquet schema. Selecting the parent list column already returns its full nested structure.
This is important because the projected fields contain only leaf paths. For a column like points: list<struct<x, y>>, only points.x and points.y are selected; the parent points field is absent. This function derives the correct parent column name from the child’s path_in_schema.
Children of struct-type columns are preserved because PyArrow can resolve them via dot-separated prefix matching (e.g. "position.x" selects the x child of the position struct).
- Parameters:
schema (pyarrow.Schema)
fields (collections.abc.Iterable[roboto.formats.fields.FieldSelection])
- Return type:
list[str]
- roboto.formats.parquet.table_transforms.should_narrow_list_nested_fields(schema, fields)#
Return whether narrow_list_nested_fields() would change the table.
True iff at least one projected field addresses a leaf inside a list (its path has a list ancestor). When False, every projected field resolves through structs and scalars alone, so PyArrow’s column selection already returns the narrowed shape and the post-read prune is a no-op; callers can skip it.
The check is cheap enough to evaluate once per file, which lets callers hoist the per-row-group narrowing decision out of the decode loop.
- Parameters:
schema (pyarrow.Schema)
fields (collections.abc.Iterable[roboto.formats.fields.FieldSelection])
- Return type:
bool
- roboto.formats.parquet.table_transforms.should_read_row_group(row_group_metadata, timestamp, start_time=None, end_time=None)#
Determine whether a Parquet row group contains data within the requested time range. Callers use this to avoid requesting column chunks from a row group that falls outside the range.
- Parameters:
row_group_metadata (pyarrow.parquet.RowGroupMetaData)
timestamp (roboto.formats.parquet.timestamp.Timestamp)
start_time (Optional[int])
end_time (Optional[int])
- Return type:
bool