roboto.formats.parquet.table_transforms

Module Contents

roboto.formats.parquet.table_transforms.compute_time_filter_mask(timestamps, start_time=None, end_time=None)

Compute a boolean mask indicating which rows fall within the specified time range. Returns None if no time filtering is needed (both start_time and end_time are None).

Parameters:
  • timestamps (pyarrow.Array)

  • start_time (Optional[int])

  • end_time (Optional[int])

Return type:

Optional[pyarrow.BooleanArray]
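
The masking behavior described above can be illustrated with a minimal plain-Python sketch. It is not the library's implementation: `time_filter_mask` is a hypothetical name, plain lists stand in for Arrow arrays, and the boundary handling (inclusive start, exclusive end) is an assumption that this page does not confirm.

```python
from typing import Optional, Sequence


def time_filter_mask(
    timestamps: Sequence[int],
    start_time: Optional[int] = None,
    end_time: Optional[int] = None,
) -> Optional[list]:
    """Hypothetical stand-in for compute_time_filter_mask using plain lists."""
    # No bounds at all: return None to signal that no filtering is needed.
    if start_time is None and end_time is None:
        return None
    mask = []
    for ts in timestamps:
        # ASSUMPTION: start is inclusive, end is exclusive.
        after_start = start_time is None or ts >= start_time
        before_end = end_time is None or ts < end_time
        mask.append(after_start and before_end)
    return mask
```

Returning None rather than an all-True mask lets a caller skip the filter step entirely when no time range was requested.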

roboto.formats.parquet.table_transforms.extract_timestamp_field(schema, timestamp_field, unit_hint)

Build a Timestamp helper that describes the timestamp field, for use in time-based data operations.

unit_hint carries the recorded unit of the stored values for the case where the Arrow column type does not encode one; callers resolve it from their own field record at the bounded-context boundary.

Parameters:
Return type:

roboto.formats.parquet.timestamp.Timestamp

roboto.formats.parquet.table_transforms.extract_timestamps(table, timestamp)

Extract timestamps in nanoseconds since Unix epoch from the table’s timestamp column.

Parameters:
Return type:

pyarrow.Int64Array
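
As a rough illustration of normalizing stored values to nanoseconds since the Unix epoch, the sketch below scales integers by their recorded unit. The helper name `to_nanoseconds` and the lookup table are hypothetical; only the unit names ("s", "ms", "us", "ns") are Arrow's standard time units. The library's actual conversion may differ.

```python
# Multipliers from each Arrow time unit name to nanoseconds.
NANOS_PER_UNIT = {"s": 1_000_000_000, "ms": 1_000_000, "us": 1_000, "ns": 1}


def to_nanoseconds(values, unit):
    """Hypothetical helper: scale integer timestamps in `unit` to nanoseconds."""
    try:
        factor = NANOS_PER_UNIT[unit]
    except KeyError:
        raise ValueError(f"unsupported time unit: {unit!r}") from None
    # Python integers do not overflow; an Int64 array would need a range check.
    return [value * factor for value in values]
```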

roboto.formats.parquet.table_transforms.narrow_list_nested_fields(table, schema, fields)

Prune list-of-struct columns to the projected leaves inside each element.

PyArrow’s prefix-based nested column selection cannot reach through list wrapper nodes, so for a list-nested leaf resolve_columns() reads the whole list ancestor column, and every element keeps all of its struct fields. This Arrow-native post-read pass narrows each such element down to the requested leaves, leaving every other read path byte-identical.

A top-level root is narrowed iff at least one of its projected paths has a list ancestor; if no root qualifies, the table is returned unchanged (pure struct, scalar, and scalar-list reads never enter the rebuild). For each narrowed root, a trie is built from its projected paths with the root component stripped, so that any non-list-nested siblings the projection also keeps are preserved.

Parameters:
Return type:

pyarrow.Table
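
The trie-based narrowing described above can be sketched on plain Python values, with dicts standing in for structs and lists for Arrow lists. This is an illustration under stated assumptions, not the Arrow-native implementation: `build_trie`, `prune`, and `narrow_root` are hypothetical names, and paths are simplified dotted strings.

```python
def build_trie(paths):
    """Build a nested-dict trie from dotted paths; an empty dict marks a leaf."""
    trie = {}
    for path in paths:
        node = trie
        for part in path.split("."):
            node = node.setdefault(part, {})
    return trie


def prune(value, trie):
    """Keep only the fields named in the trie, descending through lists and structs."""
    if not trie:  # projected leaf: keep the value whole
        return value
    if value is None:  # null struct or null list stays null
        return None
    if isinstance(value, list):  # list wrapper: prune each element
        return [prune(item, trie) for item in value]
    return {key: prune(value[key], sub) for key, sub in trie.items() if key in value}


def narrow_root(column_values, root, projected_paths):
    """Narrow one top-level column to the projected leaves under `root`."""
    prefix = root + "."
    # Strip the root component before building the trie.
    suffixes = [p[len(prefix):] for p in projected_paths if p.startswith(prefix)]
    trie = build_trie(suffixes)
    return [prune(value, trie) for value in column_values]
```

Because the trie holds every projected path under the root, a struct sibling such as `scan.frame_id` survives alongside the narrowed `scan.points` elements.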

roboto.formats.parquet.table_transforms.resolve_columns(schema, fields)

Build a deduplicated list of column names safe for read_row_group(columns=...).

Children of list-type columns are replaced by their list ancestor’s column name because PyArrow’s prefix-based nested column selection does not work through list wrapper nodes in the physical Parquet schema. Selecting the parent list column already returns its full nested structure.

This is important because the projected fields contain only leaf paths. For a column like points: list<struct<x, y>>, only points.x and points.y are selected — the parent points field is absent. This function derives the correct parent column name from the child’s path_in_schema.

Children of struct-type columns are preserved because PyArrow can resolve them via dot-separated prefix matching (e.g. "position.x" selects the x child of the position struct).

Parameters:
Return type:

list[str]
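
The column-resolution rule above can be sketched without PyArrow. In this hedged illustration, `resolve_column_names` is a hypothetical function, leaf paths are simplified dotted strings, and `list_paths` is an assumed set naming every list-typed node; the real function derives this information from the schema and each field's path_in_schema.

```python
def resolve_column_names(leaf_paths, list_paths):
    """Replace list-nested leaves with their list ancestor; keep struct children.

    leaf_paths: dotted leaf paths in projection order.
    list_paths: dotted paths of list-typed nodes (an assumed input shape).
    """
    resolved = []
    seen = set()
    for path in leaf_paths:
        parts = path.split(".")
        name = path
        # Walk proper ancestors from the root; the outermost list ancestor wins.
        for depth in range(1, len(parts)):
            prefix = ".".join(parts[:depth])
            if prefix in list_paths:
                name = prefix
                break
        # Deduplicate while preserving first-seen order.
        if name not in seen:
            seen.add(name)
            resolved.append(name)
    return resolved
```

For the `points: list<struct<x, y>>` example, both `points.x` and `points.y` collapse to the single name `points`, while `position.x` is passed through unchanged.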

roboto.formats.parquet.table_transforms.should_narrow_list_nested_fields(schema, fields)

Return whether narrow_list_nested_fields() would change the table.

True iff at least one projected field addresses a leaf inside a list (its path has a list ancestor). When False, every projected field resolves through structs and scalars alone, so PyArrow’s column selection already returns the narrowed shape and the post-read prune is a no-op — callers can skip it.

The check is cheap enough to evaluate once per file, which hoists the per-row-group narrowing decision out of the decode loop.

Parameters:
Return type:

bool
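
A minimal sketch of the predicate and of hoisting it out of the decode loop follows. All names here (`has_list_ancestor`, `should_narrow`, `decode_file`) and the `list_paths` input are hypothetical, and `narrow` is a caller-supplied stand-in for the real narrowing pass.

```python
def has_list_ancestor(path, list_paths):
    """True if any proper ancestor of the dotted path is a list-typed node."""
    parts = path.split(".")
    return any(".".join(parts[:depth]) in list_paths for depth in range(1, len(parts)))


def should_narrow(leaf_paths, list_paths):
    """True iff at least one projected leaf sits inside a list."""
    return any(has_list_ancestor(path, list_paths) for path in leaf_paths)


def decode_file(row_groups, leaf_paths, list_paths, narrow):
    # Evaluate the decision once per file, not once per row group.
    needs_narrowing = should_narrow(leaf_paths, list_paths)
    decoded = []
    for row_group in row_groups:
        decoded.append(narrow(row_group) if needs_narrowing else row_group)
    return decoded
```

A scalar list such as `tags` has no list ancestor of its own, so a projection containing only struct children and scalar lists skips the narrowing pass.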

roboto.formats.parquet.table_transforms.should_read_row_group(row_group_metadata, timestamp, start_time=None, end_time=None)

Determine whether a Parquet row group contains data within the requested time range. Used to skip requesting column chunks from a row group that cannot contain relevant data.

Parameters:
Return type:

bool
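
One common way to make this decision is to compare the requested range against the min/max statistics of the row group's timestamp column. The sketch below shows that interval-overlap test under stated assumptions: `row_group_overlaps` is a hypothetical helper that takes the statistics directly, the end bound is assumed exclusive, and a row group without statistics is conservatively read. This page does not confirm that the library works this way.

```python
def row_group_overlaps(min_ts, max_ts, start_time=None, end_time=None):
    """Hypothetical overlap test on a row group's timestamp min/max statistics."""
    # Missing statistics: irrelevance cannot be proven, so read the row group.
    if min_ts is None or max_ts is None:
        return True
    # Entire row group ends before the requested start.
    if start_time is not None and max_ts < start_time:
        return False
    # Entire row group begins at or after the requested end (ASSUMED exclusive).
    if end_time is not None and min_ts >= end_time:
        return False
    return True
```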