roboto.experimental.topics.decode.parquet#

Module Contents#

roboto.experimental.topics.decode.parquet.CACHED_PARQUET_NAME_PATTERN = '{fs_node_id}.parquet'#: Filename template for locally cached Parquet files; keyed on the stable file id.

roboto.experimental.topics.decode.parquet.decode_parquet_batches(scan_task, timestamp, window, projection_paths, params)#

Wrap each surviving Parquet row group into a RecordBatch, prefixed with the stored-time column.

Empty row groups are skipped. The timestamp column name is suffixed until it stops colliding with a projected column, the same disambiguation the MCAP path applies.

Parameters:

scan_task (roboto.experimental.topics.read_plan.ReadPlanScanTask)
timestamp (roboto.experimental.topics.read_plan.ReadPlanTimestamp)
window (roboto.experimental.topics.read_plan.TimeWindow)
projection_paths (collections.abc.Sequence[roboto.domain.topics.record.FieldPath])
params (roboto.experimental.topics.decode.common.ScanTaskDecodeParams)

Return type:

collections.abc.Generator[pyarrow.RecordBatch, None, None]

roboto.experimental.topics.decode.parquet.parquet_filtered_row_groups(scan_task, timestamp, window, projection_paths, params)#

Yield each surviving row group as (projected_table, stored_timestamps).

The table holds the projected columns only (the timestamp column is read for filtering and dropped when the projection omits it); rows are filtered to window (inclusive on both ends), and stored_timestamps is the aligned int64 nanosecond column.

Parameters:

scan_task (roboto.experimental.topics.read_plan.ReadPlanScanTask)
timestamp (roboto.experimental.topics.read_plan.ReadPlanTimestamp)
window (roboto.experimental.topics.read_plan.TimeWindow)
projection_paths (collections.abc.Sequence[roboto.domain.topics.record.FieldPath])
params (roboto.experimental.topics.decode.common.ScanTaskDecodeParams)

Return type:

collections.abc.Generator[tuple[pyarrow.Table, pyarrow.Int64Array], None, None]

roboto.experimental.topics.decode.parquet.row_group_fully_in_window(row_group_metadata, timestamp, start, end)#

Whether column-chunk statistics prove every row’s timestamp is non-null and within the inclusive window.

Conservative: any absent or incomplete statistic returns False. Returns True only when the timestamp column chunk reports zero nulls and a [min, max] range that sits inside [start, end] — conditions under which the row-level mask would be all-true.

Parameters:

row_group_metadata (pyarrow.parquet.RowGroupMetaData)
timestamp (roboto.formats.parquet.Timestamp)
start (int)
end (int)

Return type:

bool