roboto.formats.parquet.fetch

Module Contents

roboto.formats.parquet.fetch.logger
roboto.formats.parquet.fetch.open_parquet_file(url_provider, cache_outfile, policy, estimated_column_count, size_bytes=None)

Open a remote Parquet file under a cache policy, from the cheapest available source.

Dispatches on the result of choose_fetch_mode(). An already-cached copy is reused. When the policy calls for a download, the file is downloaded in a concurrency-safe, atomic way. Otherwise the file is streamed over HTTP range requests without touching disk.

Parameters:
  • url_provider (Callable[[], str]) – Resolves the file’s signed download URL. Called at most once, and only when the chosen mode actually needs the URL.

  • cache_outfile (Optional[pathlib.Path]) – The file’s stable local cache path, or None when no cache location is configured (forces streaming).

  • policy (roboto.storage.cache.CachePolicy) – The caller’s cache policy.

  • estimated_column_count (int) – How many columns the read is expected to project; informs the ADAPTIVE download-vs-stream choice.

  • size_bytes (Optional[int]) – The backing object’s size in bytes when the server reports it; None when unknown. It gates cache reuse: a present cached file whose size does not match is treated as stale and re-fetched. On the STREAM path it lets a known-large file skip the whole-file head probe; on the DOWNLOAD path it verifies the downloaded file is complete before it is promoted to the cache.

Returns:

An open pyarrow.parquet.ParquetFile.

Return type:

pyarrow.parquet.ParquetFile
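
The size gate and the atomic promotion described above can be illustrated with a small standard-library sketch. It is not the module's implementation: the helper names cached_copy_is_fresh and download_then_promote, the write_payload callback, and the ".partial" suffix are hypothetical, and trusting a cached copy when size_bytes is None is an assumption the documentation does not confirm.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional


def cached_copy_is_fresh(cache_outfile: Optional[Path], size_bytes: Optional[int]) -> bool:
    """Hypothetical helper: True when a cached file exists and passes the size gate."""
    if cache_outfile is None or not cache_outfile.is_file():
        return False
    if size_bytes is None:
        # Assumption: with no reported size there is nothing to compare against,
        # so the cached copy is trusted. The real module may behave differently.
        return True
    # A size mismatch marks the cached copy as stale.
    return cache_outfile.stat().st_size == size_bytes


def download_then_promote(
    write_payload: Callable[[Path], None],
    cache_outfile: Path,
    size_bytes: Optional[int],
) -> Path:
    """Hypothetical helper: write to a temp file, verify its size, then rename atomically."""
    cache_outfile.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=cache_outfile.parent, suffix=".partial")
    os.close(fd)
    tmp_path = Path(tmp_name)
    try:
        write_payload(tmp_path)
        if size_bytes is not None and tmp_path.stat().st_size != size_bytes:
            raise OSError("downloaded file is incomplete; not promoting it to the cache")
        # Readers see either the old file or the complete new one, never a partial write.
        os.replace(tmp_path, cache_outfile)
    finally:
        # Removes the temp file on failure; a no-op after a successful rename.
        tmp_path.unlink(missing_ok=True)
    return cache_outfile
```

Because each writer uses its own temporary file and the final step is a single rename, concurrent callers in this sketch cannot observe a half-written cache entry; the last successful rename wins.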

roboto.formats.parquet.fetch.parquet_file_from_url(signed_url, size_bytes=None)

Open a Parquet file over HTTP via a signed URL (no local download).

A single ranged GET over fsspec’s shared HTTP session probes the first _STREAM_WHOLE_FILE_PROBE_BYTES bytes of the file. A file smaller than the probe arrives whole in that one request and is read from an in-memory buffer; a larger file (or a failed probe) falls back to HTTP range-request streaming through pyarrow’s filesystem layer.

When size_bytes is known and is at least _STREAM_WHOLE_FILE_PROBE_BYTES, the file is known to be large up front. The whole-file probe cannot pay off in that case, because the in-memory BufferReader path is taken only for files below the threshold, so the probe is skipped and the file is range-streamed directly, avoiding a wasted 16 MiB GET. When size_bytes is None (for example, an older server that omits the size), the probe-then-decide behavior is preserved.

Parameters:
  • signed_url (str) – The file’s signed download URL.

  • size_bytes (Optional[int]) – The backing object’s size in bytes when the server reports it; None when unknown.

Raises:

ValueError – The probe succeeds but the object is empty (0 bytes), which is not a readable Parquet file.

Return type:

pyarrow.parquet.ParquetFile
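
The probe-then-decide behavior of parquet_file_from_url can be summarized as two small decisions. The sketch below is illustrative only: needs_probe and read_strategy_after_probe are hypothetical names, PROBE_BYTES stands in for _STREAM_WHOLE_FILE_PROBE_BYTES (assumed to be 16 MiB, inferred from the "16 MiB GET" mentioned above), and treating a probe body shorter than the probe size as "the whole file arrived" is an assumption about how the real function detects small files.

```python
from typing import Optional

# Stand-in for _STREAM_WHOLE_FILE_PROBE_BYTES; 16 MiB is inferred from the text, not confirmed.
PROBE_BYTES = 16 * 1024 * 1024


def needs_probe(size_bytes: Optional[int], probe_bytes: int = PROBE_BYTES) -> bool:
    """Hypothetical helper: probe unless the file is already known to be large."""
    # An unknown size keeps the probe; a known size at or above the threshold skips it.
    return size_bytes is None or size_bytes < probe_bytes


def read_strategy_after_probe(
    probe_body_len: Optional[int], probe_bytes: int = PROBE_BYTES
) -> str:
    """Hypothetical helper: pick a read path from the probe outcome.

    probe_body_len is the number of bytes the ranged GET returned,
    or None when the probe request failed.
    """
    if probe_body_len is None:
        # A failed probe falls back to range-request streaming.
        return "range_stream"
    if probe_body_len == 0:
        # A successful probe of an empty object is not a readable Parquet file.
        raise ValueError("object is empty (0 bytes)")
    if probe_body_len < probe_bytes:
        # Assumption: a short body means the whole file arrived in one request.
        return "in_memory_buffer"
    return "range_stream"
```

For example, under these assumptions a 1 KiB file with an unknown size is probed and then read from memory, while a file reported as 64 MiB skips the probe and is range-streamed directly.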