roboto.formats.parquet.fetch
Module Contents
- roboto.formats.parquet.fetch.logger
- roboto.formats.parquet.fetch.open_parquet_file(url_provider, cache_outfile, policy, estimated_column_count, size_bytes=None)
Open a remote Parquet file under a cache policy, from the cheapest available source.
Dispatches on choose_fetch_mode(): an already-cached copy is reused; a download is performed (concurrency-safe and atomic) when the policy calls for one; otherwise the file is streamed over HTTP range requests without touching disk.
- Parameters:
url_provider (Callable[[], str]) – Resolves the file’s signed download URL. Called at most once, and only when the chosen mode actually needs the URL.
cache_outfile (Optional[pathlib.Path]) – The file’s stable local cache path, or None when no cache location is configured (forces streaming).
policy (roboto.storage.cache.CachePolicy) – The caller’s cache policy.
estimated_column_count (int) – How many columns the read is expected to project; informs the ADAPTIVE download-vs-stream choice.
size_bytes (Optional[int]) – The backing object’s size in bytes when the server reports it; None when unknown. It gates cache reuse: a present cached file whose size does not match is treated as stale and re-fetched. On the STREAM path it lets a known-large file skip the whole-file head probe; on the DOWNLOAD path it verifies the downloaded file is complete before it is promoted to the cache.
- Returns:
An open pyarrow.parquet.ParquetFile.
- Return type:
pyarrow.parquet.ParquetFile
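The size-gated cache reuse and the verify-then-promote download described above can be sketched with the standard library alone. This is an illustrative sketch, not the module's implementation: the helper names `cached_copy_is_usable` and `download_atomically`, the `write_body` callback, and the choice to accept a cached file when `size_bytes` is None are all assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path
from typing import BinaryIO, Callable, Optional


def cached_copy_is_usable(cache_outfile: Optional[Path], size_bytes: Optional[int]) -> bool:
    """Hypothetical helper: decide whether an existing cached file may be reused."""
    if cache_outfile is None or not cache_outfile.is_file():
        return False
    if size_bytes is None:
        # Assumption: with no reported size there is nothing to compare against.
        return True
    # A size mismatch marks the cached copy as stale.
    return cache_outfile.stat().st_size == size_bytes


def download_atomically(
    write_body: Callable[[BinaryIO], object],
    cache_outfile: Path,
    size_bytes: Optional[int],
) -> Path:
    """Hypothetical helper: write to a temp file, verify it, then promote it."""
    cache_outfile.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=cache_outfile.parent, suffix=".partial")
    tmp_path = Path(tmp_name)
    try:
        with os.fdopen(fd, "wb") as tmp:
            write_body(tmp)
        if size_bytes is not None and tmp_path.stat().st_size != size_bytes:
            raise IOError("downloaded file is incomplete; not promoting to cache")
        # os.replace is atomic, so readers see either the old file or the new one.
        os.replace(tmp_path, cache_outfile)
    except BaseException:
        tmp_path.unlink(missing_ok=True)
        raise
    return cache_outfile
```

Writing to a sibling temp file and renaming it is one common way to keep concurrent readers from ever observing a partially written cache entry; the real function may achieve the same guarantee differently.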
- roboto.formats.parquet.fetch.parquet_file_from_url(signed_url, size_bytes=None)
Open a Parquet file over HTTP via a signed URL (no local download).
A single ranged GET over fsspec’s shared HTTP session probes the first _STREAM_WHOLE_FILE_PROBE_BYTES of the file. A file smaller than the probe arrives whole in that one request and is read from an in-memory buffer; a larger file (or a failed probe) falls back to HTTP range-request streaming through pyarrow’s filesystem layer.
When size_bytes is known and at least _STREAM_WHOLE_FILE_PROBE_BYTES, the file is known-large up front: the whole-file probe could never win (a BufferReader read is taken only for sub-threshold files), so it is skipped and the file is range-streamed directly, avoiding a wasted 16 MiB GET. size_bytes of None (an older server omits the size) preserves the probe-then-decide behavior.
- Parameters:
signed_url (str) – The file’s signed download URL.
size_bytes (Optional[int]) – The backing object’s size in bytes when the server reports it; None when unknown.
- Raises:
ValueError – The probe succeeds but the object is empty (0 bytes), which is not a readable Parquet file.
- Return type:
pyarrow.parquet.ParquetFile
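The probe-or-stream decision described for this function can be sketched as two small pure functions. This is a hedged illustration only: `PROBE_BYTES` stands in for `_STREAM_WHOLE_FILE_PROBE_BYTES` (assumed here to be 16 MiB, going by the "wasted 16 MiB GET" remark), and `plan_stream_read`, `interpret_probe`, and their string return values are hypothetical names that do not appear in the module.

```python
from typing import Optional

# Stand-in for _STREAM_WHOLE_FILE_PROBE_BYTES; 16 MiB is an assumption.
PROBE_BYTES = 16 * 1024 * 1024


def plan_stream_read(size_bytes: Optional[int], probe_bytes: int = PROBE_BYTES) -> str:
    """Hypothetical: pick the first step before any request is made."""
    if size_bytes is not None and size_bytes >= probe_bytes:
        # Known-large file: the probe could never return the whole file.
        return "range_stream"
    # Unknown or small size: probe first, then decide.
    return "probe"


def interpret_probe(body: bytes, probe_bytes: int = PROBE_BYTES) -> str:
    """Hypothetical: decide what to do with the bytes a successful probe returned."""
    if len(body) == 0:
        raise ValueError("object is empty (0 bytes); not a readable Parquet file")
    if len(body) < probe_bytes:
        # Fewer bytes than requested means the whole file arrived in one request.
        return "in_memory"
    # The probe filled up, so more data may remain: fall back to range streaming.
    return "range_stream"
```

In this sketch a failed probe would be handled by the caller catching the request error and choosing "range_stream", which mirrors the fallback the description mentions.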