roboto.storage.cache

Module Contents

roboto.storage.cache.COLUMN_COUNT_LOCAL_CACHE_THRESHOLD = 10

Column count at or above which an ADAPTIVE read downloads the whole file instead of streaming.

Streaming fetches only the requested columns but pays an HTTP cost per column, so reads that project many columns become slow over the network; downloading transfers the whole file once. The threshold of 10 is the crossover: below it, streaming wins because it avoids transferring bytes the read never uses; at or above it, the per-column overhead dominates and the local copy is reliably faster.

class roboto.storage.cache.CachePolicy

Bases: str, enum.Enum

Governs whether a fetched data file is cached to local disk before reading.

The policy applies to formats with a disk-cache path (Parquet today); a format that always streams (MCAP) ignores it.

ADAPTIVE = 'adaptive': Reuse an already-cached file; otherwise download when the read projects enough columns (COLUMN_COUNT_LOCAL_CACHE_THRESHOLD) to justify it, and stream over HTTP when it does not.

ALWAYS = 'always': Download the file to the local cache before reading, regardless of how much of it the read projects.

NEVER = 'never': Always stream over HTTP; never write to local disk.
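
Because CachePolicy mixes in str, each member compares equal to its string value and can be looked up from that value, which helps when a policy arrives as configuration text. The sketch below is a minimal illustration under assumptions: it uses a local stand-in enum with the documented members instead of importing the library, and parse_policy is a hypothetical helper, not part of roboto.storage.cache.

```python
import enum


class CachePolicy(str, enum.Enum):
    """Local stand-in mirroring the documented members (not the library class)."""

    ADAPTIVE = "adaptive"
    ALWAYS = "always"
    NEVER = "never"


def parse_policy(text: str) -> CachePolicy:
    """Hypothetical helper: map a config string to a policy, ignoring case and padding."""
    # Enum lookup by value raises ValueError for an unknown policy name.
    return CachePolicy(text.strip().lower())


policy = parse_policy(" Always ")
```

An unrecognized string raises ValueError from the enum lookup, so a misspelled policy fails at parse time instead of silently falling back to a default.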

class roboto.storage.cache.FetchMode(*args, **kwds)

Bases: enum.Enum

How a single remote file will be opened, as chosen by choose_fetch_mode().

CACHED: Open the already-downloaded copy in the local cache.

DOWNLOAD: Download to the local cache, then open the downloaded copy.

STREAM: Open over HTTP without writing to disk.

roboto.storage.cache.cached_file_is_current(path, expected_size)

Report whether the cached file can be reused instead of downloaded again.

Returns True when the file exists and either its size matches expected_size or expected_size is None. A size mismatch means the file’s content changed since the cache was written, so the cached bytes are stale and must not be reused; the file is then treated as absent and re-downloaded. expected_size of None (no size reported) reuses any existing file.

Parameters:

path (pathlib.Path) – Local cache path to check.
expected_size (Optional[int]) – The backing file’s current size in bytes, or None when unknown.

Returns:

True when the cached file may be reused, False when it is missing or stale.

Return type:

bool
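
The reuse rule described above can be restated as a few lines of Python. This is a sketch of the documented behavior only, not the library source; the function name cached_file_is_current_sketch is chosen here to make that distinction clear.

```python
from pathlib import Path
from typing import Optional


def cached_file_is_current_sketch(path: Path, expected_size: Optional[int]) -> bool:
    """Illustrative restatement of the documented rule (not the library code)."""
    if not path.is_file():
        return False  # missing: nothing to reuse
    if expected_size is None:
        return True  # no size reported: reuse whatever file exists
    # A size mismatch marks the cached bytes as stale.
    return path.stat().st_size == expected_size
```

Note that a size comparison cannot detect a change that leaves the byte count unchanged; the check is a cheap staleness signal rather than a content hash.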

roboto.storage.cache.choose_fetch_mode(policy, already_cached, estimated_column_count)

Pick the cheapest way to open a remote columnar file under a cache policy.

A previously downloaded file is always cheaper to open than streaming over HTTP, regardless of how many columns the current read projects, so an existing cached copy wins under every policy except NEVER (which never touches the disk cache at all, not even to read it).

Parameters:

policy (CachePolicy) – The caller’s cache policy.
already_cached (bool) – Whether a complete copy already exists in the local cache.
estimated_column_count (int) – How many columns the read is expected to project; compared against COLUMN_COUNT_LOCAL_CACHE_THRESHOLD under ADAPTIVE.

Returns:

The fetch mode to use.

Return type:

FetchMode
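
Read together with the CachePolicy and FetchMode descriptions, the decision can be sketched as a short chain of checks. The code below is an interpretation of the documentation, not the library implementation: the enums are local stand-ins, the FetchMode member values are an assumption (the reference lists only the names), and the order of checks is one arrangement that satisfies the documented rules.

```python
import enum

COLUMN_COUNT_LOCAL_CACHE_THRESHOLD = 10  # value documented for this module


class CachePolicy(str, enum.Enum):
    """Local stand-in for the documented policy enum."""

    ADAPTIVE = "adaptive"
    ALWAYS = "always"
    NEVER = "never"


class FetchMode(enum.Enum):
    """Local stand-in; member values are assumed, only the names are documented."""

    CACHED = enum.auto()
    DOWNLOAD = enum.auto()
    STREAM = enum.auto()


def choose_fetch_mode_sketch(
    policy: CachePolicy, already_cached: bool, estimated_column_count: int
) -> FetchMode:
    # NEVER does not touch the disk cache, not even to read an existing copy.
    if policy is CachePolicy.NEVER:
        return FetchMode.STREAM
    # Under every other policy an existing cached copy wins.
    if already_cached:
        return FetchMode.CACHED
    # ALWAYS downloads regardless of how many columns are projected.
    if policy is CachePolicy.ALWAYS:
        return FetchMode.DOWNLOAD
    # ADAPTIVE: download at or above the threshold, stream below it.
    if estimated_column_count >= COLUMN_COUNT_LOCAL_CACHE_THRESHOLD:
        return FetchMode.DOWNLOAD
    return FetchMode.STREAM
```

Under these assumptions, an ADAPTIVE read of 3 columns with no cached copy streams, the same read with 10 columns downloads, and either read opens the cached copy once one exists.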

roboto.storage.cache.download_to_cache(url_provider, outfile, expected_size=None)

Download a remote file to outfile safely under concurrency.

Acquires a per-path lock to dedupe in-process downloads, double-checks existence (another thread may have completed the download while we were waiting), creates the cache directory lazily, and writes via a uniquely named .part file followed by os.replace(). The rename is atomic on POSIX and Windows, so:

Readers never see a partial file at outfile.
Two processes racing both produce a complete file; the second os.replace simply overwrites the first.
If the download raises, the .part file is removed and outfile is left untouched.

When expected_size is supplied, the written .part file’s size is verified against it before the atomic rename. A truncated body, which urllib.request.urlretrieve() does not flag when the response carries no Content-Length, fails this check, so the .part file is discarded and nothing is promoted to the cache. A partial download therefore never becomes a persistent cache entry.

Parameters:

url_provider (Callable[[], str]) – Resolves the download URL. Called only when the download actually proceeds, so a signed URL is not minted for a file that turns out to already be cached.
outfile (pathlib.Path) – Final cache path for the downloaded file.
expected_size (Optional[int]) – The backing object’s size in bytes when the server reports it; None skips the completeness check.

Raises:

ValueError – expected_size is supplied and the downloaded file’s size does not match it (a truncated or otherwise incomplete download).

Return type:

None
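
The sequence described above (per-path lock, re-check, lazy directory creation, unique .part file, size check, os.replace()) can be sketched as follows. This is a minimal illustration under assumptions, not the library source: the lock registry, the .part naming scheme, the plain existence re-check, and the error message are all choices made for this example.

```python
import os
import threading
import urllib.request
import uuid
from pathlib import Path
from typing import Callable, Dict, Optional

# Hypothetical in-process registry: one lock per cache path.
_locks: Dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()


def _lock_for(key: str) -> threading.Lock:
    """Return the lock for a path, creating it on first use."""
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())


def download_to_cache_sketch(
    url_provider: Callable[[], str],
    outfile: Path,
    expected_size: Optional[int] = None,
) -> None:
    with _lock_for(str(outfile)):
        # Re-check under the lock: another thread may have finished first.
        # (Assumption: a plain existence test stands in for the real check.)
        if outfile.exists():
            return
        outfile.parent.mkdir(parents=True, exist_ok=True)
        # A unique temporary name keeps concurrent writers from colliding.
        part = outfile.with_name(f"{outfile.name}.{uuid.uuid4().hex}.part")
        try:
            # The URL is resolved only now, when the download really happens.
            urllib.request.urlretrieve(url_provider(), str(part))
            actual = part.stat().st_size
            if expected_size is not None and actual != expected_size:
                raise ValueError(
                    f"incomplete download: got {actual} bytes, expected {expected_size}"
                )
            os.replace(part, outfile)  # promote the complete file in one step
        finally:
            # After a successful replace the .part path is already gone.
            part.unlink(missing_ok=True)
```

The in-process lock only deduplicates threads within one interpreter; separate processes are covered by the unique .part name and the final os.replace(), as the list above notes.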

roboto.storage.cache.get_download_lock(key)

Return the in-process download lock for a cache path, creating it on first use.

Parameters:

key (str)

Return type:

threading.Lock

roboto.storage.cache.logger