roboto.storage.cache
Module Contents
- roboto.storage.cache.COLUMN_COUNT_LOCAL_CACHE_THRESHOLD = 10
Column count at or above which an ADAPTIVE read downloads the whole file instead of streaming.
Streaming fetches only the requested columns but pays a per-column HTTP cost, so reading many columns over the network becomes slow; downloading fetches the whole file once. 10 is the crossover: below it, streaming wins because it skips the bytes of unrequested columns; at or above it, the per-column overhead dominates and the local copy is reliably faster.
- class roboto.storage.cache.CachePolicy
Bases: str, enum.Enum
Governs whether a fetched data file is cached to local disk before reading.
The policy applies to formats with a disk-cache path (Parquet today); a format that always streams (MCAP) ignores it.
- ADAPTIVE = 'adaptive'
Reuse an already-cached file; otherwise download when the read projects enough columns (COLUMN_COUNT_LOCAL_CACHE_THRESHOLD) to justify it, and stream over HTTP when it does not.
- ALWAYS = 'always'
Download the file to the local cache before reading, regardless of how much of it the read projects.
- NEVER = 'never'
Always stream over HTTP; never read from or write to the local disk cache.
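Because CachePolicy derives from both str and enum.Enum, its members can be looked up from, and compared with, their plain string values, which is convenient when the policy comes from configuration. The sketch below re-declares the enum locally, mirroring the documented members, so that it runs without the package installed; real code would presumably import CachePolicy from roboto.storage.cache instead.

```python
import enum


class CachePolicy(str, enum.Enum):
    # Local stand-in mirroring the documented members; not the library's class.
    ADAPTIVE = "adaptive"
    ALWAYS = "always"
    NEVER = "never"


# Look a member up by its string value, e.g. one read from a config file.
policy_from_config = CachePolicy("adaptive")

# The str mixin makes members compare equal to their raw string values.
never_equals_string = CachePolicy.NEVER == "never"

# An unknown value raises ValueError instead of silently falling back.
try:
    CachePolicy("sometimes")
    rejected = False
except ValueError:
    rejected = True
```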
- class roboto.storage.cache.FetchMode(*args, **kwds)
Bases: enum.Enum
How a single remote file will be opened, as chosen by choose_fetch_mode().
- CACHED
Open the already-downloaded copy in the local cache.
- DOWNLOAD
Download to the local cache, then open the downloaded copy.
- STREAM
Open over HTTP without writing to disk.
- roboto.storage.cache.cached_file_is_current(path, expected_size)
Report whether the cached file can be reused instead of downloaded again.
Returns True when the file exists and either its size matches expected_size or expected_size is None. A size mismatch means the file’s content changed since the cache was written, so the cached bytes are stale and must not be reused; the file is then treated as absent and re-downloaded. An expected_size of None (no size reported) reuses any existing file.
- Parameters:
path (pathlib.Path) – Local cache path to check.
expected_size (Optional[int]) – The backing file’s current size in bytes, or None when unknown.
- Returns:
True when the cached file may be reused, False when it is missing or stale.
- Return type:
bool
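The freshness rule above can be sketched in a few lines. This is an illustrative re-implementation under assumptions, not the library's source: the name cached_file_is_current_sketch is hypothetical, and using Path.is_file() as the existence test is a guess at what "exists" means here.

```python
import tempfile
from pathlib import Path
from typing import Optional


def cached_file_is_current_sketch(path: Path, expected_size: Optional[int]) -> bool:
    """Hypothetical stand-in for the documented check."""
    # A missing file can never be reused.
    if not path.is_file():
        return False
    # No reported size: any existing file is accepted.
    if expected_size is None:
        return True
    # A size mismatch marks the cached bytes as stale.
    return path.stat().st_size == expected_size


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "data.parquet"
    cached.write_bytes(b"12345")  # 5 bytes on disk
    fresh = cached_file_is_current_sketch(cached, 5)
    stale = cached_file_is_current_sketch(cached, 9)
    unknown = cached_file_is_current_sketch(cached, None)
    missing = cached_file_is_current_sketch(Path(tmp) / "absent.parquet", 5)
```

Note that a size comparison only detects changes that alter the byte count; a same-size rewrite of the backing file would pass this check.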
- roboto.storage.cache.choose_fetch_mode(policy, already_cached, estimated_column_count)
Pick the cheapest way to open a remote columnar file under a cache policy.
A previously-downloaded file is always cheaper to open than streaming over HTTP, regardless of how many columns the current read projects, so an existing cached copy wins under every policy except NEVER (which never touches the disk cache at all, not even to read it).
- Parameters:
policy (CachePolicy) – The caller’s cache policy.
already_cached (bool) – Whether a complete copy already exists in the local cache.
estimated_column_count (int) – How many columns the read is expected to project; compared against COLUMN_COUNT_LOCAL_CACHE_THRESHOLD under ADAPTIVE.
- Returns:
The fetch mode to use.
- Return type:
FetchMode
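The decision rule for choose_fetch_mode() can be written out as a small function. The sketch below is an assumption-laden reconstruction from the descriptions on this page, not the library's code: the enums are local stand-ins, the FetchMode member values (enum.auto()) are invented because the page does not list them, and the order of the checks is inferred from the prose.

```python
import enum

# Value taken from the documented constant.
COLUMN_COUNT_LOCAL_CACHE_THRESHOLD = 10


class CachePolicy(str, enum.Enum):
    # Local stand-in mirroring the documented members.
    ADAPTIVE = "adaptive"
    ALWAYS = "always"
    NEVER = "never"


class FetchMode(enum.Enum):
    # Member values are an assumption; the page does not document them.
    CACHED = enum.auto()
    DOWNLOAD = enum.auto()
    STREAM = enum.auto()


def choose_fetch_mode_sketch(
    policy: CachePolicy, already_cached: bool, estimated_column_count: int
) -> FetchMode:
    """Hypothetical stand-in for the documented decision rule."""
    # NEVER bypasses the disk cache entirely, even when a copy exists.
    if policy is CachePolicy.NEVER:
        return FetchMode.STREAM
    # Under every other policy an existing complete copy wins.
    if already_cached:
        return FetchMode.CACHED
    # ALWAYS downloads regardless of how many columns are projected.
    if policy is CachePolicy.ALWAYS:
        return FetchMode.DOWNLOAD
    # ADAPTIVE: download at or above the threshold, stream below it.
    if estimated_column_count >= COLUMN_COUNT_LOCAL_CACHE_THRESHOLD:
        return FetchMode.DOWNLOAD
    return FetchMode.STREAM
```

Under this reading, an ADAPTIVE read of 9 columns streams, a read of 10 columns downloads, and either read opens the cached copy once one exists.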
- roboto.storage.cache.download_to_cache(url_provider, outfile, expected_size=None)
Download a remote file to outfile safely under concurrency.
Acquires a per-path lock to dedupe in-process downloads, double-checks existence (another thread may have completed the download while we were waiting), creates the cache directory lazily, and writes via a uniquely named .part file followed by os.replace(). The rename is atomic on POSIX and Windows, so:
- Readers never see a partial file at outfile.
- Two processes racing both produce a complete file; the second os.replace simply overwrites the first.
- If the download raises, the .part file is removed and outfile is left untouched.
When expected_size is supplied, the written .part file’s size is verified against it before the atomic rename. A truncated body (which urllib.request.urlretrieve() does not flag when the response carries no Content-Length) fails this check, so the .part is discarded and nothing is promoted to the cache: a partial download is never made sticky.
- Parameters:
url_provider (Callable[[], str]) – Resolves the download URL. Called only when the download actually proceeds, so a signed URL is not minted for a file that turns out to already be cached.
outfile (pathlib.Path) – Final cache path for the downloaded file.
expected_size (Optional[int]) – The backing object’s size in bytes when the server reports it; None skips the completeness check.
- Raises:
ValueError – expected_size is supplied and the downloaded file’s size does not match it (a truncated or otherwise incomplete download).
- Return type:
None
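The lock, double-check, .part write, size verification, and atomic rename described for download_to_cache() fit together roughly as follows. This is a minimal sketch under stated assumptions rather than the library's implementation: the names download_to_cache_sketch and _lock_for are hypothetical, the lock registry is a plain dict guarded by one lock, and the .part naming scheme (a UUID suffix) is invented for illustration.

```python
import os
import threading
import urllib.request
import uuid
from pathlib import Path
from typing import Callable, Optional

# Hypothetical in-process registry: one lock per cache path.
_locks = {}
_locks_guard = threading.Lock()


def _lock_for(key: str) -> threading.Lock:
    # Create the per-path lock on first use; the guard keeps creation race-free.
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())


def download_to_cache_sketch(
    url_provider: Callable[[], str],
    outfile: Path,
    expected_size: Optional[int] = None,
) -> None:
    with _lock_for(str(outfile)):
        # Double-check: another thread may have finished while we waited.
        if outfile.exists():
            return
        # Create the cache directory lazily.
        outfile.parent.mkdir(parents=True, exist_ok=True)
        # Uniquely named temp file, so racing writers never share a path.
        part = outfile.with_name(f"{outfile.name}.{uuid.uuid4().hex}.part")
        try:
            # The URL is resolved only now that a download is really needed.
            urllib.request.urlretrieve(url_provider(), str(part))
            if expected_size is not None:
                actual = part.stat().st_size
                if actual != expected_size:
                    raise ValueError(
                        f"expected {expected_size} bytes, got {actual}"
                    )
            # Atomic promotion: readers see the old state or the full file.
            os.replace(part, outfile)
        finally:
            # Removes a leftover .part on failure; a no-op after the rename.
            part.unlink(missing_ok=True)
```

The in-process lock only dedupes threads within one interpreter; across processes, correctness in this sketch rests on the unique .part name and the atomic os.replace().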
- roboto.storage.cache.get_download_lock(key)
Return the in-process download lock for a cache path, creating it on first use.
- Parameters:
key (str)
- Return type:
threading.Lock
- roboto.storage.cache.logger