DataCaches.jl

A lightweight, file-backed key-value cache for Julia, intended for workflows that make frequent function calls that are expensive in time or network bandwidth (remote API queries, long-running computations) and that need the results to be available across Julia sessions.

DataCaches selects a transparent, inspectable storage format automatically based on the data type, falling back to Julia's binary serialization for types without a dedicated format:

| Data type | Format | File | Version-stable? |
|---|---|---|---|
| `DataFrame`, Tables.jl-compatible | CSV | `.csv` | Yes |
| `NamedTuple` | JSON | `.json` | Yes (JSON-primitive values) |
| Anything else | Julia serialization | `.jls` | No |

The storage format is recorded per entry so the correct deserializer is always used on read, regardless of the Julia version. Custom serializers can be registered for additional types via DataCaches.register_serializer!.
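
To make the idea of per-entry format tags concrete, here is a minimal, stdlib-only sketch. It is not the DataCaches.jl implementation: `format_tag`, `write_entry`, and `read_entry` are hypothetical names, and the JSON branch is a placeholder that writes plain text instead of calling a real JSON writer. It only illustrates selecting a format by type through dispatch and then choosing the reader from the recorded tag.

```julia
using Serialization

# Hypothetical names for illustration; not the DataCaches.jl internals.
format_tag(::NamedTuple) = "json"   # types with a dedicated text format
format_tag(::Any) = "jls"           # fallback: Julia binary serialization

# Write `value` under `dir` and return a small record holding the path and
# the format tag, standing in for one row of a cache index.
function write_entry(dir::AbstractString, name::AbstractString, value)
    tag = format_tag(value)
    path = joinpath(dir, name * "." * tag)
    if tag == "jls"
        serialize(path, value)
    else
        # Placeholder: a real implementation would call a JSON writer here.
        write(path, repr(value))
    end
    return (path = path, format = tag)
end

# Choose the reader from the recorded tag, not from the file contents.
function read_entry(entry)
    entry.format == "jls" && return deserialize(entry.path)
    return read(entry.path, String)   # placeholder text reader
end

dir = mktempdir()
binary_entry = write_entry(dir, "lookup", Dict(:a => 1))
text_entry = write_entry(dir, "params", (alpha = 0.5, n = 10))
restored = read_entry(binary_entry)
```

Because the tag travels with the entry, the reader never has to guess a format from the file extension or contents.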

Three levels of caching are provided, from lightest-weight to most manual:

| Level | Mechanism | Persistence | Works with any function? |
|---|---|---|---|
| Memoized | `@filecache` | Across sessions | Yes |
| Refresh | `@filecache!` | Across sessions | Yes |
| Memoized | `@memcache` | In-session only | Yes |
| Explicit | `dc["label"] = result` | Across sessions | Yes |
| Automatic | `set_autocaching!` | Across sessions | Only if instrumented |

Purpose

The purpose of this package is to provide a persistent, file-backed key-value store for arbitrary Julia objects, keyed by user-assigned labels or auto-generated argument hashes. Expensive function calls can then be short-circuited: a repeated call returns the stored result instead of recomputing it, even across Julia sessions. The cache is also portable and inspectable, so it can be shared across users or systems without requiring database infrastructure.

This package also provides mechanisms allowing library developers to patch in support for a fully transparent, under-the-hood auto-caching layer that requires no changes to user-facing call syntax. This keeps exploratory and instructional code clean and readable, with caching remaining invisible in automatic mode and introducing no modifications to program logic or presentation.
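
As a rough illustration of how a library function can be instrumented once and then cached without any change at the call site, consider the following sketch. It is an assumption-laden toy, not the mechanism DataCaches.jl uses: `AUTOCACHE_ON`, `STORE`, `autocached`, and `expensive_query` are all hypothetical, and an in-memory `Dict` stands in for the file-backed store.

```julia
# Hypothetical names; a stdlib-only sketch, not the DataCaches.jl mechanism.
const AUTOCACHE_ON = Ref(false)
const STORE = Dict{Any,Any}()   # in-memory stand-in for a file-backed store

# A library author wraps the body of an expensive function once.
function autocached(compute, key)
    AUTOCACHE_ON[] || return compute()   # caching off: run as usual
    return get!(compute, STORE, key)     # caching on: compute at most once per key
end

const CALLS = Ref(0)   # counts how often the body really runs

function expensive_query(x)
    autocached((:expensive_query, x)) do
        CALLS[] += 1
        x^2
    end
end

uncached = expensive_query(3)   # caching off: body runs
AUTOCACHE_ON[] = true
first_cached = expensive_query(3)    # body runs and the result is stored
second_cached = expensive_query(3)   # served from STORE, body not run
```

The call `expensive_query(3)` looks the same in all three cases; only the global toggle changes what happens underneath.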

Features

DataCaches.jl provides three complementary interfaces aligned with its purpose:

a lightweight memoization mechanism, not requiring any function instrumentation or wrapping, enabling selective, automated caching of any function call (keyed on the runtime values of all arguments)

# If the active disk cache does not have this particular 
# combination of function name and argument values stored,
# then the function will be evaluated, cached, and returned.
foo = @filecache func1(x, y) 
# Function not evaluated; cached result returned
bar = @filecache func1(x, y) 
# Unconditionally re-execute and overwrite the cache entry
foo = @filecache! func1(x, y)

a straightforward Dict-style API for explicit, manual cache control

# Just like a `Dict`, but auto-persists across sessions.
cache["fig1"] = plot(...)
fig1 = cache["fig1"]

a fully seamless mode in which specially-instrumented or wrapped function calls are cached on first execution and transparently retrieved thereafter, with no changes to call sites (see Pattern 3 — Automatic caching)

# Once functions are instrumented (or wrapped), enable automatic caching:
set_autocaching!(true)
foo = func1(x, y)   # fetches + stores
bar = func1(x, y)   # instant, from cache — call site unchanged

with the following design principles:

Syntactically lightweight or (almost) invisible.
Seamless integration into REPL- or script-based workflow without requiring any change of logic or structure.
Straightforward, flexible, and completely transparent management of the cache store, with views and data accessible not only through Julia for convenience, but also through standard file-system tools.
Cache store setup and management are nonetheless completely optional, and novice users need not even be aware of the store's existence or operation.
The cache store and its use persist across Julia sessions (i.e., not in-memory only, though that is supported).
A particular cache store file-system directory can be shared across different computing systems or users by copying, cloning, or as a compressed archive.
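
The memoization interface described above keys each result on the function name and the runtime argument values. A minimal sketch of that idea, using only Julia's standard libraries, might look like the following. The helper `cached_call`, its `cachedir` and `refresh` keywords, and the use of a SHA-256 digest of `repr` as the key are all assumptions made for illustration; they are not how `@filecache` is implemented.

```julia
using Serialization, SHA

# Hypothetical helper, not part of DataCaches.jl.
# Results are stored one file per (function name, arguments) combination.
function cached_call(f, args...; cachedir::AbstractString, refresh::Bool = false, kwargs...)
    # Derive a file name from the function name and argument values.
    key = bytes2hex(sha256(repr((nameof(f), args, collect(kwargs)))))
    path = joinpath(cachedir, key * ".jls")
    # Return the stored result unless a refresh was requested.
    if !refresh && isfile(path)
        return deserialize(path)
    end
    result = f(args...; kwargs...)
    serialize(path, result)
    return result
end

calls = Ref(0)                           # counts real evaluations
slow_add(x, y) = (calls[] += 1; x + y)

dir = mktempdir()
a = cached_call(slow_add, 1, 2; cachedir = dir)                  # evaluated and stored
b = cached_call(slow_add, 1, 2; cachedir = dir)                  # read from disk
calls_after_two = calls[]
c = cached_call(slow_add, 1, 2; cachedir = dir, refresh = true)  # forced re-evaluation
```

The `refresh` keyword plays the role that `@filecache!` plays in the examples above: it always re-executes and overwrites the stored entry.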

Installation

At the Julia REPL, type "]" to switch into Package manager mode and then type:

pkg> add DataCaches

Or, either in the Julia REPL or a script:

using Pkg
Pkg.add("DataCaches")

Or, if you want the latest development version from the source repository:

using Pkg
Pkg.add(url = "https://github.com/JuliaData/DataCaches.jl")

Quick Start

Cache setup approaches

The setup options range from no setup at all to full control over the cache location.

The "No-setup" setup: the default cache

In the no-setup approach, we do not explicitly open a cache before running any cache operations; a default ":_DEFAULT" cache is used automatically.

The default file cache, which is used whenever no cache is specified explicitly, is given by

using DataCaches
dc = DataCaches.default_filecache()

The default cache path will be located in the DataCaches module-scoped scratchspace in the Julia depot directory,

/home/username/.julia/scratchspaces/c1455f2b-6d6f-4f37-b463-919f923708a5/caches/user/_DEFAULT

and is equivalent to the user creating a cache named ":_DEFAULT" using public cache creation mechanics:

dc = DataCache(:_DEFAULT)

A named cache in the `DataCaches` scratchspace depot

This approach creates caches that live inside DataCaches.jl's own depot, siloed from each other and from the default ":_DEFAULT" cache.

All of these caches are automatically deleted if this package is uninstalled and Pkg.gc() is run.

Individual users can name individual caches:

using DataCaches

# for keeping project-specific data separate
dc = DataCache(:project123)
# store: /home/username/.julia/scratchspaces/c1455f2b-6d6f-4f37-b463-919f923708a5/caches/user/project123
dc = DataCache(:gbifdata)
# store: /home/username/.julia/scratchspaces/c1455f2b-6d6f-4f37-b463-919f923708a5/caches/user/gbifdata
dc = DataCache(:mcmcruns)
# store: /home/username/.julia/scratchspaces/c1455f2b-6d6f-4f37-b463-919f923708a5/caches/user/mcmcruns

Package authors (or users) can have module-space specific silos:

using DataCaches
dc = DataCaches.scratch_datacache!(MyPackage_UUID, :rasterdata)
# store: /home/username/.julia/scratchspaces/c1455f2b-6d6f-4f37-b463-919f923708a5/caches/module/<MyPackage_UUID>/rasterdata

A cache in an arbitrary filesystem path

To locate a cache outside of this package's scratchspace depot, for easier or customized file-system management, or for cache assets to persist even if this package is uninstalled, provide any writeable path as a string.

using DataCaches
# Explicit path, for a cache open to 
# non-hidden file-system views.
dc = DataCache(joinpath(homedir(), "shared", "data", "downloads"))
dc = DataCache("/tmp/workshop/data")

Caching approaches

As with cache setup, the caching approaches offer different levels of automation versus control.

using DataCaches, PaleobiologyDB

# Optional: show caching operations in debug logs
ENV["JULIA_DEBUG"] = "DataCaches"

# Here we use a siloed project specific cache.
dc = DataCache(:myproject)
set_default_filecache!(dc)

# If we had not called `set_default_filecache!` above,
# the default cache would be used in all the patterns
# below, with `dc` given by:
# dc = DataCaches.default_filecache()

# --- Pattern 1: @filecache — works with any function, no setup beyond the cache ---
occs = @filecache pbdb_occurrences(base_name = "Canidae", show = "full")  # fetches + stores
occs = @filecache pbdb_occurrences(base_name = "Canidae", show = "full")  # from cache

# --- Pattern 2: explicit dict-style — full control over labels and timing ---
dc["canidae_occs"] = pbdb_occurrences(base_name = "Canidae", show = "full")
occs = dc["canidae_occs"]

# --- Pattern 3: set_autocaching! — zero call-site changes, but requires
#     instrumented or wrapped functions (see Pattern 3 section below) ---
set_autocaching!(true; cache = dc)
occs = pbdb_occurrences(base_name = "Canidae", show = "full")   # fetches + stores
occs = pbdb_occurrences(base_name = "Canidae", show = "full")   # from cache, unchanged call
set_autocaching!(false)
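
Pattern 2's Dict-style access can be pictured as a small type that maps `getindex` and `setindex!` onto files in a directory. The sketch below is a stdlib-only toy under that assumption: `TinyFileStore` and `entry_path` are hypothetical names, every value is stored with Julia serialization, and labels are assumed to be valid file names. It is not the `DataCache` type itself.

```julia
using Serialization

# Hypothetical type for illustration; not the DataCaches.jl `DataCache`.
struct TinyFileStore
    dir::String
end

entry_path(s::TinyFileStore, label::AbstractString) = joinpath(s.dir, label * ".jls")

# store[label] = value : write the value to its own file
function Base.setindex!(s::TinyFileStore, value, label::AbstractString)
    mkpath(s.dir)
    serialize(entry_path(s, label), value)
    return s
end

# store[label] : read the value back, or throw KeyError like a Dict
function Base.getindex(s::TinyFileStore, label::AbstractString)
    path = entry_path(s, label)
    isfile(path) || throw(KeyError(label))
    return deserialize(path)
end

Base.haskey(s::TinyFileStore, label::AbstractString) = isfile(entry_path(s, label))

dir = mktempdir()
store = TinyFileStore(dir)
store["canidae_counts"] = [3, 1, 4]

# A second handle on the same directory sees the entry, as a new session would.
reopened = TinyFileStore(dir)
counts = reopened["canidae_counts"]
```

Because the state lives in the directory rather than in the object, any handle opened on the same path reads the same entries.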

More details on these usage patterns can be found in the note on usage patterns.

Comparison of caching strategies

| | `@filecache` | `@filecache!` | `@memcache` | `dc["label"] = ...` | `set_autocaching!` |
|---|---|---|---|---|---|
| Persists across sessions | Yes | Yes | No | Yes | Yes |
| Works with any library | Yes | Yes | Yes | Yes | Only if instrumented (or wrapped) |
| Changes call sites | Yes | Yes | Yes | Yes | No |
| Label is human-readable | Hash | Hash | Hash | Yes | Hash |
| Always re-executes | No | Yes | No | Yes | With `force_refresh = true` |
| Granularity | Per macro site | Per macro site | Per macro site | Any | Per function |

Documentation

The full API reference is hosted online at https://juliadata.org/DataCaches.jl.

To build the documentation locally, run from the repository root:

# One-time setup
julia --project=docs -e 'using Pkg; Pkg.instantiate()'
# Build
julia --project=docs docs/make.jl

The generated site is written to docs/build/. Open docs/build/index.html in a browser to view it. See docs/README.md for details.

Testing

Run the full test suite with:

julia -e 'import Pkg; Pkg.test("DataCaches")'

Or, in package manager REPL mode (]):

pkg> test DataCaches

See test/README.md for more options.

Named depot caches

Pass a Symbol to DataCache to create a named user cache inside DataCaches.jl's own depot directory. No path management or UUIDs required, and the cache is automatically removed if DataCaches.jl is uninstalled:

dc = DataCache(:myproject)

The store lives at ~/.julia/scratchspaces/<DataCaches-UUID>/caches/user/myproject/. Multiple independent stores are created by using different symbols:

queries = DataCache(:pbdb_queries)
taxa    = DataCache(:taxonomy)

This form is also convenient for library authors who want a lifecycle-managed cache without introducing their own path management:

const _CACHE = Ref{Union{DataCache,Nothing}}(nothing)
function __init__()
    _CACHE[] = DataCache(:mypackage_results)
end

Inspecting entries — `entries`, `entry`, `labels`

These exported functions provide the primary API for examining what is stored in a cache. No submodule import is needed.

using DataCaches
using Dates   # provides DateTime, used in the filters below

dc = DataCache(:myproject)

# --- Get all entries (returns Vector{CacheEntry}) ---
everything = entries(dc)                       # named to avoid shadowing Base.all
labeled = entries(dc; labeled = true)          # only entries with a user-assigned label
recent  = entries(dc; after = DateTime("2026-01-01T00:00:00"))
lru     = entries(dc; sortby = :dateaccessed_desc)   # oldest-accessed first (LRU)
big     = entries(dc; sortby = :size_desc)            # largest first
found   = entries(dc; pattern = r"canidae")           # regex on label / description
entries()                                             # default cache

# --- Get a single entry by label or sequence index ---
e = entry(dc, "canidae_occs")   # by label  → CacheEntry (throws KeyError if absent)
e = entry(dc, 3)                # by seq    → CacheEntry
e = entry("canidae_occs")       # default cache

# Use the entry to read data, delete, relabel, etc.
data = read(dc, e)
delete!(dc, e)
relabel!(dc, e, "canidae_occurrences")

# --- Get all user-assigned labels ---
lbls = labels(dc)    # → Vector{String}, no empty strings
labels()             # default cache

Each CacheEntry has these fields:

| Field | Type | Description |
|---|---|---|
| `e.id` | `String` | UUID (unique identifier) |
| `e.seq` | `Int` | Stable integer index shown by `showcache` |
| `e.label` | `String` | User-assigned label (empty if none) |
| `e.path` | `String` | Absolute path to the backing file |
| `e.format` | `String` | Storage format tag: `"csv"`, `"json"`, `"jls"`, `"png"`, … |
| `e.description` | `String` | Source expression (from `@filecache`; empty if none) |
| `e.datecached` | `DateTime` | When the entry was written |
| `e.dateaccessed` | `DateTime` | When the entry was last read |

Backward compatibility: CacheKey is a silent alias for CacheEntry. Code written against earlier releases continues to work unchanged.

CacheAssets — managing assets within a cache

DataCaches.CacheAssets is a submodule that provides a filesystem-style interface for inspecting and managing individual entries within a DataCache. It is public but not exported; use using DataCaches.CacheAssets to bring it into scope.

All functions accept an optional leading DataCache argument. When omitted, default_filecache() is used.

using DataCaches
using DataCaches.CacheAssets
using Dates

dc = DataCache(:myproject)

# --- List (data) — same filter/sort kwargs as entries() ---
all_entries = CacheAssets.ls(dc)                                 # → Vector{CacheEntry}
all_entries = CacheAssets.ls(dc; pattern = r"canidae")           # filter by label/description
all_entries = CacheAssets.ls(dc; sortby = :dateaccessed_desc)    # LRU: oldest access first
all_entries = CacheAssets.ls(dc; sortby = :size_desc)            # largest first
all_entries = CacheAssets.ls(dc; after = DateTime("2026-01-01T00:00:00"), labeled = true)

# Filter by file path / filename
csv_entries  = CacheAssets.ls(dc; filename_pattern = r"\.csv$")          # CSV-backed only
proj_entries = CacheAssets.ls(dc; filepath_pattern = r"/myproject/")     # path substring
stale        = CacheAssets.ls(dc; before = DateTime("2026-01-01T00:00:00"))  # old entries

# Filter by access date
hot  = CacheAssets.ls(dc; accessed_after_date  = DateTime("2026-03-01T00:00:00"))
cold = CacheAssets.ls(dc; accessed_before_date = DateTime("2026-01-01T00:00:00"))

# --- List (display) ---
CacheAssets.ls!(dc)                                # normal detail: seq, timestamp, label, path
CacheAssets.ls!(dc; detail = :minimal)             # seq + label only
CacheAssets.ls!(dc; detail = :full)                # + access time, file size, format
CacheAssets.ls!(dc; pattern = r"canidae")          # filter by label/description (same as ls)
CacheAssets.ls!(dc; filename_pattern = r"\.csv$")  # CSV-backed entries
CacheAssets.ls!(dc; sortby = :dateaccessed_desc)   # LRU: oldest access first
CacheAssets.ls!(dc; sortby = :size_desc)           # largest first
CacheAssets.ls!(dc; io = my_io)                    # redirect output

# --- Remove ---
CacheAssets.rm(dc, "old_label")                    # by label
CacheAssets.rm(dc, 2)                              # by sequence index
CacheAssets.rm(dc, "label1", "label2", 5)          # multiple assets, single index rewrite

# Vector form — handy with ls results (one batched index rewrite)
CacheAssets.rm(dc, stale)                          # remove all stale entries
CacheAssets.rm(dc, [entry1, "label2", 5])          # mixed specifier types
CacheAssets.rm(dc, ["maybe1", "maybe2"]; force = true)  # skip unresolvable

# delete! also accepts a vector directly on the DataCache
delete!(dc, stale)
delete!(dc, ["canidae_occs", 3, entry1])

# --- Relabel within a cache ---
CacheAssets.mv(dc, "old_label", "new_label")
CacheAssets.mv(dc, 3, "new_label")                 # by sequence index

# --- Move to another cache ---
dc2 = DataCache(:archive)
CacheAssets.mv(dc, "canidae_occs", dc2)
CacheAssets.mv(dc, "canidae_occs", dc2; label = "canidae_archived")

# --- Copy to another cache ---
CacheAssets.cp(dc, "canidae_occs", dc2)
CacheAssets.cp(dc, ["canidae_occs", "dino_taxa"], dc2)   # multiple assets

# --- Default cache (omit the DataCache argument) ---
CacheAssets.ls()                                   # → Vector{CacheEntry}
CacheAssets.ls!()                                  # prints to stdout
CacheAssets.rm("stale_entry")
CacheAssets.rm(stale)                              # vector form also works
CacheAssets.mv("old", "new")

Access-time tracking

By default, every read updates the dateaccessed timestamp of the entry that was read, enabling LRU inspection and future pruning. This requires rewriting the cache index on every read. For caches that are read very frequently or that contain many entries, opt out by constructing the cache with track_access = false:

dc = DataCache(:high_frequency; track_access = false)

Caches — managing named caches

DataCaches.Caches is a submodule that provides a filesystem-style interface for browsing and managing the caches (rather than assets within a particular cache) that live in the DataCaches scratchspace (~/.julia/scratchspaces/<DataCaches-UUID>/). It is public but not exported; access it as DataCaches.Caches.

The scratchspace uses a structured subdirectory layout:

~/.julia/scratchspaces/<DataCaches-UUID>/
  caches/
    user/
      _DEFAULT/             ← DataCache() / DataCache(:_DEFAULT) default store
      <name>/              ← DataCache(:name) stores
    module/<uuid>/<key>/   ← scratch_datacache!(uuid, key) stores

using DataCaches

# Inspect the scratchspace
DataCaches.Caches.pwd()           # → "/home/user/.julia/scratchspaces/c1455f2b-..."
DataCaches.Caches.defaultstore()  # → ".../c1455f2b-.../caches/user/_DEFAULT"
DataCaches.Caches.ls()            # → [:user, :module]                          (caches root — default)
DataCaches.Caches.ls(:user)       # → [:_DEFAULT, :myproject, :taxonomy, ...]    (user stores)
DataCaches.Caches.ls(:module)     # → [Symbol("uuid1/key1"), ...]               (module stores)
DataCaches.Caches.ls!()           # prints caches root to stdout
DataCaches.Caches.ls!(:user)      # prints user store names to stdout
DataCaches.Caches.ls!(:user; io = my_io)  # redirect output

# Create named caches as usual, then manage them through Caches
queries = DataCache(:myproject)
taxa    = DataCache(:taxonomy)

# Rename within scratchspace
DataCaches.Caches.mv(:myproject, :archived_project)

# Copy within scratchspace
DataCaches.Caches.cp(:taxonomy, :taxonomy_backup)

# Export to / import from the filesystem
DataCaches.Caches.mv(:archived_project, "/data/exports/myproject")  # move out
DataCaches.Caches.mv("/data/imports/shared_cache", :shared)         # move in
DataCaches.Caches.cp(:taxonomy, "/tmp/taxonomy_snapshot")           # copy out

# Remove
DataCaches.Caches.rm(:taxonomy_backup)
DataCaches.Caches.rm(:nonexistent; force=true)  # silently ignore if absent

See the full API reference for complete documentation of each function.

Cache expiration and invalidation

DataCaches provides first-class invalidation — TTL, stale detection, bulk removal, and automatic post-write purge policies.

TTL and staleness

using DataCaches, Dates

# Per-entry TTL
dc = DataCache(:myproject)
write!(dc, result; label = "query1", ttl = Hour(6))

# Check before re-using
if isstale(dc, "query1")
    result = fetch_live()
    write!(dc, result; label = "query1", ttl = Hour(6))
end

# Cache-level default TTL (persisted across sessions)
dc = DataCache(:myproject; default_ttl = Day(1))
write!(dc, result; label = "query2")   # inherits Day(1)

isstale is purely informational — read never blocks on staleness, enabling stale-while-revalidate patterns.
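
The stale-while-revalidate idea can be sketched with the `Dates` standard library alone. The helpers below are hypothetical (`is_expired` and `read_with_hint` are not DataCaches.jl functions); they only illustrate that a staleness check compares the write time plus the TTL against the current time, while the read itself always returns the stored value.

```julia
using Dates

# Hypothetical helper; mirrors the idea of `isstale`, not its implementation.
is_expired(datecached::DateTime, ttl::Period, at::DateTime) = at > datecached + ttl

# Stale-while-revalidate: always return what is stored, and report whether
# the caller should schedule a refresh.
function read_with_hint(value, datecached::DateTime, ttl::Period; at::DateTime = now())
    return (value = value, needs_refresh = is_expired(datecached, ttl, at))
end

written = DateTime(2026, 1, 1, 0, 0, 0)

# Five hours after writing, a six-hour TTL has not yet elapsed.
fresh = read_with_hint("payload", written, Hour(6); at = DateTime(2026, 1, 1, 5))

# Seven hours after writing, the entry is stale but still readable.
stale = read_with_hint("payload", written, Hour(6); at = DateTime(2026, 1, 1, 7))
```

In both cases the caller gets the payload immediately; the `needs_refresh` flag leaves the decision about when to re-fetch to the caller.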

Bulk invalidation

# Remove all stale entries
invalidate!(dc; stale = true)

# Remove by label pattern
invalidate!(dc; pattern = r"^temp_")

# Remove by format
invalidate!(dc; format = "jls")

# Remove by predicate
invalidate!(dc; predicate = e -> startswith(e.description, "old_func("))

# Preview without deleting
invalidate!(dc; stale = true, dry_run = true)

Power purging

using DataCaches.CacheAssets

# LRU: keep only the 20 most recently accessed entries
purge!(dc; keep_count = 20)

# Remove entries older than 30 days
purge!(dc; max_age = Day(30))

# Limit total cache size to 500 MiB (LRU eviction)
purge!(dc; max_size_bytes = 500 * 1024 * 1024)

# Dry-run
purge!(dc; max_age = Day(7), dry_run = true)
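
For readers who want to see what LRU eviction by count or by total size amounts to, here is a small stdlib-only sketch. `EntryInfo`, `lru_evictions`, and `size_evictions` are hypothetical names: `dateaccessed` mirrors the `CacheEntry` field of the same name, while `nbytes` is an assumed per-entry size. The sketch only selects which entries would be evicted and says nothing about how `purge!` is implemented.

```julia
using Dates

# Hypothetical record type; `nbytes` is an assumed size field.
struct EntryInfo
    label::String
    dateaccessed::DateTime
    nbytes::Int
end

# Return the entries an LRU policy would evict: everything except the
# `keep_count` most recently accessed.
function lru_evictions(infos::Vector{EntryInfo}; keep_count::Integer)
    by_recency = sort(infos; by = e -> e.dateaccessed, rev = true)
    return by_recency[min(keep_count, length(by_recency)) + 1:end]
end

# Evict least recently accessed entries until the total size fits the budget.
function size_evictions(infos::Vector{EntryInfo}; max_size_bytes::Integer)
    oldest_first = sort(infos; by = e -> e.dateaccessed)
    total = sum(e -> e.nbytes, infos; init = 0)
    evicted = EntryInfo[]
    for e in oldest_first
        total <= max_size_bytes && break
        push!(evicted, e)
        total -= e.nbytes
    end
    return evicted
end

infos = [
    EntryInfo("a", DateTime(2026, 1, 1), 100),
    EntryInfo("b", DateTime(2026, 1, 3), 200),
    EntryInfo("c", DateTime(2026, 1, 2), 300),
]

dropped_by_count = lru_evictions(infos; keep_count = 2)
dropped_by_size = size_evictions(infos; max_size_bytes = 250)
```

Returning the eviction list without deleting anything corresponds to the dry-run usage shown above.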

Auto-purge policy

# Keep only the 50 most recently used entries — enforced automatically on write
set_autopurge!(dc; keep_count = 50)
write!(dc, new_result; label = "latest")   # purge fires here

# Combined: age + count, protect labeled entries
set_autopurge!(dc; max_age = Day(30), keep_count = 100, keep_labeled = true)

# Disable
set_autopurge!(dc; enabled = false)

About

This package addresses a general need for disk-based memoization and caching in contexts such as analytics, informatics, and software development, where identical database queries or computationally expensive functions are executed repeatedly and expected to return stable results between manual cache refreshes. It is broadly applicable, but its combination of flexible caching mechanisms and minimal syntactic overhead makes it particularly effective for a specific class of problems that existing tools do not handle well.

A primary use case arises in instructional settings (labs, workshops, and courses) where many users simultaneously issue repeated database queries, often overwhelming shared resources such as the database itself or available network bandwidth. By memoizing these calls and persisting results to disk, the package substantially reduces this load. In constrained environments with limited or unreliable connectivity, caches can be precomputed and distributed with course materials, allowing code to run with little to no modification. In automatic modes, the caching layer remains effectively invisible, preserving the clarity and integrity of the instructional code.

The design prioritizes lightweight, unobtrusive integration into REPL and script workflows, requiring no changes to program logic or structure. Cache storage is fully transparent and accessible both programmatically and via the file system, yet entirely optional—novice users can remain unaware of its existence. Caches persist across sessions and can be shared across systems by copying or archiving the underlying directory.
