R bindings to the Rust
object_store crate, providing
a uniform interface to local filesystems and cloud object stores. Built
with extendr.
The same small API (store_put, store_get, store_get_range,
store_list, store_exists, store_delete, store_copy) works across
every backend — switching between in-memory, local disk, AWS S3,
S3-compatible endpoints like Pawsey or source.coop, Google Cloud
Storage, and Azure Blob Storage is just a change of constructor.
For cloud workloads, store_get_many() and store_get_ranges_many()
fan requests out concurrently through a shared tokio runtime, delivering
10–30× speedups over sequential per-request calls.
You will need a working Rust toolchain (see rustup.rs). Then:
# install.packages("remotes")
remotes::install_github("mdsumner/robstore")

| Constructor | Use for |
|---|---|
| `memory_store()` | in-memory (tests, scratch) |
| `local_store(path)` | local filesystem rooted at `path` |
| `s3_store(...)` | AWS S3 or S3-compatible with credentials |
| `s3_store_anonymous(...)` | public S3 buckets (e.g. sentinel-cogs, source.coop) |
| `gcs_store(bucket)` | Google Cloud Storage with credentials (`GOOGLE_APPLICATION_CREDENTIALS`) |
| `gcs_store_anonymous(bucket)` | public GCS buckets (e.g. gcp-public-data-arco-era5) |
| `azure_store(account, container)` | Azure Blob Storage with env-var credentials |
| `azure_store_sas(account, container, sas_token)` | Azure with a SAS token (e.g. Microsoft Planetary Computer) |
| `azure_store_anonymous(account, container)` | Azure unsigned (limited — no anonymous listing) |

Generic HTTP is planned.
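To make the uniform-interface point concrete, here is a minimal sketch that runs the same code against two backends; the `read_header()` helper is hypothetical, not a package function:

```r
# hypothetical helper: the same body works against any backend
read_header <- function(store, key) {
  store_get_range(store, key, offset = 0, length = 16)
}

m <- memory_store()
store_put(m, "x.bin", as.raw(rep(0:255, 4)))
read_header(m, "x.bin")   # in-memory

l <- local_store(tempdir())
store_put(l, "x.bin", as.raw(rep(0:255, 4)))
read_header(l, "x.bin")   # local disk; same call, different store
```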
library(robstore)
s <- memory_store()
s
#> <MemoryStore>
store_put(s, "hello.txt", charToRaw("Hello, world!"))
rawToChar(store_get(s, "hello.txt"))
#> [1] "Hello, world!"
store_list(s, prefix = NULL)
#> [1] "hello.txt"
store_exists(s, "hello.txt")
#> [1] TRUE
# byte-range read
store_get_range(s, "hello.txt", offset = 7, length = 5) |> rawToChar()
#> [1] "world"
store_copy(s, "hello.txt", "bye.txt")
store_delete(s, "hello.txt")
store_list(s, prefix = NULL)
#>       key size       last_modified etag
#> 1 bye.txt   13 2026-04-26 19:21:15    1

tmp <- tempfile("rs-"); dir.create(tmp)
ls <- local_store(tmp)
ls
#> <LocalStore(/tmp/rs-abc123)>
# nested keys work — intermediate directories are created automatically
store_put(ls, "a/b/c.txt", charToRaw("nested"))
list.files(tmp, recursive = TRUE)
#> [1] "a/b/c.txt"
rawToChar(store_get(ls, "a/b/c.txt"))
#> [1] "nested"A Sentinel-2 L2A scene from the public Element 84 archive, listed anonymously:
s <- s3_store_anonymous(
bucket = "sentinel-cogs",
region = "us-west-2",
endpoint = NULL,
allow_http = FALSE
)
s
#> <S3Store[anon](sentinel-cogs @ us-west-2)>
keys <- store_list(s, prefix = "sentinel-s2-l2a-cogs/1/C/CV/2024/1/")
nrow(keys)
#> [1] 63
head(keys$key, 3)
#> [1] "sentinel-s2-l2a-cogs/1/C/CV/2024/1/S2B_1CCV_20240106_0_L2A/AOT.tif"
#> [2] "sentinel-s2-l2a-cogs/1/C/CV/2024/1/S2B_1CCV_20240106_0_L2A/B01.tif"
#> [3] "sentinel-s2-l2a-cogs/1/C/CV/2024/1/S2B_1CCV_20240106_0_L2A/B02.tif"For credentialed S3-compatible services (Pawsey, MinIO, Backblaze etc.),
point the endpoint argument at the provider and supply credentials via
the usual AWS environment variables:
# credentials only needed for writes or private buckets
Sys.setenv(
AWS_ACCESS_KEY_ID = "<your-key-id>",
AWS_SECRET_ACCESS_KEY = "<your-secret>"
)
s <- s3_store(
bucket = "estinel",
region = "", # unused when endpoint is set
endpoint = "https://projects.pawsey.org.au",
allow_http = FALSE
)
store_list(s, prefix = "sentinel-2-c1-l2a/2015")
#>                                                   key    size       last_modified                               etag
#> 1 sentinel-2-c1-l2a/2015/10/23/Hobart_2015-10-23.tif 1721446 2025-11-18 02:06:20 "5ead10131897c6e441582c3e7ee86706"
#> 2 sentinel-2-c1-l2a/2015/11/12/Hobart_2015-11-12.tif 1184485 2025-11-18 02:06:20 "3bd9597091ff283081d93f89a8eeb6e7"
#> 3 sentinel-2-c1-l2a/2015/11/22/Hobart_2015-11-22.tif 1749887 2025-11-18 02:06:21 "d1b0a8fc089faf18a5c5ee8b7a74d5aa"
#> 4 sentinel-2-c1-l2a/2015/12/19/Hobart_2015-12-19.tif   12070 2025-11-18 02:06:20 "f04501a5d853e62f05bda75aeab4cdc9"
#> ...

S3’s ListObjectsV2 is paginated (1000 keys per page) and each page
depends on the previous one’s continuation token, so a single large
listing is serial by protocol. For hierarchical layouts — years, MGRS
tiles, product types — you can fan out across sub-prefixes with
store_list_many() and turn one long serial listing into many short
parallel ones.
Pawsey’s estinel bucket is public, so s3_store_anonymous() works
against it too — no credentials needed:
s <- s3_store_anonymous(
bucket = "estinel",
region = "",
endpoint = "https://projects.pawsey.org.au",
allow_http = FALSE
)
# serial — one paginated listing through the whole bucket
system.time({
all_serial <- store_list(s, prefix = "sentinel-2-c1-l2a/")
})
#> user system elapsed
#> 2.441 0.657 43.132
nrow(all_serial)
#> [1] 552860
# parallel — one list call per year, fanned out
year_prefixes <- sprintf("sentinel-2-c1-l2a/%d/", 2015:2026)
system.time({
all_parallel <- store_list_many(s, year_prefixes, concurrency = 12)
})
#> user system elapsed
#> 2.821 0.651 8.469
nrow(all_parallel)
#> [1] 552860

A 5× speedup over the serial listing, with the same 552,860 keys
returned (store_list_many() does not preserve input order — sort the
result if order matters).
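If downstream code needs a deterministic order, a one-line sort restores it (a sketch, using the key column shown in the listings above):

```r
# restore a stable ordering after the unordered concurrent fan-out
all_parallel <- all_parallel[order(all_parallel$key), ]
```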
source.coop is a multi-tenant S3 gateway: one
large public bucket (us-west-2.opendata.source.coop) with each data
provider getting a top-level prefix. From robstore’s perspective it’s
just an anonymous AWS S3 bucket — the bucket name has dots in it, but
that’s the only unusual thing:
s <- s3_store_anonymous(
bucket = "us-west-2.opendata.source.coop",
region = "us-west-2",
endpoint = NULL,
allow_http = FALSE
)
s
#> <S3Store[anon](us-west-2.opendata.source.coop @ us-west-2)>
store_list_delimited(s, "ausantarctic/ghrsst-mur-v2/")
#> $keys
#> [1] "ausantarctic/ghrsst-mur-v2/README.md"
#> [2] "ausantarctic/ghrsst-mur-v2/ghrsst-mur-v2.parquet"
#>
#> $common_prefixes
#> [1] "ausantarctic/ghrsst-mur-v2/2002"
#> [2] "ausantarctic/ghrsst-mur-v2/2003"
#> ...
#> [25] "ausantarctic/ghrsst-mur-v2/2026"
# fan-out list across years — same pattern as Pawsey
year_prefixes <- sprintf("ausantarctic/ghrsst-mur-v2/%d/", 2002:2026)
system.time({
all_keys <- store_list_many(s, year_prefixes, concurrency = 16)
})
#> user system elapsed
#> 0.297 0.175 2.266
nrow(all_keys)
#> [1] 46124
# filter to the TIFs (grepv() requires R >= 4.5; use grep(..., value = TRUE) on older R)
tif_keys <- grepv("\\.tif$", all_keys$key)
tail(tif_keys)
#> [1] "ausantarctic/ghrsst-mur-v2/2026/04/15/20260415090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1_sst_anomaly.tif"
#> ...

store_get_range() is fast — reqwest over rustls with connection
pooling — but calling it in an R loop is still one request at a time.
store_get_ranges_many() issues up to concurrency range-GET requests
simultaneously through a shared tokio runtime, returning a list of raw
vectors in input order. This is the primitive for COG IFD walks, Zarr
chunk scatter-reads, and Parquet row-group reads.
s <- s3_store_anonymous("sentinel-cogs", "us-west-2", NULL, FALSE)
keys <- store_list(s, prefix = "sentinel-s2-l2a-cogs/1/C/CV/2024/1/")$key
length(keys)
#> [1] 63
# sequential — one GET at a time
system.time({
heads_seq <- lapply(keys, \(k) store_get_range(s, k, 0, 65536))
})
#> user system elapsed
#> 0.026 0.065 19.333
# concurrent fan-out — 16 inflight
system.time({
heads <- store_get_ranges_many(
s,
keys = keys,
offsets = rep(0, length(keys)),
lengths = rep(65536, length(keys)),
concurrency = 16
)
})
#> user system elapsed
#> 0.036 0.044 1.625
# pushing concurrency higher
system.time({
heads_32 <- store_get_ranges_many(
s, keys,
rep(0, length(keys)), rep(65536, length(keys)),
concurrency = 32
)
})
#> user system elapsed
#> 0.033 0.052 0.767

Note the user/system columns: essentially zero CPU work in R while
Rust drives the reactor.
For the common pattern of reading the first N bytes of many files (COG
headers, Parquet footers, Zarr .zarray files), store_head_bytes()
wraps store_get_ranges_many() with fixed offset and length:
hdrs <- store_head_bytes(s, keys, length = 65536, concurrency = 32)
length(hdrs)
#> [1] 63
# pass directly to your COG parser / IFD scanner of choice
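As a quick sanity check on those header bytes, a sketch using only base R: every TIFF (COGs included) starts with a two-byte byte-order mark, so malformed responses stand out before any real parsing:

```r
# the first two bytes of a valid TIFF are "II" (little-endian) or "MM" (big-endian)
boms <- vapply(hdrs, function(h) rawToChar(h[1:2]), character(1))
stopifnot(all(boms %in% c("II", "MM")))
```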
At the time of writing, store_list() can return half a million keys
from a single prefix call (tested against sentinel-s2-l2a-cogs/1/C/
which returned 543,889 keys). For concurrent byte-range reads at that
scale, batch the keys in R rather than holding all the returned raw
vectors in memory at once:
keys_huge <- store_list(s, prefix = "sentinel-s2-l2a-cogs/1/C/")
nrow(keys_huge)
#> [1] 543889
# process in chunks of 1000
batches <- split(keys_huge, ceiling(seq_along(keys_huge$key) / 1000))
for (batch in batches) {
hdrs <- store_head_bytes(s, batch$key, length = 16384, concurrency = 64)
# ... extract IFD offsets, write references, etc.
}

gcs_store_anonymous() opens a public Google Cloud Storage bucket with
unsigned requests. Combined with store_list_delimited() it’s an
efficient way to explore hierarchical archives like the Landsat mirror:
s <- gcs_store_anonymous("gcp-public-data-landsat")
store_list_delimited(s, NULL)
#> $keys
#> [1] "index.csv.gz"
#>
#> $common_prefixes
#> [1] "LC08" "LE07" "LM01" "LM02" "LM03" "LM04" "LM05" "LO08" "LT04" "LT05" "LT08"
store_list_delimited(s, "LC08/01/")$common_prefixes |> length()
#> [1] 233 # WRS-2 path directories
store_list_delimited(s, "LC08/01/090/")$common_prefixes |> length()
#> [1] 67  # rows within path 090
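The keys/common_prefixes pair lends itself to a level-by-level walk; `walk_level()` below is an illustrative helper (it assumes prefixes come back without a trailing slash, as in the listings above), not part of the package:

```r
# hypothetical helper: expand one level of "directories" under each prefix
walk_level <- function(store, prefixes) {
  unlist(lapply(prefixes, function(p) {
    store_list_delimited(store, paste0(p, "/"))$common_prefixes
  }))
}

paths <- store_list_delimited(s, "LC08/01/")$common_prefixes
rows  <- walk_level(s, head(paths, 3))  # drill into the first three paths
```

For wide levels, the same prefixes can instead be fed to store_list_many() for a concurrent fan-out.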
For analysis-ready Zarr stores, the cloud-native pattern is to fetch the
consolidated metadata and compute chunk keys rather than listing them.
The ERA5 archive on GCS exposes .zmetadata as a single ~130 KB JSON
file that describes the entire store:
era <- gcs_store_anonymous("gcp-public-data-arco-era5")
meta <- store_get(era, "ar/full_37-1h-0p25deg-chunk-1.zarr-v3/.zmetadata")
length(meta)
#> [1] 132785
cat(substring(rawToChar(meta), 1, 400))
#> {"metadata": {".zattrs": {"valid_time_start": "1940-01-01",
#> "last_updated": "2026-04-17 02:54:09...",
#> "valid_time_stop": "2025-12-31", ...},
#> "100m_u_component_of_wind/.zarray": {
#> "chunks": [1, 721, 1440],
#> "compressor": {"cname": "lz4", "id": "blosc", ...},
#> "dtype": "<f4",
#> "shape": [1323648, 721, 1440], ...}}That 130 KB JSON describes 85 years of hourly global reanalysis across hundreds of variables. The typical workflow is:
store_get()the consolidated metadata (one request).- Parse the JSON in R to learn shape, chunk grid, compressor, dtype.
- Compute exactly the chunk keys you need for your
(variable, time, lat_range, lon_range)window. store_get_many()those chunks concurrently.
No listing of chunks — the metadata tells you the keys directly.
robstore provides the concurrent byte-fetch primitives; a Zarr-aware
layer (e.g. a future zaro integration) handles the codec pipeline and
array assembly.
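A sketch of that workflow for a single variable, assuming jsonlite for parsing and the Zarr v2 chunk-key convention `<variable>/<t>.<y>.<x>` (decompression is deliberately left out):

```r
library(jsonlite)

z   <- fromJSON(rawToChar(meta))$metadata
arr <- z[["100m_u_component_of_wind/.zarray"]]
arr$shape   # 1323648 721 1440  (time, lat, lon)
arr$chunks  # 1 721 1440: one whole global grid per hourly step

# with one chunk per time step and a single chunk in lat/lon, keys take
# the form "<variable>/<t>.0.0" under the v2 convention assumed above
hours <- 0:23  # e.g. the first 24 hourly steps
chunk_keys <- sprintf(
  "ar/full_37-1h-0p25deg-chunk-1.zarr-v3/100m_u_component_of_wind/%d.0.0",
  hours
)
raw_chunks <- store_get_many(era, chunk_keys, concurrency = 12)
# each element is a blosc/lz4-compressed raw vector; codec handling and
# array assembly belong to a Zarr-aware layer
```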
Azure’s auth model differs from S3 and GCS: anonymous listing is generally not permitted even on “public” containers. Most public Azure datasets — the Microsoft Planetary Computer collections in particular — are accessed via short-lived SAS tokens from a free token-minting endpoint:
https://planetarycomputer.microsoft.com/api/sas/v1/token/{account}/{container}
Fetch a token, hand it to azure_store_sas(), and listing and reads
work the same as any other backend:
library(httr2)
# Free SAS token from Planetary Computer (~1 hour lifetime)
token <- request(
"https://planetarycomputer.microsoft.com/api/sas/v1/token/sentinel1euwestrtc/sentinel1-grd-rtc"
) |>
req_perform() |>
resp_body_json()
s <- azure_store_sas(
account = "sentinel1euwestrtc",
container = "sentinel1-grd-rtc",
sas_token = token$token
)
s
#> <AzureStore[sas](sentinel1euwestrtc/sentinel1-grd-rtc)>
# Navigate the Sentinel-1 GRD layout with delimited listings
store_list_delimited(s, NULL)$common_prefixes
#> [1] "GRD"
store_list_delimited(s, "GRD/")$common_prefixes
#> [1] "GRD/2014" "GRD/2015" "GRD/2016" "GRD/2017" "GRD/2018" "GRD/2019"
#> [7] "GRD/2020" "GRD/2021" "GRD/2022" "GRD/2023" "GRD/2024" "GRD/2025"
#> [13] "GRD/2026"
store_list_delimited(s, "GRD/2024/")$common_prefixes
#> [1] "GRD/2024/1" "GRD/2024/10" "GRD/2024/11" "GRD/2024/12"
#> [5] "GRD/2024/2" "GRD/2024/3" ...For your own Azure storage accounts, set the usual environment variables
For your own Azure storage accounts, set the usual environment variables
and use azure_store():
Sys.setenv(
AZURE_STORAGE_ACCOUNT_NAME = "myaccount",
AZURE_STORAGE_ACCOUNT_KEY = "..."
)
s <- azure_store("myaccount", "mycontainer")azure_store_anonymous() exists for completeness but will fail on most
containers — Azure does not generally permit unsigned listing on public
data.
All operations take a Store object as their first argument.
| Function | Description |
|---|---|
| `store_put(store, key, data)` | write a raw vector |
| `store_get(store, key)` | read an object as raw |
| `store_get_range(store, key, offset, length)` | read a byte range |
| `store_exists(store, key)` | `TRUE`/`FALSE` |
| `store_list(store, prefix)` | data frame of keys with size, last_modified, and etag |
| `store_copy(store, from, to)` | copy within the same store |
| `store_delete(store, key)` | remove an object |
| `store_get_many(store, keys, concurrency)` | concurrent full-object reads |
| `store_get_ranges_many(store, keys, offsets, lengths, concurrency)` | concurrent byte-range reads |
| `store_head_bytes(store, keys, length, offset, concurrency)` | convenience wrapper for fixed-range reads |
| `store_list_many(store, prefixes, concurrency)` | concurrent listings across many prefixes |
| `store_list_delimited(store, prefix)` | one-level hierarchical listing (keys + common prefixes) |
Keys are strings; paths like "a/b/c.txt" are handled the same way
across every backend.
Local, in-memory, AWS S3, S3-compatible, GCS, and Azure backends are working and tested against:
- AWS — `sentinel-cogs` (anonymous)
- Pawsey — `estinel` (credentialed and anonymous)
- source.coop — `us-west-2.opendata.source.coop/ausantarctic/*`
- GCS — `gcp-public-data-arco-era5`, `gcp-public-data-landsat`
- Azure — Microsoft Planetary Computer via SAS tokens
Planned:
- generic HTTP backend
- vendored Rust dependencies (`rextendr::vendor_pkgs()`) for CRAN / r-universe
- integration with downstream packages (`rustycogs` for COG byte-range
  scanning, `zaro` for Zarr stores)
- `object_store` — the underlying Rust crate (Apache Arrow project)
- `obstore` — Python bindings to the same crate (Development Seed)
- `extendr` — the R↔Rust bridge used here
MIT