
robstore

R bindings to the Rust object_store crate, providing a uniform interface to local filesystems and cloud object stores. Built with extendr.

The same small API (store_put, store_get, store_get_range, store_list, store_exists, store_delete, store_copy) works across every backend — switching between in-memory, local disk, AWS S3, S3-compatible endpoints like Pawsey or source.coop, Google Cloud Storage, and Azure Blob Storage is just a change of constructor.
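For example (a sketch, using only constructors from the Backends table below), the same calls run unchanged against an in-memory store and a local directory:

for (s in list(memory_store(), local_store(tempdir()))) {
  store_put(s, "demo.bin", as.raw(1:10))
  stopifnot(store_exists(s, "demo.bin"))
}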

For cloud workloads, store_get_many() and store_get_ranges_many() fan requests out concurrently through a shared tokio runtime, delivering 10–30× speedups over sequential per-request calls.

Installation

You will need a working Rust toolchain (see rustup.rs). Then:

# install.packages("remotes")
remotes::install_github("hypertidy/robstore")

Backends

Constructor                                        Use for
memory_store()                                     in-memory (tests, scratch)
local_store(path)                                  local filesystem rooted at path
s3_store(...)                                      AWS S3 or S3-compatible, with credentials
s3_store_anonymous(...)                            public S3 buckets (e.g. sentinel-cogs, source.coop)
gcs_store(bucket)                                  Google Cloud Storage with credentials (GOOGLE_APPLICATION_CREDENTIALS)
gcs_store_anonymous(bucket)                        public GCS buckets (e.g. gcp-public-data-arco-era5)
azure_store(account, container)                    Azure Blob Storage with env-var credentials
azure_store_sas(account, container, sas_token)     Azure with a SAS token (e.g. Microsoft Planetary Computer)
azure_store_anonymous(account, container)          Azure unsigned (limited — no anonymous listing)

Generic HTTP is planned.

A quick tour

In-memory

library(robstore)

s <- memory_store()
s
#> <MemoryStore>

store_put(s, "hello.txt", charToRaw("Hello, world!"))
rawToChar(store_get(s, "hello.txt"))
#> [1] "Hello, world!"

store_list(s, prefix = NULL)$key
#> [1] "hello.txt"

store_exists(s, "hello.txt")
#> [1] TRUE

# byte-range read
store_get_range(s, "hello.txt", offset = 7, length = 5) |> rawToChar()
#> [1] "world"

store_copy(s, "hello.txt", "bye.txt")
store_delete(s, "hello.txt")
store_list(s, prefix = NULL)
#>       key size       last_modified etag
#> 1 bye.txt   13 2026-04-26 19:21:15    1

Local filesystem

tmp <- tempfile("rs-"); dir.create(tmp)
ls <- local_store(tmp)
ls
#> <LocalStore(/tmp/rs-abc123)>

# nested keys work — intermediate directories are created automatically
store_put(ls, "a/b/c.txt", charToRaw("nested"))
list.files(tmp, recursive = TRUE)
#> [1] "a/b/c.txt"

rawToChar(store_get(ls, "a/b/c.txt"))
#> [1] "nested"

Public S3 (anonymous) — Sentinel-2 COGs

A Sentinel-2 L2A scene from the public Element 84 archive, listed anonymously:

s <- s3_store_anonymous(
  bucket    = "sentinel-cogs",
  region    = "us-west-2",
  endpoint  = NULL,
  allow_http = FALSE
)
s
#> <S3Store[anon](sentinel-cogs @ us-west-2)>

keys <- store_list(s, prefix = "sentinel-s2-l2a-cogs/1/C/CV/2024/1/")
nrow(keys)
#> [1] 63
head(keys$key, 3)
#> [1] "sentinel-s2-l2a-cogs/1/C/CV/2024/1/S2B_1CCV_20240106_0_L2A/AOT.tif"
#> [2] "sentinel-s2-l2a-cogs/1/C/CV/2024/1/S2B_1CCV_20240106_0_L2A/B01.tif"
#> [3] "sentinel-s2-l2a-cogs/1/C/CV/2024/1/S2B_1CCV_20240106_0_L2A/B02.tif"

S3-compatible with credentials — Pawsey

For credentialed S3-compatible services (Pawsey, MinIO, Backblaze etc.), point the endpoint argument at the provider and supply credentials via the usual AWS environment variables:

# credentials only needed for writes or private buckets
Sys.setenv(
  AWS_ACCESS_KEY_ID     = "<your-key-id>",
  AWS_SECRET_ACCESS_KEY = "<your-secret>"
)

s <- s3_store(
  bucket     = "estinel",
  region     = "",                                 # unused when endpoint is set
  endpoint   = "https://projects.pawsey.org.au",
  allow_http = FALSE
)

store_list(s, prefix = "sentinel-2-c1-l2a/2015")
#>                                                   key    size       last_modified                               etag
#> 1 sentinel-2-c1-l2a/2015/10/23/Hobart_2015-10-23.tif 1721446 2025-11-18 02:06:20 "5ead10131897c6e441582c3e7ee86706"
#> 2 sentinel-2-c1-l2a/2015/11/12/Hobart_2015-11-12.tif 1184485 2025-11-18 02:06:20 "3bd9597091ff283081d93f89a8eeb6e7"
#> 3 sentinel-2-c1-l2a/2015/11/22/Hobart_2015-11-22.tif 1749887 2025-11-18 02:06:21 "d1b0a8fc089faf18a5c5ee8b7a74d5aa"
#> 4 sentinel-2-c1-l2a/2015/12/19/Hobart_2015-12-19.tif   12070 2025-11-18 02:06:20 "f04501a5d853e62f05bda75aeab4cdc9"
#> ...

Parallel listing across prefixes

S3’s ListObjectsV2 is paginated (1000 keys per page) and each page depends on the previous one’s continuation token, so a single large listing is serial by protocol. For hierarchical layouts — years, MGRS tiles, product types — you can fan out across sub-prefixes with store_list_many() and turn one long serial listing into many short parallel ones.

Pawsey’s estinel bucket is public, so s3_store_anonymous() works against it too — no credentials needed:

s <- s3_store_anonymous(
  bucket     = "estinel",
  region     = "",
  endpoint   = "https://projects.pawsey.org.au",
  allow_http = FALSE
)

# serial — one paginated listing through the whole bucket
system.time({
  all_serial <- store_list(s, prefix = "sentinel-2-c1-l2a/")
})
#>    user  system elapsed
#>   2.441   0.657  43.132
nrow(all_serial)
#> [1] 552860

# parallel — one list call per year, fanned out
year_prefixes <- sprintf("sentinel-2-c1-l2a/%d/", 2015:2026)
system.time({
  all_parallel <- store_list_many(s, year_prefixes, concurrency = 12)
})
#>    user  system elapsed
#>   2.821   0.651   8.469
nrow(all_parallel)
#> [1] 552860

A 5× speedup over the serial listing, with the same 552,860 keys returned (store_list_many() does not preserve input order — sort the result if order matters).
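If downstream code expects a stable order, the fix is a single sort on the key column of the returned data frame:

all_parallel <- all_parallel[order(all_parallel$key), ]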

Multi-tenant buckets — source.coop

source.coop is a multi-tenant S3 gateway: one large public bucket (us-west-2.opendata.source.coop) with each data provider getting a top-level prefix. From robstore’s perspective it’s just an anonymous AWS S3 bucket — the bucket name has dots in it, but that’s the only unusual thing:

s <- s3_store_anonymous(
  bucket     = "us-west-2.opendata.source.coop",
  region     = "us-west-2",
  endpoint   = NULL,
  allow_http = FALSE
)
s
#> <S3Store[anon](us-west-2.opendata.source.coop @ us-west-2)>

store_list_delimited(s, "ausantarctic/ghrsst-mur-v2/")
#> $keys
#> [1] "ausantarctic/ghrsst-mur-v2/README.md"
#> [2] "ausantarctic/ghrsst-mur-v2/ghrsst-mur-v2.parquet"
#>
#> $common_prefixes
#>  [1] "ausantarctic/ghrsst-mur-v2/2002"
#>  [2] "ausantarctic/ghrsst-mur-v2/2003"
#>  ...
#> [25] "ausantarctic/ghrsst-mur-v2/2026"

# fan-out list across years — same pattern as Pawsey
year_prefixes <- sprintf("ausantarctic/ghrsst-mur-v2/%d/", 2002:2026)
system.time({
  all_keys <- store_list_many(s, year_prefixes, concurrency = 16)
})
#>    user  system elapsed
#>   0.297   0.175   2.266
nrow(all_keys)
#> [1] 46124

# filter to the TIFs
tif_keys <- grepv("\\.tif$", all_keys$key)
tail(tif_keys)
#> [1] "ausantarctic/ghrsst-mur-v2/2026/04/15/20260415090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1_sst_anomaly.tif"
#> ...

Concurrent byte-range reads

store_get_range() is fast — reqwest over rustls with connection pooling — but calling it in an R loop is still one request at a time. store_get_ranges_many() issues up to concurrency range-GET requests simultaneously through a shared tokio runtime, returning a list of raw vectors in input order. This is the primitive for COG IFD walks, Zarr chunk scatter-reads, and Parquet row-group reads.

s <- s3_store_anonymous("sentinel-cogs", "us-west-2", NULL, FALSE)
keys <- store_list(s, prefix = "sentinel-s2-l2a-cogs/1/C/CV/2024/1/")$key
length(keys)
#> [1] 63

# sequential — one GET at a time
system.time({
  heads_seq <- lapply(keys, \(k) store_get_range(s, k, 0, 65536))
})
#>    user  system elapsed
#>   0.026   0.065  19.333

# concurrent fan-out — 16 inflight
system.time({
  heads <- store_get_ranges_many(
    s,
    keys    = keys,
    offsets = rep(0, length(keys)),
    lengths = rep(65536, length(keys)),
    concurrency = 16
  )
})
#>    user  system elapsed
#>   0.036   0.044   1.625

# pushing concurrency higher
system.time({
  heads_32 <- store_get_ranges_many(
    s, keys,
    rep(0, length(keys)), rep(65536, length(keys)),
    concurrency = 32
  )
})
#>    user  system elapsed
#>   0.033   0.052   0.767

Note the user/system columns: essentially zero CPU work in R while Rust drives the reactor.

store_head_bytes() — convenience wrapper

For the common pattern of reading the first N bytes of many files (COG headers, Parquet footers, Zarr .zarray files), store_head_bytes() wraps store_get_ranges_many() with fixed offset and length:

hdrs <- store_head_bytes(s, keys, length = 65536, concurrency = 32)
length(hdrs)
#> [1] 63

# pass directly to your COG parser / IFD scanner of choice
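As an illustrative check (not a robstore function), a classic TIFF begins with a two-byte byte-order mark ("II" or "MM") followed by the magic number 42, so you can confirm each header came back intact before parsing:

is_tiff <- vapply(hdrs, function(h) {
  bom <- rawToChar(h[1:2])                  # "II" little-endian, "MM" big-endian
  (bom == "II" && h[3] == as.raw(42)) ||    # 0x2A 0x00
    (bom == "MM" && h[4] == as.raw(42))     # 0x00 0x2A
}, logical(1))
table(is_tiff)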

Scaling

At the time of writing, store_list() can return half a million keys from a single prefix call (tested against sentinel-s2-l2a-cogs/1/C/ which returned 543,889 keys). For concurrent byte-range reads at that scale, batch the keys in R rather than holding all the returned raw vectors in memory at once:

keys_huge <- store_list(s, prefix = "sentinel-s2-l2a-cogs/1/C/")
nrow(keys_huge)
#> [1] 543889

# process in chunks of 1000
batches <- split(keys_huge, ceiling(seq_along(keys_huge$key) / 1000))
for (batch in batches) {
  hdrs <- store_head_bytes(s, batch$key, length = 16384, concurrency = 64)
  # ... extract IFD offsets, write references, etc.
}

Public GCS — Landsat and ERA5

gcs_store_anonymous() opens a public Google Cloud Storage bucket with unsigned requests. Combined with store_list_delimited() it’s an efficient way to explore hierarchical archives like the Landsat mirror:

s <- gcs_store_anonymous("gcp-public-data-landsat")

store_list_delimited(s, NULL)
#> $keys
#> [1] "index.csv.gz"
#>
#> $common_prefixes
#>  [1] "LC08" "LE07" "LM01" "LM02" "LM03" "LM04" "LM05" "LO08" "LT04" "LT05" "LT08"

store_list_delimited(s, "LC08/01/")$common_prefixes |> length()
#> [1] 233    # WRS-2 path directories

store_list_delimited(s, "LC08/01/090/")$common_prefixes |> length()
#> [1] 67     # rows within path 090

For analysis-ready Zarr stores, the cloud-native pattern is to fetch the consolidated metadata and compute chunk keys rather than listing them. The ERA5 archive on GCS exposes .zmetadata as a single ~130 KB JSON file that describes the entire store:

era <- gcs_store_anonymous("gcp-public-data-arco-era5")

meta <- store_get(era, "ar/full_37-1h-0p25deg-chunk-1.zarr-v3/.zmetadata")
length(meta)
#> [1] 132785

cat(substring(rawToChar(meta), 1, 400))
#> {"metadata": {".zattrs": {"valid_time_start": "1940-01-01",
#>  "last_updated": "2026-04-17 02:54:09...",
#>  "valid_time_stop": "2025-12-31", ...},
#>  "100m_u_component_of_wind/.zarray": {
#>    "chunks": [1, 721, 1440],
#>    "compressor": {"cname": "lz4", "id": "blosc", ...},
#>    "dtype": "<f4",
#>    "shape": [1323648, 721, 1440], ...}}

That 130 KB JSON describes 85 years of hourly global reanalysis across hundreds of variables. The typical workflow is:

  1. store_get() the consolidated metadata (one request).
  2. Parse the JSON in R to learn shape, chunk grid, compressor, dtype.
  3. Compute exactly the chunk keys you need for your (variable, time, lat_range, lon_range) window.
  4. store_get_many() those chunks concurrently.

No listing of chunks — the metadata tells you the keys directly. robstore provides the concurrent byte-fetch primitives; a Zarr-aware layer (e.g. a future zaro integration) handles the codec pipeline and array assembly.
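A minimal sketch of steps 1–4 for the 100m_u_component_of_wind variable shown above, assuming jsonlite for the parsing, the Zarr v2 chunk-key convention ("variable/t.y.x") that goes with this store's .zmetadata, and the 1940-01-01 time origin from its .zattrs:

library(jsonlite)

root <- "ar/full_37-1h-0p25deg-chunk-1.zarr-v3"
meta <- fromJSON(rawToChar(store_get(era, paste0(root, "/.zmetadata"))))

za <- meta$metadata[["100m_u_component_of_wind/.zarray"]]
za$chunks      # chunks of [1, 721, 1440]: one hour of the full global grid each

# time indices are hours since 1940-01-01; fetch one day in 2020
t0 <- as.integer(difftime(as.POSIXct("2020-01-01", tz = "UTC"),
                          as.POSIXct("1940-01-01", tz = "UTC"), units = "hours"))
chunk_keys <- sprintf("%s/100m_u_component_of_wind/%d.0.0", root, t0 + 0:23)

chunks <- store_get_many(era, chunk_keys, concurrency = 16)
# each element is a raw vector of blosc/lz4-compressed <f4 bytes for one hour;
# decompression and array assembly are the Zarr layer's job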

Azure Blob Storage — Microsoft Planetary Computer

Azure’s auth model differs from S3 and GCS: anonymous listing is generally not permitted even on “public” containers. Most public Azure datasets — the Microsoft Planetary Computer collections in particular — are accessed via short-lived SAS tokens from a free token-minting endpoint:

https://planetarycomputer.microsoft.com/api/sas/v1/token/{account}/{container}

Fetch a token, hand it to azure_store_sas(), and listing and reads work the same as any other backend:

library(httr2)

# Free SAS token from Planetary Computer (~1 hour lifetime)
token <- request(
  "https://planetarycomputer.microsoft.com/api/sas/v1/token/sentinel1euwestrtc/sentinel1-grd-rtc"
) |>
  req_perform() |>
  resp_body_json()

s <- azure_store_sas(
  account   = "sentinel1euwestrtc",
  container = "sentinel1-grd-rtc",
  sas_token = token$token
)
s
#> <AzureStore[sas](sentinel1euwestrtc/sentinel1-grd-rtc)>

# Navigate the Sentinel-1 GRD layout with delimited listings
store_list_delimited(s, NULL)$common_prefixes
#> [1] "GRD"

store_list_delimited(s, "GRD/")$common_prefixes
#>  [1] "GRD/2014" "GRD/2015" "GRD/2016" "GRD/2017" "GRD/2018" "GRD/2019"
#>  [7] "GRD/2020" "GRD/2021" "GRD/2022" "GRD/2023" "GRD/2024" "GRD/2025"
#> [13] "GRD/2026"

store_list_delimited(s, "GRD/2024/")$common_prefixes
#>  [1] "GRD/2024/1"  "GRD/2024/10" "GRD/2024/11" "GRD/2024/12"
#>  [5] "GRD/2024/2"  "GRD/2024/3"  ...

For your own Azure storage accounts, set the usual environment variables and use azure_store():

Sys.setenv(
  AZURE_STORAGE_ACCOUNT_NAME = "myaccount",
  AZURE_STORAGE_ACCOUNT_KEY  = "..."
)
s <- azure_store("myaccount", "mycontainer")

azure_store_anonymous() exists for completeness but will fail on most containers — Azure does not generally permit unsigned listing on public data.

API

All operations take a Store object as their first argument.

Function                                                             Description
store_put(store, key, data)                                          write a raw vector
store_get(store, key)                                                read an object as raw
store_get_range(store, key, offset, length)                          read a byte range
store_exists(store, key)                                             TRUE/FALSE
store_list(store, prefix)                                            data frame of keys (key, size, last_modified, etag)
store_copy(store, from, to)                                          copy within the same store
store_delete(store, key)                                             remove an object
store_get_many(store, keys, concurrency)                             concurrent full-object reads
store_get_ranges_many(store, keys, offsets, lengths, concurrency)    concurrent byte-range reads
store_head_bytes(store, keys, length, offset, concurrency)           convenience wrapper for fixed-range reads
store_list_many(store, prefixes, concurrency)                        concurrent listings across many prefixes
store_list_delimited(store, prefix)                                  one-level hierarchical listing (keys + common prefixes)

Keys are strings; paths like "a/b/c.txt" are handled the same way across every backend.

Development notes

Local, in-memory, AWS S3, S3-compatible, GCS, and Azure backends are working and tested against:

  • AWS — sentinel-cogs (anonymous)
  • Pawsey — estinel (credentialed and anonymous)
  • source.coop — us-west-2.opendata.source.coop/ausantarctic/*
  • GCS — gcp-public-data-arco-era5, gcp-public-data-landsat
  • Azure — Microsoft Planetary Computer via SAS tokens

Planned:

  • generic HTTP backend
  • vendored Rust dependencies (rextendr::vendor_pkgs()) for CRAN / r-universe
  • integration with downstream packages (rustycogs for COG byte-range scanning, zaro for Zarr stores)

Related

  • object_store — the underlying Rust crate (Apache Arrow project)
  • obstore — Python bindings to the same crate (Development Seed)
  • extendr — the R↔Rust bridge used here

License

MIT
