
robstore

R bindings to the Rust object_store crate, providing a uniform interface to local filesystems and cloud object stores. Built with extendr.

The same small API (store_put, store_get, store_get_range, store_list, store_exists, store_delete, store_copy) works across every backend — switching between in-memory, local disk, AWS S3, S3-compatible endpoints like Pawsey or source.coop, Google Cloud Storage, and Azure Blob Storage is just a change of constructor.
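For example (a sketch, using only constructors from the Backends table below), the same calls run unchanged against an in-memory store and a local directory:

for (s in list(memory_store(), local_store(tempdir()))) {
  store_put(s, "demo.bin", as.raw(1:10))
  stopifnot(store_exists(s, "demo.bin"))
}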

For cloud workloads, store_get_many() and store_get_ranges_many() fan requests out concurrently through a shared tokio runtime, delivering 10–30× speedups over sequential per-request calls.

Installation

You will need a working Rust toolchain (see rustup.rs). Then:

# install.packages("remotes")
remotes::install_github("hypertidy/robstore")

Backends

Constructor                                        Use for
memory_store()                                     in-memory (tests, scratch)
local_store(path)                                  local filesystem rooted at path
s3_store(...)                                      AWS S3 or S3-compatible, with credentials
s3_store_anonymous(...)                            public S3 buckets (e.g. sentinel-cogs, source.coop)
gcs_store(bucket)                                  Google Cloud Storage with credentials (GOOGLE_APPLICATION_CREDENTIALS)
gcs_store_anonymous(bucket)                        public GCS buckets (e.g. gcp-public-data-arco-era5)
azure_store(account, container)                    Azure Blob Storage with env-var credentials
azure_store_sas(account, container, sas_token)     Azure with a SAS token (e.g. Microsoft Planetary Computer)
azure_store_anonymous(account, container)          Azure unsigned (limited — no anonymous listing)

Generic HTTP is planned.

A quick tour

In-memory

library(robstore)

s <- memory_store()
s
#> <MemoryStore>

store_put(s, "hello.txt", charToRaw("Hello, world!"))
rawToChar(store_get(s, "hello.txt"))
#> [1] "Hello, world!"

store_list(s, prefix = NULL)$key
#> [1] "hello.txt"

store_exists(s, "hello.txt")
#> [1] TRUE

# byte-range read
store_get_range(s, "hello.txt", offset = 7, length = 5) |> rawToChar()
#> [1] "world"

store_copy(s, "hello.txt", "bye.txt")
store_delete(s, "hello.txt")
store_list(s, prefix = NULL)
#>       key size       last_modified etag
#> 1 bye.txt   13 2026-04-26 19:21:15    1

Local filesystem

tmp <- tempfile("rs-"); dir.create(tmp)
ls <- local_store(tmp)
ls
#> <LocalStore(/tmp/rs-abc123)>

# nested keys work — intermediate directories are created automatically
store_put(ls, "a/b/c.txt", charToRaw("nested"))
list.files(tmp, recursive = TRUE)
#> [1] "a/b/c.txt"

rawToChar(store_get(ls, "a/b/c.txt"))
#> [1] "nested"

Public S3 (anonymous) — Sentinel-2 COGs

A Sentinel-2 L2A scene from the public Element 84 archive, listed anonymously:

s <- s3_store_anonymous(
  bucket    = "sentinel-cogs",
  region    = "us-west-2",
  endpoint  = NULL,
  allow_http = FALSE
)
s
#> <S3Store[anon](sentinel-cogs @ us-west-2)>

keys <- store_list(s, prefix = "sentinel-s2-l2a-cogs/1/C/CV/2024/1/")
nrow(keys)
#> [1] 63
head(keys$key, 3)
#> [1] "sentinel-s2-l2a-cogs/1/C/CV/2024/1/S2B_1CCV_20240106_0_L2A/AOT.tif"
#> [2] "sentinel-s2-l2a-cogs/1/C/CV/2024/1/S2B_1CCV_20240106_0_L2A/B01.tif"
#> [3] "sentinel-s2-l2a-cogs/1/C/CV/2024/1/S2B_1CCV_20240106_0_L2A/B02.tif"

S3-compatible with credentials — Pawsey

For credentialed S3-compatible services (Pawsey, MinIO, Backblaze etc.), point the endpoint argument at the provider and supply credentials via the usual AWS environment variables:

# credentials only needed for writes or private buckets
Sys.setenv(
  AWS_ACCESS_KEY_ID     = "<your-key-id>",
  AWS_SECRET_ACCESS_KEY = "<your-secret>"
)

s <- s3_store(
  bucket     = "estinel",
  region     = "",                                 # unused when endpoint is set
  endpoint   = "https://projects.pawsey.org.au",
  allow_http = FALSE
)

store_list(s, prefix = "sentinel-2-c1-l2a/2015")
#>                                                   key    size       last_modified                               etag
#> 1 sentinel-2-c1-l2a/2015/10/23/Hobart_2015-10-23.tif 1721446 2025-11-18 02:06:20 "5ead10131897c6e441582c3e7ee86706"
#> 2 sentinel-2-c1-l2a/2015/11/12/Hobart_2015-11-12.tif 1184485 2025-11-18 02:06:20 "3bd9597091ff283081d93f89a8eeb6e7"
#> 3 sentinel-2-c1-l2a/2015/11/22/Hobart_2015-11-22.tif 1749887 2025-11-18 02:06:21 "d1b0a8fc089faf18a5c5ee8b7a74d5aa"
#> 4 sentinel-2-c1-l2a/2015/12/19/Hobart_2015-12-19.tif   12070 2025-11-18 02:06:20 "f04501a5d853e62f05bda75aeab4cdc9"
#> ...

Parallel listing across prefixes

S3’s ListObjectsV2 is paginated (1000 keys per page) and each page depends on the previous one’s continuation token, so a single large listing is serial by protocol. For hierarchical layouts — years, MGRS tiles, product types — you can fan out across sub-prefixes with store_list_many() and turn one long serial listing into many short parallel ones.

Pawsey’s estinel bucket is public, so s3_store_anonymous() works against it too — no credentials needed:

s <- s3_store_anonymous(
  bucket     = "estinel",
  region     = "",
  endpoint   = "https://projects.pawsey.org.au",
  allow_http = FALSE
)

# serial — one paginated listing through the whole bucket
system.time({
  all_serial <- store_list(s, prefix = "sentinel-2-c1-l2a/")
})
#>    user  system elapsed
#>   2.441   0.657  43.132
nrow(all_serial)
#> [1] 552860

# parallel — one list call per year, fanned out
year_prefixes <- sprintf("sentinel-2-c1-l2a/%d/", 2015:2026)
system.time({
  all_parallel <- store_list_many(s, year_prefixes, concurrency = 12)
})
#>    user  system elapsed
#>   2.821   0.651   8.469
nrow(all_parallel)
#> [1] 552860

A 5× speedup over the serial listing, with the same 552,860 keys returned (store_list_many() does not preserve input order — sort the result if order matters).
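If downstream code expects a stable order, the fix is a single sort on the key column of the returned data frame:

all_parallel <- all_parallel[order(all_parallel$key), ]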

Multi-tenant buckets — source.coop

source.coop is a multi-tenant S3 gateway: one large public bucket (us-west-2.opendata.source.coop) with each data provider getting a top-level prefix. From robstore’s perspective it’s just an anonymous AWS S3 bucket — the bucket name has dots in it, but that’s the only unusual thing:

s <- s3_store_anonymous(
  bucket     = "us-west-2.opendata.source.coop",
  region     = "us-west-2",
  endpoint   = NULL,
  allow_http = FALSE
)
s
#> <S3Store[anon](us-west-2.opendata.source.coop @ us-west-2)>

store_list_delimited(s, "ausantarctic/ghrsst-mur-v2/")
#> $keys
#> [1] "ausantarctic/ghrsst-mur-v2/README.md"
#> [2] "ausantarctic/ghrsst-mur-v2/ghrsst-mur-v2.parquet"
#>
#> $common_prefixes
#>  [1] "ausantarctic/ghrsst-mur-v2/2002"
#>  [2] "ausantarctic/ghrsst-mur-v2/2003"
#>  ...
#> [25] "ausantarctic/ghrsst-mur-v2/2026"

# fan-out list across years — same pattern as Pawsey
year_prefixes <- sprintf("ausantarctic/ghrsst-mur-v2/%d/", 2002:2026)
system.time({
  all_keys <- store_list_many(s, year_prefixes, concurrency = 16)
})
#>    user  system elapsed
#>   0.297   0.175   2.266
nrow(all_keys)
#> [1] 46124

# filter to the TIFs
tif_keys <- grepv("\\.tif$", all_keys$key)
tail(tif_keys)
#> [1] "ausantarctic/ghrsst-mur-v2/2026/04/15/20260415090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1_sst_anomaly.tif"
#> ...

Concurrent byte-range reads

store_get_range() is fast — reqwest over rustls with connection pooling — but calling it in an R loop is still one request at a time. store_get_ranges_many() issues up to concurrency range-GET requests simultaneously through a shared tokio runtime, returning a list of raw vectors in input order. This is the primitive for COG IFD walks, Zarr chunk scatter-reads, and Parquet row-group reads.

s <- s3_store_anonymous("sentinel-cogs", "us-west-2", NULL, FALSE)
keys <- store_list(s, prefix = "sentinel-s2-l2a-cogs/1/C/CV/2024/1/")$key
length(keys)
#> [1] 63

# sequential — one GET at a time
system.time({
  heads_seq <- lapply(keys, \(k) store_get_range(s, k, 0, 65536))
})
#>    user  system elapsed
#>   0.026   0.065  19.333

# concurrent fan-out — 16 inflight
system.time({
  heads <- store_get_ranges_many(
    s,
    keys    = keys,
    offsets = rep(0, length(keys)),
    lengths = rep(65536, length(keys)),
    concurrency = 16
  )
})
#>    user  system elapsed
#>   0.036   0.044   1.625

# pushing concurrency higher
system.time({
  heads_32 <- store_get_ranges_many(
    s, keys,
    rep(0, length(keys)), rep(65536, length(keys)),
    concurrency = 32
  )
})
#>    user  system elapsed
#>   0.033   0.052   0.767

Note the user/system columns: essentially zero CPU work in R while Rust drives the reactor.

store_head_bytes() — convenience wrapper

For the common pattern of reading the first N bytes of many files (COG headers, Parquet footers, Zarr .zarray files), store_head_bytes() wraps store_get_ranges_many() with fixed offset and length:

hdrs <- store_head_bytes(s, keys, length = 65536, concurrency = 32)
length(hdrs)
#> [1] 63

# pass directly to your COG parser / IFD scanner of choice
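As an illustrative check (not a robstore function), a classic TIFF begins with a two-byte byte-order mark ("II" or "MM") followed by the magic number 42, so you can confirm each header came back intact before parsing:

is_tiff <- vapply(hdrs, function(h) {
  bom <- rawToChar(h[1:2])                  # "II" little-endian, "MM" big-endian
  (bom == "II" && h[3] == as.raw(42)) ||    # 0x2A 0x00
    (bom == "MM" && h[4] == as.raw(42))     # 0x00 0x2A
}, logical(1))
table(is_tiff)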

Scaling

At the time of writing, store_list() can return half a million keys from a single prefix call (tested against sentinel-s2-l2a-cogs/1/C/ which returned 543,889 keys). For concurrent byte-range reads at that scale, batch the keys in R rather than holding all the returned raw vectors in memory at once:

keys_huge <- store_list(s, prefix = "sentinel-s2-l2a-cogs/1/C/")
nrow(keys_huge)
#> [1] 543889

# process in chunks of 1000
batches <- split(keys_huge, ceiling(seq_along(keys_huge$key) / 1000))
for (batch in batches) {
  hdrs <- store_head_bytes(s, batch$key, length = 16384, concurrency = 64)
  # ... extract IFD offsets, write references, etc.
}

Public GCS — Landsat and ERA5

gcs_store_anonymous() opens a public Google Cloud Storage bucket with unsigned requests. Combined with store_list_delimited() it’s an efficient way to explore hierarchical archives like the Landsat mirror:

s <- gcs_store_anonymous("gcp-public-data-landsat")

store_list_delimited(s, NULL)
#> $keys
#> [1] "index.csv.gz"
#>
#> $common_prefixes
#>  [1] "LC08" "LE07" "LM01" "LM02" "LM03" "LM04" "LM05" "LO08" "LT04" "LT05" "LT08"

store_list_delimited(s, "LC08/01/")$common_prefixes |> length()
#> [1] 233    # WRS-2 path directories

store_list_delimited(s, "LC08/01/090/")$common_prefixes |> length()
#> [1] 67     # rows within path 090

For analysis-ready Zarr stores, the cloud-native pattern is to fetch the consolidated metadata and compute chunk keys rather than listing them. The ERA5 archive on GCS exposes .zmetadata as a single ~130 KB JSON file that describes the entire store:

era <- gcs_store_anonymous("gcp-public-data-arco-era5")

meta <- store_get(era, "ar/full_37-1h-0p25deg-chunk-1.zarr-v3/.zmetadata")
length(meta)
#> [1] 132785

cat(substring(rawToChar(meta), 1, 400))
#> {"metadata": {".zattrs": {"valid_time_start": "1940-01-01",
#>  "last_updated": "2026-04-17 02:54:09...",
#>  "valid_time_stop": "2025-12-31", ...},
#>  "100m_u_component_of_wind/.zarray": {
#>    "chunks": [1, 721, 1440],
#>    "compressor": {"cname": "lz4", "id": "blosc", ...},
#>    "dtype": "<f4",
#>    "shape": [1323648, 721, 1440], ...}}

That 130 KB JSON describes 85 years of hourly global reanalysis across hundreds of variables. The typical workflow is:

  1. store_get() the consolidated metadata (one request).
  2. Parse the JSON in R to learn shape, chunk grid, compressor, dtype.
  3. Compute exactly the chunk keys you need for your (variable, time, lat_range, lon_range) window.
  4. store_get_many() those chunks concurrently.

No listing of chunks — the metadata tells you the keys directly. robstore provides the concurrent byte-fetch primitives; a Zarr-aware layer (e.g. a future zaro integration) handles the codec pipeline and array assembly.
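A minimal sketch of steps 1–4 for the 100m_u_component_of_wind variable shown above, assuming jsonlite for the parsing, the Zarr v2 chunk-key convention ("variable/t.y.x") that goes with this store's .zmetadata, and the 1940-01-01 time origin from its .zattrs:

library(jsonlite)

root <- "ar/full_37-1h-0p25deg-chunk-1.zarr-v3"
meta <- fromJSON(rawToChar(store_get(era, paste0(root, "/.zmetadata"))))

za <- meta$metadata[["100m_u_component_of_wind/.zarray"]]
za$chunks      # chunks of [1, 721, 1440]: one hour of the full global grid each

# time indices are hours since 1940-01-01; fetch one day in 2020
t0 <- as.integer(difftime(as.POSIXct("2020-01-01", tz = "UTC"),
                          as.POSIXct("1940-01-01", tz = "UTC"), units = "hours"))
chunk_keys <- sprintf("%s/100m_u_component_of_wind/%d.0.0", root, t0 + 0:23)

chunks <- store_get_many(era, chunk_keys, concurrency = 16)
# each element is a raw vector of blosc/lz4-compressed <f4 bytes for one hour;
# decompression and array assembly are the Zarr layer's job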

Azure Blob Storage — Microsoft Planetary Computer

Azure’s auth model differs from S3 and GCS: anonymous listing is generally not permitted even on “public” containers. Most public Azure datasets — the Microsoft Planetary Computer collections in particular — are accessed via short-lived SAS tokens from a free token-minting endpoint:

https://planetarycomputer.microsoft.com/api/sas/v1/token/{account}/{container}

Fetch a token, hand it to azure_store_sas(), and listing and reads work the same as any other backend:

library(httr2)

# Free SAS token from Planetary Computer (~1 hour lifetime)
token <- request(
  "https://planetarycomputer.microsoft.com/api/sas/v1/token/sentinel1euwestrtc/sentinel1-grd-rtc"
) |>
  req_perform() |>
  resp_body_json()

s <- azure_store_sas(
  account   = "sentinel1euwestrtc",
  container = "sentinel1-grd-rtc",
  sas_token = token$token
)
s
#> <AzureStore[sas](sentinel1euwestrtc/sentinel1-grd-rtc)>

# Navigate the Sentinel-1 GRD layout with delimited listings
store_list_delimited(s, NULL)$common_prefixes
#> [1] "GRD"

store_list_delimited(s, "GRD/")$common_prefixes
#>  [1] "GRD/2014" "GRD/2015" "GRD/2016" "GRD/2017" "GRD/2018" "GRD/2019"
#>  [7] "GRD/2020" "GRD/2021" "GRD/2022" "GRD/2023" "GRD/2024" "GRD/2025"
#> [13] "GRD/2026"

store_list_delimited(s, "GRD/2024/")$common_prefixes
#>  [1] "GRD/2024/1"  "GRD/2024/10" "GRD/2024/11" "GRD/2024/12"
#>  [5] "GRD/2024/2"  "GRD/2024/3"  ...

For your own Azure storage accounts, set the usual environment variables and use azure_store():

Sys.setenv(
  AZURE_STORAGE_ACCOUNT_NAME = "myaccount",
  AZURE_STORAGE_ACCOUNT_KEY  = "..."
)
s <- azure_store("myaccount", "mycontainer")

azure_store_anonymous() exists for completeness but will fail on most containers — Azure does not generally permit unsigned listing on public data.

API

All operations take a Store object as their first argument.

Function                                                             Description
store_put(store, key, data)                                          write a raw vector
store_get(store, key)                                                read an object as raw
store_get_range(store, key, offset, length)                          read a byte range
store_exists(store, key)                                             TRUE/FALSE
store_list(store, prefix)                                            data frame of keys (key, size, last_modified, etag)
store_copy(store, from, to)                                          copy within the same store
store_delete(store, key)                                             remove an object
store_get_many(store, keys, concurrency)                             concurrent full-object reads
store_get_ranges_many(store, keys, offsets, lengths, concurrency)    concurrent byte-range reads
store_head_bytes(store, keys, length, offset, concurrency)           convenience wrapper for fixed-range reads
store_list_many(store, prefixes, concurrency)                        concurrent listings across many prefixes
store_list_delimited(store, prefix)                                  one-level hierarchical listing (keys + common prefixes)

Keys are strings; paths like "a/b/c.txt" are handled the same way across every backend.

Development notes

Local, in-memory, AWS S3, S3-compatible, GCS, and Azure backends are working and tested against:

  • AWS — sentinel-cogs (anonymous)
  • Pawsey — estinel (credentialed and anonymous)
  • source.coop — us-west-2.opendata.source.coop/ausantarctic/*
  • GCS — gcp-public-data-arco-era5, gcp-public-data-landsat
  • Azure — Microsoft Planetary Computer via SAS tokens

Planned:

  • generic HTTP backend
  • vendored Rust dependencies (rextendr::vendor_pkgs()) for CRAN / r-universe
  • integration with downstream packages (rustycogs for COG byte-range scanning, zaro for Zarr stores)

Related

  • object_store — the underlying Rust crate (Apache Arrow project)
  • obstore — Python bindings to the same crate (Development Seed)
  • extendr — the R↔Rust bridge used here

License

MIT
