feat: add WAL backfill for sled events and sync operations by jhult · Pull Request #21 · neul-labs/grite

jhult · 2026-04-02T02:07:07Z

Note: Stacked on top of #20 which needs to be merged first.

Adds write-ahead logging to ensure event durability alongside sled storage:

- Write events to WAL alongside sled inserts
- Auto-backfill WAL before push and in the sync handler
- `grite doctor --fix` backfills WAL from existing sled events
- Handle push with no grite refs and use concrete refspecs

Replace the nng-based IPC layer with tokio Unix sockets for concurrent request handling. Each client connection gets its own task, enabling true parallelism instead of nng's serial request-reply pattern. Key changes:
- Daemon: tokio TcpListener-style accept loop with per-connection async tasks
- IPC client: length-prefixed framing over UnixStream with configurable timeouts
- Lock-based daemon discovery via PID file
- Remove libc and nng dependencies
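For readers unfamiliar with length-prefixed framing, here is a minimal std-only sketch of the idea. The 4-byte big-endian header, the `MAX_FRAME_LEN` cap, and the function names `write_frame` / `read_frame` are assumptions for illustration; the PR does not state the actual wire format, and the real client runs over tokio's async `UnixStream` rather than blocking `Read`/`Write`.

```rust
use std::io::{self, Read, Write};

/// Hypothetical upper bound on one frame; not taken from the PR.
const MAX_FRAME_LEN: u32 = 16 * 1024 * 1024;

/// Write one frame: a 4-byte big-endian length, then the payload bytes.
fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    let len = payload.len() as u32;
    w.write_all(&len.to_be_bytes())?;
    w.write_all(payload)?;
    w.flush()
}

/// Read one frame written by `write_frame`.
/// Fails if the stream ends early or the header announces an oversized frame.
fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut header = [0u8; 4];
    r.read_exact(&mut header)?;
    let len = u32::from_be_bytes(header);
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame exceeds limit"));
    }
    let mut payload = vec![0u8; len as usize];
    r.read_exact(&mut payload)?;
    Ok(payload)
}
```

Because the reader always knows how many bytes belong to the current message, several requests can share one stream without delimiters or escaping.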

Update documentation and doc comments to reflect the switch from nng to Unix domain sockets.

Remove libnng-dev (apt), nng (brew), and nng (vcpkg) install steps from CI and release workflows. Remove NNG_LIB_DIR env var and Homebrew depends_on "nng" since the nng dependency was replaced with tokio Unix sockets.

Remove discovery protocol types (DaemonAnnounce, DaemonQuery, DiscoveryMessage) that are no longer needed with Unix socket discovery. Add a descriptive LockRace error variant for the lock-file acquisition race condition.

Periodically remove finished worker handles from the map to prevent unbounded growth. Verify a socket is truly stale (not owned by a live process) before removing it on startup. Check the global socket for liveness in the daemon status command.
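The handle cleanup can be pictured roughly as follows. This sketch uses `std::thread::JoinHandle` and a `String` key as stand-ins; the daemon's actual map type, key type, and task handles are not shown in the PR, and `prune_finished` is a hypothetical name.

```rust
use std::collections::HashMap;
use std::thread::JoinHandle;

/// Drop every entry whose worker has already exited and report how many
/// were removed. Dropping a finished JoinHandle only releases its bookkeeping.
fn prune_finished(workers: &mut HashMap<String, JoinHandle<()>>) -> usize {
    let before = workers.len();
    workers.retain(|_, handle| !handle.is_finished());
    before - workers.len()
}
```

Calling something like this on a timer keeps the map proportional to the number of live workers rather than to the number of workers ever started.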

Add daemon_id, pid, and started_ts fields to DaemonStatus for better observability and stale daemon detection.

Use spawn_blocking for command execution and worker creation to avoid blocking the tokio runtime's cooperative scheduler.

Limit concurrent connections to 256 using a tokio Semaphore. When the limit is reached, new connections are dropped immediately with a warning log rather than accumulating unbounded tasks.
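The drop-when-full behaviour amounts to a non-waiting permit acquire. The PR uses a tokio `Semaphore`; the std-only sketch below imitates that with an atomic counter so it runs without external crates. `ConnectionLimiter`, `Permit`, and `try_acquire` are hypothetical names, not the daemon's types.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Limit stated in the PR description.
const MAX_CONNECTIONS: usize = 256;

/// Counts live connections; a stand-in for a semaphore with `max` permits.
struct ConnectionLimiter {
    active: AtomicUsize,
    max: usize,
}

/// Held by a connection task; releases its slot when dropped.
struct Permit {
    limiter: Arc<ConnectionLimiter>,
}

impl ConnectionLimiter {
    fn new(max: usize) -> Arc<Self> {
        Arc::new(Self { active: AtomicUsize::new(0), max })
    }

    /// Never waits: returns None when the limit is reached, so the caller
    /// can log a warning and drop the new connection immediately.
    fn try_acquire(this: &Arc<Self>) -> Option<Permit> {
        let mut current = this.active.load(Ordering::Acquire);
        loop {
            if current >= this.max {
                return None;
            }
            match this.active.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(Permit { limiter: Arc::clone(this) }),
                Err(observed) => current = observed,
            }
        }
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.limiter.active.fetch_sub(1, Ordering::AcqRel);
    }
}
```

In an accept loop, the permit would be moved into the per-connection task so the slot is freed when that task ends, whether it finishes normally or fails.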

Mark the IPC client as poisoned after timeout or IO error to prevent reading stale data from the stream. Fix the retry loop so reconnection happens before send on each retry, not after failure.
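A rough sketch of the poison-then-reconnect ordering, under stated assumptions: `Transport`, `Client`, and `request` are hypothetical names, and the connection is abstracted behind a trait so the example runs without a socket. Here "poisoned" is modelled by discarding the connection (`conn = None`); the real client may track it with a flag instead.

```rust
use std::io;

/// Hypothetical abstraction over one request/response exchange.
trait Transport {
    fn round_trip(&mut self, request: &[u8]) -> io::Result<Vec<u8>>;
}

/// `conn == None` means "never connected" or "poisoned by an earlier error".
struct Client<T, F> {
    connect: F,
    conn: Option<T>,
}

impl<T: Transport, F: FnMut() -> io::Result<T>> Client<T, F> {
    fn new(connect: F) -> Self {
        Self { connect, conn: None }
    }

    fn request(&mut self, payload: &[u8], max_attempts: usize) -> io::Result<Vec<u8>> {
        let mut last_err = io::Error::new(io::ErrorKind::Other, "no attempts made");
        for _ in 0..max_attempts {
            // Reconnect before sending, so a poisoned stream is never reused.
            if self.conn.is_none() {
                match (self.connect)() {
                    Ok(conn) => self.conn = Some(conn),
                    Err(err) => {
                        last_err = err;
                        continue;
                    }
                }
            }
            let conn = self.conn.as_mut().expect("connection established above");
            match conn.round_trip(payload) {
                Ok(response) => return Ok(response),
                Err(err) => {
                    // Poison: part of a late response may still be in flight,
                    // so this stream must not serve the next request.
                    self.conn = None;
                    last_err = err;
                }
            }
        }
        Err(last_err)
    }
}
```

The point of discarding the stream is that after a timeout the next read could otherwise return the tail of the previous response and be mistaken for the reply to a new request.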

Replace scattered per-connection clones with a single Arc<DaemonState> shared across all connection tasks. Unify the shutdown path so run() owns all cleanup (socket removal, worker shutdown), eliminating races between signal handler and supervisor. Fix shutdown_workers deadlock by draining worker handles before sending shutdown messages. Prevent orphaned workers by dropping the listener and waiting for all semaphore permits before draining.

Replace .unwrap() with .unwrap_or_default() on SystemTime::duration_since(UNIX_EPOCH) in worker.rs and lock.rs, matching the pattern already used in supervisor.rs. Prevents a panic if the system clock is before the Unix epoch.
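The pattern in question looks roughly like this. `unix_millis` is a hypothetical helper name, and taking the time as a parameter is only to make the pre-epoch case easy to exercise; the commit's call sites presumably use `SystemTime::now()` directly.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Milliseconds since the Unix epoch, or 0 if `now` is before the epoch.
/// `duration_since` returns Err for pre-epoch clocks; `unwrap_or_default`
/// turns that into `Duration::ZERO` instead of panicking.
fn unix_millis(now: SystemTime) -> u64 {
    now.duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .as_millis() as u64
}
```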

Before provisioning a new actor, check whether a valid default actor already exists (repo config has default_actor and its directory contains a readable config). If so, skip creation and report the existing actor ID instead of silently overwriting it. A new `action` field in JSON output distinguishes "created" from "existing". AGENTS.md handling runs either way since it was already idempotent. To explicitly provision a fresh actor use `grite actor init` followed by `grite actor use <id>`.

Add repo_sled_path() returning .git/grite/sled - the single shared sled database location for the per-repo storage model. Export from the crate root alongside the existing actor_sled_path.
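As an illustration only, a helper of this shape could be as small as the following. The parameter (a path to the `.git` directory) and the return type are assumptions; the PR gives only the function name and the resulting location.

```rust
use std::path::{Path, PathBuf};

/// Shared per-repository sled location: <git_dir>/grite/sled.
/// The `git_dir` parameter is an assumed signature, not the crate's actual one.
fn repo_sled_path(git_dir: &Path) -> PathBuf {
    git_dir.join("grite").join("sled")
}
```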

- open_store / sled_path use repo_sled_path(.git/grite/sled) instead of per-actor data_dir/sled
- execution_mode reads DaemonLock from .git/grite/daemon.lock instead of the per-actor data directory
- daemon start/stop/status use the same repo-level lock path
- grite init creates the shared sled at .git/grite/sled

Worker is now per-repository instead of per-(repo, actor):
- Remove data_dir field; add grite_dir (.git/grite) for lock
- Compute sled path via repo_sled_path(.git/grite/sled)
- actor_id moved from fixed Worker field to per-command in WorkerMessage::Command, parsed to bytes inside the event loop
- DaemonLock acquire/refresh/release use grite_dir
- Supervisor WorkerKey drops actor_id (keyed by repo_root only)
- create_worker takes owner_actor_id for the lock record only

Add a new orphaned_actors check that scans all actor directories under .git/grite/actors/, identifies actors that are not the current default, and counts events in their sleds that are absent from the current store. With --fix, all missing events are copied into the current store and a rebuild is triggered to reapply projections in chronological order. This recovers issues that were created under an actor that was superseded (e.g. by accidentally running grite init again before it was made idempotent). The check is skipped when the daemon holds the store lock, since direct sled access is not possible in that state.

In the shared-sled model all per-actor sled directories are legacy, not just non-default actors. Rename check_orphaned_actors to check_legacy_actor_sleds, remove the actor_id != current filter (all actors' sleds are candidates), and update messages to reflect that events are merged into the shared store rather than another actor's store.

After merging events from legacy per-actor sleds into the shared store, --fix now automatically removes the legacy sled directories. Also cleans up sleds where all events were already in the shared store.

When running --fix, the daemon must be stopped first to ensure proper store access for integrity checks and repairs. The daemon is automatically restarted after fixes complete if it was running.
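The merge step of this check can be sketched as a set difference keyed by event ID. In-memory `BTreeMap`s stand in for the sled trees here, and `merge_missing_events` is a hypothetical name; the projection rebuild and the removal of legacy directories are deliberately left out.

```rust
use std::collections::BTreeMap;

/// Copy every event that exists in a legacy per-actor store but not in the
/// shared store, and return how many were copied. Existing entries in the
/// shared store are never overwritten.
fn merge_missing_events(
    shared: &mut BTreeMap<String, Vec<u8>>,
    legacy: &BTreeMap<String, Vec<u8>>,
) -> usize {
    let mut copied = 0;
    for (event_id, payload) in legacy {
        if !shared.contains_key(event_id) {
            shared.insert(event_id.clone(), payload.clone());
            copied += 1;
        }
    }
    copied
}
```

A return value of zero corresponds to the case described above where a legacy sled holds nothing new and can simply be cleaned up.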

storage.md: reflect new .git/grite/sled shared database, move daemon.lock to repo level, simplify actor dirs to identity-only, update multi-actor section and cleanup commands, add migration guide for per-actor sled legacy artifacts.

doctor.md: add legacy_actor_sleds check to the checks table and add a dedicated section describing the check, its warning conditions, and the --fix resolution.

Skip push when no refs/grite/* exist instead of erroring on the glob refspec.

The daemon worker was only persisting events to sled, never to the git WAL. This meant sync had nothing to push since no refs/grite/* refs were ever created. Add a persist_events helper that writes to both sled and WAL. WAL append is best-effort (logged warning on failure) so sled operations still succeed even if git is unavailable.
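The dual-write with a best-effort WAL might look something like this. Only the name `persist_events` comes from the PR; the `EventSink` trait, the string event type, the per-event write order, and the returned failure count are assumptions made to keep the sketch self-contained.

```rust
/// Hypothetical abstraction over "something events can be appended to".
trait EventSink {
    fn append(&mut self, event: &str) -> Result<(), String>;
}

/// Write each event to the primary store and then to the WAL.
/// A store failure aborts with an error; a WAL failure is only logged,
/// so the operation still succeeds when git is unavailable.
/// Returns the number of WAL appends that failed.
fn persist_events<S: EventSink, W: EventSink>(
    store: &mut S,
    wal: &mut W,
    events: &[&str],
) -> Result<usize, String> {
    let mut wal_failures = 0;
    for event in events {
        store.append(event)?;
        if let Err(err) = wal.append(event) {
            eprintln!("warning: WAL append failed: {}", err);
            wal_failures += 1;
        }
    }
    Ok(wal_failures)
}
```

The trade-off is that sled and the WAL can drift apart after a WAL failure, which is what the backfill described in the following commits repairs.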

Detect when sled has events but WAL is empty and auto-repair by replaying all events into the WAL. This recovers from the daemon bug where events were only written to sled.

When WAL is empty but sled has events, automatically backfill the WAL from sled before pushing. Also expose this as a doctor check with --fix for manual repair.
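The backfill condition described in the two commits above reduces to a small guard. In this sketch plain vectors stand in for sled and the git WAL, and `backfill_wal` is a hypothetical name.

```rust
/// If the WAL is empty but the store has events, replay all of them into
/// the WAL in store order. Returns the number of events replayed; a
/// non-empty WAL is left untouched.
fn backfill_wal(store_events: &[String], wal: &mut Vec<String>) -> usize {
    if !wal.is_empty() || store_events.is_empty() {
        return 0;
    }
    wal.extend(store_events.iter().cloned());
    store_events.len()
}
```

Restricting the repair to the empty-WAL case keeps it idempotent: running it again before every push, or from `grite doctor --fix`, does nothing once the WAL has content.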

libgit2 push() does not expand glob refspecs - it treats the * as a literal ref name. Enumerate matching refs and build concrete refspecs like refs/grite/wal:refs/grite/wal.
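Building the concrete refspecs is essentially a filter and a map over ref names. The sketch below takes the names as plain strings so it runs without a repository; in the real code they would come from enumerating the repository's refs. `concrete_refspecs` is a hypothetical name.

```rust
/// Turn every ref under refs/grite/ into an explicit "src:dst" refspec.
/// An empty result means there is nothing to push, matching the
/// "skip push when no refs/grite/* exist" behaviour.
fn concrete_refspecs<'a, I>(ref_names: I) -> Vec<String>
where
    I: IntoIterator<Item = &'a str>,
{
    ref_names
        .into_iter()
        .filter(|name| name.starts_with("refs/grite/"))
        .map(|name| format!("{0}:{0}", name))
        .collect()
}
```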

Add WAL backfill to the daemon's sync command path so that grite sync --push works even when the daemon holds the sled lock and the CLI can't open the store.

jhult added 23 commits March 30, 2026 16:40

docs: remove nng references

10bad1d

ci: remove nng system dependency from all workflows

239bde9

refactor(ipc): remove unused types and add LockRace error

b5773f2

feat(daemon): enrich DaemonStatus

e56664f

perf(daemon): offload blocking work from async runtime

3178805

feat(daemon): add connection backpressure via semaphore

8815101

fix(ipc): poison client and fix retry reconnection

fd95433

fix: use unwrap_or_default for SystemTime in all crates

6747d6c

feat(core): add repo_sled_path for shared sled model

e9555a6

fix(sync): handle push with no grite refs

3ff9040

fix(doctor): backfill WAL from sled events on --fix

bdf1770

feat(sync): auto-backfill WAL before push

3e6f5e0

fix(sync): use concrete refspecs instead of glob

0cb51d9

feat(daemon): auto-backfill WAL in sync handler

3b9c271

dipankar merged commit 7b5f5cf into neul-labs:main Apr 13, 2026
3 of 7 checks passed

jhult deleted the chore/git/wal branch April 13, 2026 22:22

dipankar merged 23 commits into neul-labs:main from jhult:chore/git/wal
