feat: add WAL backfill for sled events and sync operations #21

Merged
dipankar merged 23 commits into neul-labs:main from jhult:chore/git/wal
Apr 13, 2026


@jhult jhult commented Apr 2, 2026

Note: Stacked on top of #20, which needs to be merged first.

Adds write-ahead logging to ensure event durability alongside sled storage:

  • Write events to WAL alongside sled inserts
  • Auto-backfill WAL before push and in the sync handler
  • grite doctor --fix backfills WAL from existing sled events
  • Handle push with no grite refs and use concrete refspecs

jhult added 23 commits March 30, 2026 16:40
Replace the nng-based IPC layer with tokio Unix sockets for concurrent request handling. Each client connection gets its own task, enabling true parallelism instead of nng's serial request-reply pattern.

Key changes:
- Daemon: tokio TcpListener-style accept loop with per-connection async tasks
- IPC client: length-prefixed framing over UnixStream with configurable timeouts
- Lock-based daemon discovery via PID file
- Remove libc and nng dependencies
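The length-prefixed framing mentioned in the list above can be sketched as follows. This is a minimal illustration, not the PR's code: the 4-byte big-endian prefix, the `MAX_FRAME_LEN` cap, and the function names `write_frame` and `read_frame` are all assumptions. The functions are generic over `Read`/`Write`, so the same logic would apply to a Unix stream.

```rust
use std::io::{self, Read, Write};

/// Upper bound on a single frame; a hypothetical limit chosen for this sketch.
const MAX_FRAME_LEN: usize = 16 * 1024 * 1024;

/// Write one frame: a 4-byte big-endian length followed by the payload.
/// (The prefix width and byte order are assumptions, not taken from the PR.)
fn write_frame<W: Write>(writer: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    let len = payload.len() as u32; // safe: bounded by MAX_FRAME_LEN above
    writer.write_all(&len.to_be_bytes())?;
    writer.write_all(payload)?;
    writer.flush()
}

/// Read one frame produced by `write_frame`.
fn read_frame<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    reader.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    // Reject oversized lengths before allocating, so a corrupt prefix cannot
    // trigger a huge allocation.
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}
```

Because the reader always knows how many bytes to expect, a short read surfaces as an error instead of a silently truncated message.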
Update documentation and doc comments to reflect the switch from nng to Unix domain sockets.
Remove libnng-dev (apt), nng (brew), and nng (vcpkg) install steps from CI and release workflows. Remove NNG_LIB_DIR env var and Homebrew depends_on "nng" since the nng dependency was replaced with tokio Unix sockets.
Remove discovery protocol types (DaemonAnnounce, DaemonQuery, DiscoveryMessage) that are no longer needed with Unix socket discovery. Add a descriptive LockRace error variant for the lock-file acquisition race condition.
Periodically remove finished worker handles from the map to prevent unbounded growth. Verify a socket is truly stale (not owned by a live process) before removing it on startup. Check the global socket for liveness in the daemon status command.
Add daemon_id, pid, and started_ts fields to DaemonStatus for better observability and stale daemon detection.
Use spawn_blocking for command execution and worker creation to avoid blocking the tokio runtime's cooperative scheduler.
Limit concurrent connections to 256 using a tokio Semaphore. When the limit is reached, new connections are dropped immediately with a warning log rather than accumulating unbounded tasks.
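The drop-when-full behaviour described above can be illustrated with a non-blocking permit counter. The commit uses a tokio Semaphore; the sketch below swaps in a std-only atomic counter so it stays self-contained, and the names `ConnectionLimiter` and `Permit` are hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hands out at most `max` permits at a time (std-only stand-in for a semaphore).
struct ConnectionLimiter {
    max: usize,
    active: AtomicUsize,
}

/// RAII permit: the slot is released when the connection task drops it.
struct Permit {
    limiter: Arc<ConnectionLimiter>,
}

impl ConnectionLimiter {
    fn new(max: usize) -> Arc<Self> {
        Arc::new(ConnectionLimiter { max, active: AtomicUsize::new(0) })
    }

    /// Non-blocking acquire: `None` means the caller should drop the
    /// connection (and log a warning) rather than queue it.
    fn try_acquire(limiter: &Arc<Self>) -> Option<Permit> {
        let mut current = limiter.active.load(Ordering::Acquire);
        loop {
            if current >= limiter.max {
                return None;
            }
            match limiter.active.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(Permit { limiter: Arc::clone(limiter) }),
                Err(observed) => current = observed, // lost a race; retry
            }
        }
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.limiter.active.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Tying the release to `Drop` means a connection task that panics or returns early still frees its slot.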
Mark the IPC client as poisoned after timeout or IO error to prevent reading stale data from the stream. Fix the retry loop so reconnection happens before send on each retry, not after failure.
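A rough sketch of the poison-then-reconnect pattern described above, under stated assumptions: the `Transport` trait, the `Client` type, and the `connect` factory are hypothetical stand-ins for the real IPC client, and timeouts are modelled simply as `io::Error` values.

```rust
use std::io;

/// Hypothetical transport abstraction standing in for the framed Unix stream.
trait Transport {
    fn round_trip(&mut self, request: &[u8]) -> io::Result<Vec<u8>>;
}

/// Client that refuses to reuse a stream after a failed exchange.
struct Client<T, F> {
    connect: F,      // factory that opens a fresh transport
    conn: Option<T>, // None means "not connected yet"
    poisoned: bool,  // set after a timeout or IO error
}

impl<T, F> Client<T, F>
where
    T: Transport,
    F: FnMut() -> io::Result<T>,
{
    fn new(connect: F) -> Self {
        Client { connect, conn: None, poisoned: false }
    }

    fn request(&mut self, request: &[u8], max_attempts: usize) -> io::Result<Vec<u8>> {
        let mut last_err = io::Error::new(io::ErrorKind::Other, "no attempts made");
        for _ in 0..max_attempts {
            // Reconnect *before* sending: a poisoned stream may still hold a
            // late reply to an earlier request.
            if self.poisoned || self.conn.is_none() {
                self.conn = None;
                match (self.connect)() {
                    Ok(conn) => {
                        self.conn = Some(conn);
                        self.poisoned = false;
                    }
                    Err(err) => {
                        last_err = err;
                        continue;
                    }
                }
            }
            if let Some(conn) = self.conn.as_mut() {
                match conn.round_trip(request) {
                    Ok(response) => return Ok(response),
                    Err(err) => {
                        self.poisoned = true; // never read from this stream again
                        last_err = err;
                    }
                }
            }
        }
        Err(last_err)
    }
}
```

Placing the reconnect at the top of the loop means the final failed attempt does not open a connection that is never used.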
Replace scattered per-connection clones with a single Arc<DaemonState> shared across all connection tasks. Unify the shutdown path so run() owns all cleanup (socket removal, worker shutdown), eliminating races between signal handler and supervisor. Fix shutdown_workers deadlock by draining worker handles before sending shutdown messages. Prevent orphaned workers by dropping the listener and waiting for all semaphore permits before draining.
Replace .unwrap() with .unwrap_or_default() on SystemTime::duration_since(UNIX_EPOCH) in worker.rs and lock.rs, matching the pattern already used in supervisor.rs. Prevents a panic if the system clock is before the Unix epoch.
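For illustration, the pattern looks roughly like this. The helper name `unix_millis` and the choice of milliseconds are assumptions made for the sketch; only the `unwrap_or_default()` call on `duration_since(UNIX_EPOCH)` comes from the commit message.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Milliseconds since the Unix epoch, or 0 if `now` is earlier than the epoch.
/// (`unix_millis` is a hypothetical helper, not a function from the PR.)
fn unix_millis(now: SystemTime) -> u128 {
    now.duration_since(UNIX_EPOCH)
        // `duration_since` returns Err when `now` precedes the epoch;
        // `unwrap_or_default` maps that to a zero Duration instead of panicking.
        .unwrap_or_default()
        .as_millis()
}
```

A zero timestamp is obviously wrong to a reader of the lock file or status output, which is arguably preferable to crashing the daemon over a misconfigured clock.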
Before provisioning a new actor, check whether a valid default actor already exists (repo config has default_actor and its directory contains a readable config). If so, skip creation and report the existing actor ID instead of silently overwriting it.

A new `action` field in JSON output distinguishes "created" from "existing". AGENTS.md handling runs either way since it was already idempotent.

To explicitly provision a fresh actor use `grite actor init` followed by `grite actor use <id>`.
Add repo_sled_path() returning .git/grite/sled - the single shared sled database location for the per-repo storage model. Export from the crate root alongside the existing actor_sled_path.
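A minimal sketch of such a helper, assuming it takes the repository's git directory as a parameter (the actual signature is not shown in this PR description):

```rust
use std::path::{Path, PathBuf};

/// Shared per-repository sled location: `<git_dir>/grite/sled`.
/// Taking `git_dir` as an argument is an assumption of this sketch.
fn repo_sled_path(git_dir: &Path) -> PathBuf {
    git_dir.join("grite").join("sled")
}
```

Building the path with `join` rather than string concatenation keeps the separator handling platform-independent.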
- open_store / sled_path use repo_sled_path(.git/grite/sled)
  instead of per-actor data_dir/sled
- execution_mode reads DaemonLock from .git/grite/daemon.lock
  instead of the per-actor data directory
- daemon start/stop/status use the same repo-level lock path
- grite init creates the shared sled at .git/grite/sled
Worker is now per-repository instead of per-(repo, actor):
- Remove data_dir field; add grite_dir (.git/grite) for lock
- Compute sled path via repo_sled_path(.git/grite/sled)
- actor_id moved from fixed Worker field to per-command in
  WorkerMessage::Command, parsed to bytes inside the event loop
- DaemonLock acquire/refresh/release use grite_dir
- Supervisor WorkerKey drops actor_id (keyed by repo_root only)
- create_worker takes owner_actor_id for the lock record only
Add a new orphaned_actors check that scans all actor directories under .git/grite/actors/, identifies actors that are not the current default, and counts events in their sleds that are absent from the current store.

With --fix, all missing events are copied into the current store and a rebuild is triggered to reapply projections in chronological order. This recovers issues that were created under an actor that was superseded (e.g. by accidentally running grite init again before it was made idempotent).

The check is skipped when the daemon holds the store lock, since direct sled access is not possible in that state.

In the shared-sled model all per-actor sled directories are legacy, not just non-default actors. Rename check_orphaned_actors to check_legacy_actor_sleds, remove the actor_id != current filter (all actors' sleds are candidates), and update messages to reflect that events are merged into the shared store rather than another actor's store.

After merging events from legacy per-actor sleds into the shared store, --fix now automatically removes the legacy sled directories. Also cleans up sleds where all events were already in the shared store.
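The check-then-merge logic described above can be sketched with in-memory maps. This is only an illustration: `EventTree` is a hypothetical stand-in for a sled tree keyed by event ID, and the rebuild and directory removal steps are left out.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for an event tree keyed by event ID (hypothetical type).
type EventTree = BTreeMap<Vec<u8>, Vec<u8>>;

/// Count events present in a legacy per-actor tree but absent from the shared one.
/// This corresponds to what the check reports without --fix.
fn count_missing(legacy: &EventTree, shared: &EventTree) -> usize {
    legacy.keys().filter(|id| !shared.contains_key(*id)).count()
}

/// Copy missing events into the shared tree and return how many were copied.
/// A real fix would then rebuild projections and remove the legacy directory.
fn merge_legacy(legacy: &EventTree, shared: &mut EventTree) -> usize {
    let mut copied = 0;
    for (id, event) in legacy {
        if !shared.contains_key(id) {
            shared.insert(id.clone(), event.clone());
            copied += 1;
        }
    }
    copied
}
```

Because events already present are skipped, running the merge a second time copies nothing, which matches the cleanup case where all events were already in the shared store.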

When running --fix, the daemon must be stopped first to ensure proper store access for integrity checks and repairs. The daemon is automatically restarted after fixes complete if it was running.
storage.md: reflect new .git/grite/sled shared database, move daemon.lock to repo level, simplify actor dirs to identity-only, update multi-actor section and cleanup commands, add migration guide for per-actor sled legacy artifacts.

doctor.md: add legacy_actor_sleds check to the checks table and add a dedicated section describing the check, its warning conditions, and the --fix resolution.
Skip push when no refs/grite/* exist instead of erroring on the glob refspec.
The daemon worker was only persisting events to sled, never to the git WAL. This meant sync had nothing to push since no refs/grite/* refs were ever created.

Add a persist_events helper that writes to both sled and WAL. WAL append is best-effort (logged warning on failure) so sled operations still succeed even if git is unavailable.
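The dual-write behaviour can be sketched as below. Only the name `persist_events` and the best-effort WAL semantics come from the commit message; the `EventSink` trait, `PersistOutcome`, `MemorySink`, and the use of `String` for events and errors are assumptions made to keep the example self-contained.

```rust
/// Hypothetical sink standing in for either the sled store or the git-backed WAL.
trait EventSink {
    fn append(&mut self, events: &[String]) -> Result<(), String>;
}

/// Outcome of a dual write; `wal_warning` is set when the WAL append failed.
#[derive(Debug, PartialEq)]
struct PersistOutcome {
    stored: usize,
    wal_warning: Option<String>,
}

fn persist_events<S: EventSink, W: EventSink>(
    store: &mut S,
    wal: &mut W,
    events: &[String],
) -> Result<PersistOutcome, String> {
    // The store write is authoritative: a failure here aborts the operation.
    store.append(events)?;
    // The WAL write is best-effort: log a warning but still report success.
    let wal_warning = wal.append(events).err();
    if let Some(reason) = &wal_warning {
        eprintln!("warning: WAL append failed: {}", reason);
    }
    Ok(PersistOutcome { stored: events.len(), wal_warning })
}

/// In-memory sink for exercising the helper; `fail` simulates an unavailable backend.
struct MemorySink {
    events: Vec<String>,
    fail: bool,
}

impl EventSink for MemorySink {
    fn append(&mut self, events: &[String]) -> Result<(), String> {
        if self.fail {
            return Err("sink unavailable".to_string());
        }
        self.events.extend_from_slice(events);
        Ok(())
    }
}
```

One consequence of this ordering is that the store can get ahead of the WAL, which is the gap the backfill commits later in this PR are meant to close.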
Detect when sled has events but WAL is empty and auto-repair by replaying all events into the WAL. This recovers from the daemon bug where events were only written to sled.
When WAL is empty but sled has events, automatically backfill the WAL from sled before pushing. Also expose this as a doctor check with --fix for manual repair.
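The backfill condition is narrow: it applies only when the WAL is empty and sled is not. A minimal sketch of that decision, with plain vectors standing in for both stores and a hypothetical function name:

```rust
/// Replay store events into the WAL when the WAL is empty but the store is not.
/// Returns the number of events replayed (0 when no backfill was needed).
/// `backfill_wal_if_empty` is a hypothetical name; vectors stand in for sled and the WAL.
fn backfill_wal_if_empty(store_events: &[String], wal: &mut Vec<String>) -> usize {
    // Only repair the specific state described above: WAL empty, store non-empty.
    if !wal.is_empty() || store_events.is_empty() {
        return 0;
    }
    // Replay in the store's order so the WAL reflects the original sequence.
    wal.extend(store_events.iter().cloned());
    wal.len()
}
```

Note that this sketch does not handle a partially populated WAL; the commit messages describe only the empty-WAL case.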
libgit2 push() does not expand glob refspecs; it treats the * as a literal ref name. Enumerate matching refs and build concrete refspecs like refs/grite/wal:refs/grite/wal.
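Building the concrete refspecs is a simple filter and map over the ref names. The sketch below assumes the names have already been enumerated from the repository and takes them as a slice; `concrete_refspecs` is a hypothetical name.

```rust
/// Build one `src:dst` refspec per ref under `refs/grite/`.
/// An empty result means there is nothing to push and the push can be skipped.
fn concrete_refspecs(all_refs: &[&str]) -> Vec<String> {
    all_refs
        .iter()
        .filter(|name| name.starts_with("refs/grite/"))
        // Push each ref to the same name on the remote.
        .map(|name| format!("{}:{}", name, name))
        .collect()
}
```

Returning an empty vector for a repository without grite refs also covers the earlier commit that skips the push instead of erroring on the glob.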
Add WAL backfill to the daemon's sync command path so that grite sync --push works even when the daemon holds the sled lock and the CLI can't open the store.
@dipankar dipankar merged commit 7b5f5cf into neul-labs:main Apr 13, 2026
3 of 7 checks passed
@jhult jhult deleted the chore/git/wal branch April 13, 2026 22:22

2 participants