feat(mcap): stateful replay daemon with RPC control#150
Conversation
Replace the one-shot mcap2keelson replayer (258 lines) with a long-lived stateful daemon (1328 lines) controllable over Zenoh RPC. - New McapReplayControl RPC service (12 procedures: load_file, play, pause, stop, seek, set_speed, set_loop, step, set_segment, set_channel_filter, list_files, get_status) with a STOPPED/PLAYING/PAUSED/LOADING state machine. - Broadcasts keelson.ReplayStatus on the `replay_status` subject (5 Hz while playing, 1 Hz idle) carrying DaemonInfo so clients can discover replayers from the stream alone; async load lifecycle/errors surface via load_progress_pct / last_load_error. - New interfaces/McapReplayControl.proto, messages/payloads/ ReplayStatus.proto, and the `replay_status` subject. Folds in the typed ErrorResponse.Code enum (NOT_FOUND, OUT_OF_RANGE, INVALID_STATE, PERMISSION_DENIED, ...) — the replay daemon is its sole consumer, mapping every RPC error to a code. Adds a 977-line e2e suite (24 tests) covering the RPC surface, state transitions, seek/segment/speed/step/channel-filter, async load, the path-traversal guard and the self-publish loopback guard; adapts the battle/cli/e2e tests to the daemon shape. The replay README section is rewritten; the keelson2mcap recording docs (dual @target -k patterns) on main are preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
freol35241
left a comment
There was a problem hiding this comment.
Review — 🔴 Changes requested
@TedSjoblom — flagging you as the owner for these changes (you authored the replay daemon on the trial branch; I've assigned the PR to you). (Posted as a COMMENT-event review rather than a formal "Request changes" because GitHub doesn't allow requesting changes on a PR you opened — treat this as changes-requested.)
Strong engineering overall — the concurrency model (single STATE_LOCK, the PAUSE_EVENT / COMMAND_EVENT split, the iterator-break pattern) is disciplined, and the e2e coverage is broad. Two things gate a merge: one blocking defect, plus a design question that determines whether the full control surface is justified.
Design question (please answer before more investment)
play / pause / seek / set_speed / step / segment-loop is a media-scrubber control surface, which is only justified by an interactive consumer — a UI that scrubs through a recording. If the real use is "replay a file onto the bus for a test/demo," the previous one-shot replayer was the right scope and this is a lot of machinery for it. Is there a replay-control UI (or a concrete plan for one)? The answer decides whether we proceed as-is or trim the surface.
Blocking
- Spin-wait pegs a CPU core while playing (inline at the timing loop). Must fix before merge.
Should-fix
- No-summary-statistics files load half-broken, silently — seek/segment/progress disabled with no error (inline).
prior_staterevert on load failure is dead/misleading (inline).
Nice-to-have
- The loop path (
set_loop+ EOF re-seek) reads correctly but has no end-to-end test — please add one.
Details inline. Happy to pair on any of these.
…k, dead revert Resolves the changes-requested review on the stateful replay daemon (PR #150): - Blocking: replace the time.sleep(0.0) spin-wait in _walk_iterator with bounded min(remaining, 5ms) sleeps so the long-lived daemon idles between messages instead of pegging a CPU core (and starving the status/RPC threads of the GIL) through quiet stretches of a recording. - Should-fix: recover start/end/count/channels by scanning files whose MCAP summary lacks statistics (new _scan_file, run off STATE_LOCK), so seek/set_segment/progress work — and the scrubber UI's timeline unlocks — instead of silently degrading to a zeroed [0,0] range. Extend the loopback guard to the scanned topics. - Should-fix: drop the dead prior_state capture/revert in _load_file; the load worker's forced STOPPED is the single honest load-failure transition. - Add e2e coverage: loop re-seek on EOF (climb-then-reset), and no-summary scan recovery (load + mid-file seek succeed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the thorough review — pushed Design question — yes, there's an interactive consumerThe You're right that This also reframes the no-summary item as a real UI bug rather than just degraded ergonomics: the scrubber locks its timeline when a file reports Changes in
|
_walk_iterator only reset its wall-clock timing anchor (first / reference_wall_ns) at start, seek, and step — never on resume-from-pause or a mid-playback speed change. The anchor keeps ticking through a pause, so on resume the messages that came 'due' during the pause burst out at once (the test shows 17 messages in ~50 us); a mid-file speed-up likewise bursts and a slow-down stalls. Both paths are driven by the crowsnest replay UI's pause button and live speed dropdown. Re-anchor when resuming from a pause or when the speed changed since the last message, mirroring the step/seek re-anchor already in the loop. Add test_resume_after_pause_does_not_burst, which asserts post-resume messages arrive spread over wall-clock time at the recording's ~50 ms cadence — it fails without the fix (burst within milliseconds). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@TedSjoblom — pushed Why trimYou confirmed (and the crowsnest
What else fell out for freeDropping Preserved untouchedYour three review fixes — bounded-sleep timing, the no-summary scan fallback, and the resume/speed-change clock re-anchor — are all intact. Tests
Net−447 lines of source. |
Drop step, set_segment, set_channel_filter, and get_status from the replay control interface, keeping the 8 procedures the crowsnest replay UI actually drives (list_files, load_file, play, pause, stop, seek, set_speed, set_loop). Removing get_status also removes its return type, so the duplicate RPC-local ReplayStatus/DaemonInfo — previously hand-synced with the broadcast payload — are gone; state is now read solely from the replay_status broadcast. Rename the interface McapReplayControl -> ReplayControl (and McapReplaySuccessResponse -> ReplaySuccessResponse, file -> interfaces/ReplayControl.proto). The control contract is recording-format-agnostic — nothing in it is MCAP-specific — so it follows the same naming rule as the other interfaces (VehicleControl, Configurable): named for the domain concept, not the connector that implements it. Format identity stays on the connector (mcap-replay). The service name is not part of the Zenoh RPC key, so this is wire- compatible; only generated type names change. Matches the already format-neutral keelson.ReplayStatus broadcast payload. Also remove the now-unused segment/filter fields from the keelson.ReplayStatus pub payload, the corresponding daemon state and walk-loop branches, and the dead request summarizers. The bounded-sleep, no-summary scan fallback, and resume/speed clock re-anchor fixes are preserved untouched. Tests: observe state via the replay_status broadcast instead of get_status; drop the step/segment/filter tests; add two data-plane e2e tests that were missing — payload fidelity + per-channel completeness on the original keys (multi-channel), and the --replay-key-tag /replay suffix on the wire. 22 replay e2e + 81 mcap unit + 102 SDK tests green. Follow-up filed as #156 (shared RPC-server dispatcher in scaffolding). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts: # messages/subjects.yaml
5da8d52 to
f52ddb4
Compare
|
Heads-up — renamed the interface and force-pushed the branch (re-fetch / reset your local before pulling).
Wire-compatible — the service name isn't part of the Zenoh RPC key ( |
Summary
Replaces the one-shot
mcap2keelsonreplayer (258 lines) with a long-lived stateful replay daemon (1328 lines) controllable over Zenoh RPC. PR 5 (final) of the5.1.0V-trialsplit, and the largest.What it does
McapReplayControlRPC service — 12 procedures:load_file,play,pause,stop,seek,set_speed,set_loop,step,set_segment,set_channel_filter,list_files,get_status. STOPPED / PLAYING / PAUSED / LOADING state machine; timing preserved on the walk;load_fileruns on a worker thread and returns "accepted" immediately.replay_statusbroadcast (keelson.ReplayStatus) at 5 Hz while playing / 1 Hz idle, carryingDaemonInfo(version/host/base-dir) so clients can discover replayers from the stream; async load progress + last error surface via the status.interfaces/McapReplayControl.proto,messages/payloads/ReplayStatus.proto,replay_statussubject.Folded-in prerequisite
Includes the typed
ErrorResponse.Codeenum (was the closed #144) — the replay daemon is its only consumer, mapping every RPC failure to a code (NOT_FOUND,OUT_OF_RANGE,INVALID_STATE,PERMISSION_DENIED, …).Applied onto current main
Main hadn't touched
mcap2keelson.pyor the test files since the trial's base, so those are taken as-authored. The only divergence wasmcap/README.md: I kept main's recording-example changes (the dual@target-kpatterns) and swapped in the trial's rewritten "## MCAP Replay" section. No new dependencies (requirements.txt/lock unchanged).Validation
uv run pytest -m "not e2e" connectors/mcap/— 81 passed.uv run pytest -m e2e connectors/mcap/— 48 passed (incl. the 24-test 977-line replay RPC e2e: state transitions, seek/segment/speed/step/channel-filter, async load LOADING→PAUSED, path-traversalPERMISSION_DENIED, loopback guard).mcap2keelson --helpruns; ruff + black clean; Python + JS SDKs regenerate cleanly.Review notes — recommended follow-ups (not fixed here; faithful extraction)
From the earlier review of this code, worth addressing before relying on it in production:
start/end/countdefault to 0, so seek/segment/progress silently break (the old replayer fell back to a scan). Not covered by tests.sleep(0.0)) can peg a core on sparse recordings now that it's a long-lived daemon._load_fileprior_staterevert capturesLOADINGrather than the true pre-load state (self-corrects via the worker's outer handler, but the revert path is dead).set_loop+ EOF re-seek) is toggled but not exercised end-to-end.Independent of PR 3 (#147, two-stage shutdown), but benefits from it — this is the first long-lived blocking daemon.
🤖 Generated with Claude Code