Surface RIMAPI action errors to agents + fix load-settle race #19

Open
jkbennitt wants to merge 1 commit into master from fix/post-live-test-findings


Summary

Two fixes, plus a related telemetry fix, from live-game testing (2026-05-16 smoke run on Nemotron 120B-A12B, Crashlanded). Agents were re-proposing the same invalid action every tick because RIMAPI answers semantic errors with an HTTP 500, which was opaque to the agent loop.

  • Load-settle race in run_scenario.py: polling stopped at the first population > 0 reading (~4s after load_game), but RIMAPI returns 200 before Unity finishes applying the load, so the immediately following unforbid_all_items() POST got a 500. The script now requires 5 consecutive stable-population checks (10s floor) before issuing writes.
  • Action error feedback loop: ExecutionResult.outcomes now carries one ActionOutcome per dispatched action, with a success flag and a cleaned-up error message (unwrapped from RIMAPI's {"errors":[...]} JSON envelope). game_loop broadcasts a STATUS_UPDATE containing the failed outcomes via CentralPost, so agents see "DO NOT REPEAT" context on the next tick. This removes the "researcher keeps re-proposing already-finished Electricity" loop observed live.
  • Pre-existing positional telemetry bug fixed: the i < executed check assumed actions failed in order; telemetry now uses per-action outcome flags.
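
To make the load-settle fix concrete, here is a minimal sketch of a "N consecutive stable reads" poll. It is not the code in run_scenario.py: the function name, the `get_population` callable, the 2s interval, and the timeout are all assumptions chosen so that 5 checks give roughly the 10s floor described above.

```python
import time
from typing import Callable, Optional


def wait_for_stable_population(
    get_population: Callable[[], int],  # hypothetical: returns current colonist count
    required_stable: int = 5,
    interval_s: float = 2.0,            # assumed poll interval (5 checks x 2s = 10s floor)
    timeout_s: float = 120.0,           # assumed upper bound, not from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return the population once it is > 0 and unchanged for N consecutive polls."""
    stable = 0
    last: Optional[int] = None
    waited = 0.0
    while waited < timeout_s:
        # Sleep before each read so N stable reads take at least N * interval_s.
        sleep(interval_s)
        waited += interval_s
        current = get_population()
        if current <= 0:
            stable = 0          # game not loaded yet
        elif current == last:
            stable += 1         # same non-zero value as the previous poll
        else:
            stable = 1          # first sighting of a new non-zero value
        last = current
        if stable >= required_stable:
            return current
    raise TimeoutError(f"population not stable after {timeout_s:.0f}s")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a caller would only issue writes such as unforbid_all_items() after this returns.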

Diff: 4 files, +167 / -13. 382 tests pass; ruff + mypy strict clean.

Why this matters for the benchmark

Agents currently can't learn from their own mistakes within a run: they see opaque server errors, retry the same invalid combo, and burn deliberation budget on no-ops. The ActionOutcome + CentralPost broadcast provides both observability and an agent learning loop. It's the first part of the observability floor in the broader Phase A restructuring.
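
As an illustration of the feedback path, the sketch below unwraps an {"errors":[...]} envelope into a readable message and builds the text an agent might receive. The field names on `ActionOutcome`, the helper names, and the message wording are assumptions for illustration; the real dataclass and the CentralPost payload may differ.

```python
import json
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ActionOutcome:
    # Hypothetical fields; the PR only states "success + cleaned-up error message".
    action: str
    success: bool
    error: Optional[str] = None


def unwrap_error(body: str) -> str:
    """Extract a readable message from an {"errors": [...]} body, else return the raw text."""
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return body.strip()
    if isinstance(payload, dict) and isinstance(payload.get("errors"), list):
        # Assumes entries are strings; anything else is re-serialized as JSON.
        parts = [e if isinstance(e, str) else json.dumps(e) for e in payload["errors"]]
        if parts:
            return "; ".join(parts)
    return body.strip()


def failure_summary(outcomes: Sequence[ActionOutcome]) -> Optional[str]:
    """Build the next-tick context for failed actions, or None if all succeeded."""
    failed = [o for o in outcomes if not o.success]
    if not failed:
        return None
    lines = [f"- {o.action}: {o.error}" for o in failed]
    return "DO NOT REPEAT these failed actions:\n" + "\n".join(lines)
```

Returning `None` when nothing failed lets the caller skip the broadcast entirely on clean ticks.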

Test plan

  • pytest — 382 pass (4 new outcome-capture tests)
  • ruff check src/ tests/ scripts/ — clean
  • mypy src/ — strict clean
  • Re-run the same live smoke config (Nemotron 120B-A12B, Crashlanded, 10 ticks) on this branch to verify mood (0.408) and research (0.226) move under the new feedback loop — tracked as A10 in the project plan.

🤖 Generated with Claude Code

Two fixes from live-game testing that caused agents to fail and re-fail
the same invalid action every tick:

1) Load-settle race in run_scenario.py: the polling loop broke on first
   `population > 0`, which was ~4s after load_game. RIMAPI returns 200
   before Unity's main thread finishes applying the load, so the
   immediately-following unforbid_all_items() POST got 500'd. Now require
   5 consecutive stable-population checks (10s floor) before writes.

2) Surface per-action errors back to agents via CentralPost:
   - ExecutionResult.outcomes now carries an ActionOutcome per dispatched
     action with success + a cleaned-up error message (unwrapped from
     RIMAPI's {"errors":[...]} JSON envelope).
   - game_loop broadcasts STATUS_UPDATE with failed outcomes so agents
     see "DO NOT REPEAT" context next tick. Removes the "researcher
     keeps re-proposing already-finished Electricity" loop observed in
     the live run.
   - Also fixed pre-existing position-based telemetry bug
     (`i < executed` assumed actions failed in order).
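
A small sketch of why the positional check mislabels actions. It assumes `executed` was a count of successful actions; that reading and all names below are illustrative, not taken from the diff.

```python
from typing import List


def positional_success(total: int, executed: int) -> List[bool]:
    """Old-style guess: treat the first `executed` actions as the successes."""
    return [i < executed for i in range(total)]


# Hypothetical tick: the second of three actions is the one that failed.
actual = [True, False, True]
executed = sum(actual)  # 2 actions succeeded

guessed = positional_success(len(actual), executed)

# Indices where the positional guess disagrees with the per-action flags.
mislabelled = [i for i, (g, a) in enumerate(zip(guessed, actual)) if g != a]
```

Here the positional guess marks the failed action as a success and the last action as a failure, which per-action outcome flags avoid.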

4 new unit tests cover the outcome capture paths.
Tests: 382 pass. ruff/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>