
feat(gym): add multi-file-refactor and flaky-test-triage scenarios #1224

Closed
rysweet wants to merge 1 commit into main from engineer/add-more-gym-benchmark-scenarios-1777054629-85482b

rysweet (Owner) commented Apr 24, 2026

Extends the gym benchmark suite with two more diverse scenarios beyond the bug-localization and long-context entries that landed in 1825941.

Scenarios added

  • multi-file-refactor-rename-symbol (Refactoring class, simard-engineer identity, local-harness, SingleProcess) — asks the agent to plan a behaviour-preserving rename of the public type BenchmarkScenario to GymScenario across the crate. The agent must enumerate call sites, propose an ordered edit sequence, and list verification gates (cargo check, cargo test --lib -- gym, cargo clippy --all-targets) without actually performing the rename.
  • flaky-test-triage-from-history (Debugging class, simard-engineer identity, local-harness, SingleProcess) — supplies a 6-run intermittent CI history for a single Rust integration test and asks the agent to classify the flake category (timing/race, ordering/iteration, external-dependency, resource-leak, or rng-seed), state a root-cause hypothesis, and propose a stabilisation plus verification strategy.
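The five flake categories named above lend themselves to a closed enum, which makes it easy to check an agent's answer against the allowed labels. The following is a minimal sketch only: `FlakeCategory`, its `FromStr` impl, and `is_intermittent` are hypothetical names introduced for illustration, and the scenario's actual grading logic is not shown in this PR.

```rust
use std::str::FromStr;

/// Hypothetical enum mirroring the five flake categories listed in the scenario.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FlakeCategory {
    TimingRace,
    OrderingIteration,
    ExternalDependency,
    ResourceLeak,
    RngSeed,
}

impl FromStr for FlakeCategory {
    type Err = String;

    /// Parses the short-form label an agent might emit (case-insensitive).
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "timing/race" => Ok(Self::TimingRace),
            "ordering/iteration" => Ok(Self::OrderingIteration),
            "external-dependency" => Ok(Self::ExternalDependency),
            "resource-leak" => Ok(Self::ResourceLeak),
            "rng-seed" => Ok(Self::RngSeed),
            other => Err(format!("unknown flake category: {other}")),
        }
    }
}

/// A run history (true = pass, false = fail) is intermittent when it
/// contains at least one pass and at least one failure.
fn is_intermittent(runs: &[bool]) -> bool {
    runs.iter().any(|&passed| passed) && runs.iter().any(|&passed| !passed)
}
```

With these definitions, `"Timing/Race".parse::<FlakeCategory>()` yields `Ok(FlakeCategory::TimingRace)`, and any label outside the five categories is rejected rather than silently accepted.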

Both scenarios follow the same pattern as the bug-localize-from-cargo-test-output scenario added in 1825941 (verbatim evidence block in objective, structured short-form output contract).
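To make the shape of the two entries concrete, here is a rough sketch of how they might be expressed as static data. Everything in it is an assumption made for illustration: `ScenarioSketch`, its field names, and the enums are hypothetical stand-ins, the objective strings are one-line summaries rather than the real objective text, and the actual `BenchmarkScenario` definition in src/gym/scenarios.rs may look quite different.

```rust
// Hypothetical stand-ins: the real definitions live in src/gym/scenarios.rs
// and may use different names, fields, and variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScenarioClass {
    Refactoring,
    Debugging,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Topology {
    SingleProcess,
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
struct ScenarioSketch {
    id: &'static str,
    class: ScenarioClass,
    identity: &'static str,
    harness: &'static str,
    topology: Topology,
    // Summary only; the real objective embeds a verbatim evidence block.
    objective: &'static str,
}

static NEW_SCENARIOS: [ScenarioSketch; 2] = [
    ScenarioSketch {
        id: "multi-file-refactor-rename-symbol",
        class: ScenarioClass::Refactoring,
        identity: "simard-engineer",
        harness: "local-harness",
        topology: Topology::SingleProcess,
        objective: "Plan a behaviour-preserving rename of BenchmarkScenario to GymScenario.",
    },
    ScenarioSketch {
        id: "flaky-test-triage-from-history",
        class: ScenarioClass::Debugging,
        identity: "simard-engineer",
        harness: "local-harness",
        topology: Topology::SingleProcess,
        objective: "Classify the flake from a 6-run CI history and propose a fix.",
    },
];

/// Looks up a sketch entry by its scenario id.
fn find_scenario(id: &str) -> Option<&'static ScenarioSketch> {
    NEW_SCENARIOS.iter().find(|s| s.id == id)
}
```

Keeping the entries in a fixed-size static array means the array length is part of the type, so adding an entry without updating the declared length fails at compile time.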

Mechanics

  • Bumps BENCHMARK_SCENARIOS length from 149 → 151 in src/gym/scenarios.rs.
  • Updates the count assertion in src/gym/tests_scenarios.rs to 151.
  • cargo test --lib -- gym passes locally (418 passed, 0 failed).
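The pairing of a length bump with a count assertion can be illustrated in miniature. This is a sketch under stated assumptions, not the contents of src/gym/tests_scenarios.rs: `SCENARIO_IDS`, `EXPECTED_COUNT`, and `duplicate_ids` are hypothetical, and the table here holds three ids taken from this PR description rather than the full 151 entries.

```rust
use std::collections::HashSet;

// Hypothetical miniature of the scenario table; the real one has 151 entries.
static SCENARIO_IDS: [&str; 3] = [
    "bug-localize-from-cargo-test-output",
    "multi-file-refactor-rename-symbol",
    "flaky-test-triage-from-history",
];

// The count the test suite expects; it must be bumped together with the table.
const EXPECTED_COUNT: usize = 3;

/// Returns every id that appears more than once, in order of repeat occurrence.
fn duplicate_ids<'a>(ids: &[&'a str]) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    ids.iter()
        .copied()
        // `insert` returns false when the id was already present.
        .filter(|id| !seen.insert(*id))
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn scenario_count_and_ids_are_consistent() {
        assert_eq!(SCENARIO_IDS.len(), EXPECTED_COUNT);
        assert!(duplicate_ids(&SCENARIO_IDS).is_empty());
    }
}
```

A uniqueness check alongside the count is an optional extra; it would catch a copy-pasted entry whose id was never changed, which a bare length assertion cannot detect.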

Refs #1210

…1210)

Adds two more diverse BenchmarkScenario entries to src/gym/scenarios.rs,
extending the work landed in 1825941:

- multi-file-refactor-rename-symbol (Refactoring, simard-engineer,
  local-harness, SingleProcess) asks the agent to plan a behaviour-
  preserving rename of a public type across the crate, enumerating call
  sites and verification gates without performing the edit.

- flaky-test-triage-from-history (Debugging, simard-engineer,
  local-harness, SingleProcess) supplies a 6-run intermittent CI history
  for a single test and asks the agent to classify the flake category
  and propose a stabilisation plus verification strategy.

Bumps BENCHMARK_SCENARIOS length 149 -> 151 and updates the count
assertion in src/gym/tests_scenarios.rs. cargo test --lib -- gym
passes (418 passed, 0 failed).

Refs #1210

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

rysweet (Owner, Author) commented Apr 24, 2026

Closing as outdated. This PR was opened before #1221 / #1225 landed; rebasing it onto current main would delete the scenarios already added by those merged PRs (memory-recall-consolidated-episode, multi-file-rename-public-api, plan-decomposition-vague-goal, failure-recovery-dirty-worktree). The two scenarios this PR uniquely adds (multi-file-refactor-rename-symbol, flaky-test-triage-from-history) overlap conceptually with what's already in tree; if the daemon decides they're still wanted, she'll re-spawn them on top of current main.

@rysweet rysweet closed this Apr 24, 2026
@rysweet rysweet deleted the engineer/add-more-gym-benchmark-scenarios-1777054629-85482b branch April 24, 2026 19:17