feat(gym): add multi-file-refactor and flaky-test-triage scenarios by rysweet · Pull Request #1224 · rysweet/Simard

rysweet · 2026-04-24T18:24:30Z

Extends the gym benchmark suite with two additional scenarios that broaden its coverage beyond the bug-localization and long-context entries that landed in 1825941.

Scenarios added

multi-file-refactor-rename-symbol (Refactoring class, simard-engineer identity, local-harness, SingleProcess) — asks the agent to plan a behaviour-preserving rename of the public type BenchmarkScenario to GymScenario across the crate. The agent must enumerate call sites, propose an ordered edit sequence, and list verification gates (cargo check, cargo test --lib -- gym, cargo clippy --all-targets) without actually performing the rename.
flaky-test-triage-from-history (Debugging class, simard-engineer identity, local-harness, SingleProcess) — supplies a 6-run intermittent CI history for a single Rust integration test and asks the agent to classify the flake category (timing/race, ordering/iteration, external-dependency, resource-leak, or rng-seed), state a root-cause hypothesis, and propose a stabilisation plus verification strategy.
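
The five flake categories named above form a closed set, so the agent's short-form answer can be checked mechanically. A minimal sketch, assuming the answer carries the category as one of the labels quoted in the description; the `FlakeCategory` type and its `parse` helper are hypothetical and are not taken from the Simard crate:

```rust
/// Hypothetical closed set of flake categories, using the labels from the
/// scenario description. Not the crate's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FlakeCategory {
    TimingRace,
    OrderingIteration,
    ExternalDependency,
    ResourceLeak,
    RngSeed,
}

impl FlakeCategory {
    /// Every category, in the order the description lists them.
    pub const ALL: [FlakeCategory; 5] = [
        FlakeCategory::TimingRace,
        FlakeCategory::OrderingIteration,
        FlakeCategory::ExternalDependency,
        FlakeCategory::ResourceLeak,
        FlakeCategory::RngSeed,
    ];

    /// The short-form label for this category.
    pub fn label(self) -> &'static str {
        match self {
            FlakeCategory::TimingRace => "timing/race",
            FlakeCategory::OrderingIteration => "ordering/iteration",
            FlakeCategory::ExternalDependency => "external-dependency",
            FlakeCategory::ResourceLeak => "resource-leak",
            FlakeCategory::RngSeed => "rng-seed",
        }
    }

    /// Parses a label, ignoring surrounding whitespace and ASCII case.
    /// Returns `None` for anything outside the closed set.
    pub fn parse(raw: &str) -> Option<FlakeCategory> {
        let wanted = raw.trim().to_ascii_lowercase();
        Self::ALL
            .iter()
            .copied()
            .find(|category| category.label() == wanted.as_str())
    }
}
```

Rejecting unknown labels rather than mapping them to a catch-all keeps a malformed answer distinguishable from a wrong classification.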

Both scenarios follow the same pattern as the bug-localize-from-cargo-test-output scenario added in 1825941 (verbatim evidence block in objective, structured short-form output contract).
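
For orientation, the two entries could be pictured roughly as below. This is only a sketch: the field names, the enum types, and the placeholder objective strings are assumptions made for illustration, and the real BenchmarkScenario definition in src/gym/scenarios.rs may be shaped quite differently. Only the ids, class names, identity, harness, and topology values come from the description above.

```rust
/// Hypothetical scenario classes; only the two used by this PR are shown.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScenarioClass {
    Refactoring,
    Debugging,
}

/// Hypothetical topology marker; the description mentions SingleProcess only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Topology {
    SingleProcess,
}

/// Assumed shape of a scenario entry (field names are illustrative).
#[derive(Debug, Clone, Copy)]
pub struct BenchmarkScenario {
    pub id: &'static str,
    pub class: ScenarioClass,
    pub identity: &'static str,
    pub harness: &'static str,
    pub topology: Topology,
    /// Holds the verbatim evidence block and the output contract.
    /// The strings below are placeholders, not the PR's actual text.
    pub objective: &'static str,
}

pub const NEW_SCENARIOS: [BenchmarkScenario; 2] = [
    BenchmarkScenario {
        id: "multi-file-refactor-rename-symbol",
        class: ScenarioClass::Refactoring,
        identity: "simard-engineer",
        harness: "local-harness",
        topology: Topology::SingleProcess,
        objective: "<placeholder: rename plan objective and output contract>",
    },
    BenchmarkScenario {
        id: "flaky-test-triage-from-history",
        class: ScenarioClass::Debugging,
        identity: "simard-engineer",
        harness: "local-harness",
        topology: Topology::SingleProcess,
        objective: "<placeholder: 6-run CI history and output contract>",
    },
];
```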

Mechanics

Bumps BENCHMARK_SCENARIOS length from 149 → 151 in src/gym/scenarios.rs.
Updates the count assertion in src/gym/tests_scenarios.rs to 151.
cargo test --lib -- gym passes locally (418 passed, 0 failed).
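
The length bump and the count assertion guard the same invariant from two sides. A minimal sketch of that mechanism, assuming the table is a fixed-size array; the three-entry table, the one-field struct, and the `ids_are_unique` helper are stand-ins for illustration, not the crate's code (the real table has 151 entries after this PR):

```rust
use std::collections::HashSet;

/// Reduced stand-in for the real scenario type.
pub struct BenchmarkScenario {
    pub id: &'static str,
}

/// The array length is part of the type, so adding an entry without
/// bumping the length is a compile error.
pub const BENCHMARK_SCENARIOS: [BenchmarkScenario; 3] = [
    BenchmarkScenario { id: "bug-localize-from-cargo-test-output" },
    BenchmarkScenario { id: "multi-file-refactor-rename-symbol" },
    BenchmarkScenario { id: "flaky-test-triage-from-history" },
];

/// Hypothetical extra check: returns true when no two scenarios share an id.
pub fn ids_are_unique(scenarios: &[BenchmarkScenario]) -> bool {
    let mut seen = HashSet::new();
    // `insert` returns false on a repeated id, which stops `all` early.
    scenarios.iter().all(|scenario| seen.insert(scenario.id))
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Pins the expected count so that an accidental deletion (for example
    /// during a rebase) fails the test rather than passing silently.
    #[test]
    fn scenario_count_is_pinned() {
        assert_eq!(BENCHMARK_SCENARIOS.len(), 3);
        assert!(ids_are_unique(&BENCHMARK_SCENARIOS));
    }
}
```

With this layout the compiler catches additions that forget the length bump, while the pinned count in the test catches entries that disappear.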

Refs #1210

…1210)

Adds two more diverse BenchmarkScenario entries to src/gym/scenarios.rs, extending the work landed in 1825941:

- multi-file-refactor-rename-symbol (Refactoring, simard-engineer, local-harness, SingleProcess) asks the agent to plan a behaviour-preserving rename of a public type across the crate, enumerating call sites and verification gates without performing the edit.
- flaky-test-triage-from-history (Debugging, simard-engineer, local-harness, SingleProcess) supplies a 6-run intermittent CI history for a single test and asks the agent to classify the flake category and propose a stabilisation plus verification strategy.

Bumps BENCHMARK_SCENARIOS length 149 -> 151 and updates the count assertion in src/gym/tests_scenarios.rs. cargo test --lib -- gym passes (418 passed, 0 failed).

Refs #1210

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

rysweet · 2026-04-24T19:17:53Z

Closing as outdated. This PR was opened before #1221 / #1225 landed; rebasing it onto current main would delete the scenarios already added by those merged PRs (memory-recall-consolidated-episode, multi-file-rename-public-api, plan-decomposition-vague-goal, failure-recovery-dirty-worktree). The two scenarios this PR uniquely adds (multi-file-refactor-rename-symbol, flaky-test-triage-from-history) overlap conceptually with what's already in tree; if the daemon decides they're still wanted, she'll re-spawn them on top of current main.

rysweet closed this Apr 24, 2026

rysweet deleted the engineer/add-more-gym-benchmark-scenarios-1777054629-85482b branch April 24, 2026 19:17

rysweet wants to merge 1 commit into main from engineer/add-more-gym-benchmark-scenarios-1777054629-85482b
