feat(gym): add multi-file-refactor and flaky-test-triage scenarios #1224
Closed
…1210)

Adds two more diverse BenchmarkScenario entries to src/gym/scenarios.rs, extending the work landed in 1825941:

- multi-file-refactor-rename-symbol (Refactoring, simard-engineer, local-harness, SingleProcess) asks the agent to plan a behaviour-preserving rename of a public type across the crate, enumerating call sites and verification gates without performing the edit.
- flaky-test-triage-from-history (Debugging, simard-engineer, local-harness, SingleProcess) supplies a 6-run intermittent CI history for a single test and asks the agent to classify the flake category and propose a stabilisation plus verification strategy.

Bumps BENCHMARK_SCENARIOS length 149 -> 151 and updates the count assertion in src/gym/tests_scenarios.rs. cargo test --lib -- gym passes (418 passed, 0 failed).

Refs #1210

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closing as outdated. This PR was opened before #1221 / #1225 landed; rebasing it onto current main would delete the scenarios already added by those merged PRs (…
Extends the gym benchmark suite with two more diverse scenarios beyond the bug-localization and long-context entries that landed in 1825941.
Scenarios added

- multi-file-refactor-rename-symbol (Refactoring class, simard-engineer identity, local-harness, SingleProcess): asks the agent to plan a behaviour-preserving rename of the public type BenchmarkScenario to GymScenario across the crate. The agent must enumerate call sites, propose an ordered edit sequence, and list verification gates (cargo check, cargo test --lib -- gym, cargo clippy --all-targets) without actually performing the rename.
- flaky-test-triage-from-history (Debugging class, simard-engineer identity, local-harness, SingleProcess): supplies a 6-run intermittent CI history for a single Rust integration test and asks the agent to classify the flake category (timing/race, ordering/iteration, external-dependency, resource-leak, or rng-seed), state a root-cause hypothesis, and propose a stabilisation plus verification strategy.

Both scenarios follow the same pattern as the bug-localize-from-cargo-test-output scenario added in 1825941 (verbatim evidence block in objective, structured short-form output contract).

Mechanics

- Bumps BENCHMARK_SCENARIOS length from 149 → 151 in src/gym/scenarios.rs.
- Updates the count assertion in src/gym/tests_scenarios.rs to 151.
- cargo test --lib -- gym passes locally (418 passed, 0 failed).

Refs #1210
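For readers without the diff at hand, here is a rough sketch of what a scenario registry with a count check might look like. It is an illustration under assumptions, not the code in src/gym/scenarios.rs: the field names (id, class, identity, base_type, topology, objective), the enum definitions, and the helpers find_scenario and ids_are_unique are all hypothetical. The objective strings are placeholders, and the array holds only the two entries from this PR rather than the full 151.

```rust
use std::collections::HashSet;

// Hypothetical shape: the real BenchmarkScenario definition is not shown
// in this PR, so every field and variant below is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScenarioClass {
    Refactoring,
    Debugging,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Topology {
    SingleProcess,
}

#[derive(Debug, Clone, Copy)]
pub struct BenchmarkScenario {
    pub id: &'static str,
    pub class: ScenarioClass,
    pub identity: &'static str,
    pub base_type: &'static str,
    pub topology: Topology,
    pub objective: &'static str,
}

// Only the two scenarios described in this PR; the real array has 151 entries.
pub static BENCHMARK_SCENARIOS: [BenchmarkScenario; 2] = [
    BenchmarkScenario {
        id: "multi-file-refactor-rename-symbol",
        class: ScenarioClass::Refactoring,
        identity: "simard-engineer",
        base_type: "local-harness",
        topology: Topology::SingleProcess,
        // Placeholder text, not the scenario's actual objective.
        objective: "Plan a behaviour-preserving rename without performing it.",
    },
    BenchmarkScenario {
        id: "flaky-test-triage-from-history",
        class: ScenarioClass::Debugging,
        identity: "simard-engineer",
        base_type: "local-harness",
        topology: Topology::SingleProcess,
        // Placeholder text, not the scenario's actual objective.
        objective: "Classify the flake and propose a stabilisation plan.",
    },
];

/// Looks up a scenario by its id (hypothetical helper).
pub fn find_scenario(id: &str) -> Option<&'static BenchmarkScenario> {
    BENCHMARK_SCENARIOS.iter().find(|s| s.id == id)
}

/// Returns true when no two scenarios share an id (hypothetical helper).
pub fn ids_are_unique(scenarios: &[BenchmarkScenario]) -> bool {
    let mut seen = HashSet::new();
    // HashSet::insert returns false on a duplicate, which stops `all` early.
    scenarios.iter().all(|s| seen.insert(s.id))
}
```

A count assertion of the kind this PR updates would then compare BENCHMARK_SCENARIOS.len() against the expected total. Pairing it with a uniqueness check like ids_are_unique is one way to catch an entry dropped or duplicated during a rebase, which is the failure mode the closing comment describes.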