Can one agent help another agent get better, one commit at a time?
This repo is a tiny prompt-engineering lab with a simple setup:
- **Codex** (gpt-5.4, reasoning: high) is the improver.
- **GitHub Copilot CLI** (gpt-5-mini) is the agent under test.
- **GitHub MCP tools** provide live repository issues.
- **Katt** runs the eval three times and records pass/fail, time, and token usage.
The goal is intentionally small: ask Copilot CLI to list the 5 most recent GitHub issues for a repo, then let Codex improve the instructions step by step until the answer becomes reliable.
This makes the experiment easy to audit:
- Git shows every prompt change.
- `results/` keeps the before/after evidence.
- The eval runs 3 times to expose randomness instead of hiding it.
Codex was not trying to be magical. It was doing engineering:
- Read the failure.
- Make one focused change.
- Commit it.
- Run the eval again.
- Keep what helps. Fix what does not.
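The loop above can be sketched in a few lines of JavaScript. Everything here is a hypothetical stand-in, not the real Katt or Codex interface: `runEval`, `applyChange`, and the shape of their results are assumptions made for illustration.

```javascript
// Hypothetical sketch of the improve-measure loop.
// runEval() returns e.g. { passes: 0, runs: 3 }; applyChange() makes
// one focused, committed edit and returns a handle that can revert it.
function improveUntilReliable(runEval, applyChange, maxIterations = 10) {
  let best = runEval(); // baseline measurement
  for (let i = 0; i < maxIterations && best.passes < best.runs; i++) {
    const change = applyChange(); // one focused change, committed
    const result = runEval();     // run the eval again
    if (result.passes > best.passes) {
      best = result;              // keep what helps
    } else {
      change.revert();            // fix (undo) what does not
    }
  }
  return best;
}
```

The point of the sketch is the shape, not the API: a baseline, one change at a time, a measurement after each change, and an explicit keep-or-revert decision.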
```mermaid
flowchart TD
    A["Read self-improvment.prompt.md"] --> B["Inspect results and failures"]
    B --> C["Edit task.prompt.md or task.eval.js"]
    C --> D["Commit the change"]
    D --> E["Run Katt against Copilot CLI"]
    E --> F["Measure pass/fail, time, tokens"]
    F --> G{"Better?"}
    G -->|Yes| H["Keep the change and continue"]
    G -->|No| B
    H --> B
```
In short: one agent coached another, and Git acted like the lab notebook.
The interesting part is that Codex did not just keep piling on more prompt text. It discovered that there were two different problems:
- The tested agent needed clearer instructions.
- The evaluator itself was brittle because it compared live GitHub data against stale snapshots.
That is the real story of the experiment.
```mermaid
gitGraph
    commit id: "bc096d4" tag: "baseline"
    commit id: "df9761c" tag: "format"
    commit id: "5ca4ce4" tag: "ranking"
    commit id: "422c325" tag: "live-eval"
    commit id: "93df064" tag: "live-data"
    commit id: "fca11b8" tag: "evidence"
```
| Commit | What changed | Why it mattered |
|---|---|---|
| `bc096d4` | Baseline experiment setup | A minimal prompt created a clean starting point. |
| `df9761c` | Forced exact JSON output, exact fields, exact count, and no extra text | This attacked the first failure mode: Copilot answered conversationally instead of matching the expected schema. |
| `5ca4ce4` | Added explicit fetch depth, sorting, and tie-breaking rules | This reduced ranking mistakes when issues had very similar timestamps. |
| `422c325` | Switched the eval from stale snapshots to live GitHub issue validation and pinned the repo source | This was the big insight: sometimes the test is wrong, not just the agent. |
| `93df064` | Forced live MCP issue data, blocked local-file shortcuts, and required unlabeled issues to be included | This stopped the model from falling back to stale examples inside the repo. |
| `fca11b8` | Saved the before/after runs and Codex reasoning log | This preserved the experiment as evidence instead of just a final state. |
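To make the schema and ranking commits concrete, here is what a strict check in their spirit might look like. This is a sketch, not the actual task.eval.js: the field names, the count of five, and the number-based tie-breaker are assumptions drawn from the table above.

```javascript
// Hypothetical strict eval check in the spirit of df9761c and 5ca4ce4.
// output: the agent's raw answer; liveIssues: issues fetched live.
function checkIssueList(output, liveIssues) {
  let parsed;
  try {
    parsed = JSON.parse(output); // no extra text allowed around the JSON
  } catch {
    return false;                // conversational answers fail immediately
  }
  if (!Array.isArray(parsed) || parsed.length !== 5) return false;
  // Exact fields, nothing more, nothing less (assumed field set).
  const fields = ["created_at", "number", "title"];
  for (const issue of parsed) {
    if (Object.keys(issue).sort().join() !== fields.join()) return false;
  }
  // Compare against live data: newest first, with the issue number as a
  // tie-breaker when timestamps match.
  const expected = [...liveIssues]
    .sort((a, b) =>
      b.created_at.localeCompare(a.created_at) || b.number - a.number)
    .slice(0, 5);
  return expected.every((e, i) => parsed[i].number === e.number);
}
```

Validating against freshly fetched issues rather than a stored snapshot is exactly the shift the `422c325` commit describes: the judge checks reality, not a stale copy of it.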
The final result was better reliability, but not cheaper execution.
| Metric | Before | After | Change |
|---|---|---|---|
| Passing evals | 0 / 3 | 3 / 3 | +3 |
| Total runtime | 286,552 ms | 364,460 ms | +27% |
| Average runtime per eval | 95.5 s | 121.5 s | +26.0 s |
| Total tokens used | 329,625 | 893,151 | +171% |
| Average tokens per eval | 109,875 | 297,717 | +171% |
Bonus stat: the improver was not free either. According to `results/reasoning-improver.md`, Codex spent about 58k tokens and 18m 48s to reach the final passing setup.
```mermaid
flowchart LR
    A["Before<br/>0 / 3 passing<br/>286,552 ms<br/>329,625 tokens"] --> B["After<br/>3 / 3 passing<br/>364,460 ms<br/>893,151 tokens"]
```
```mermaid
xychart-beta
    title "Total Runtime"
    x-axis ["Before", "After"]
    y-axis "Milliseconds" 0 --> 400000
    bar [286552, 364460]
```
```mermaid
xychart-beta
    title "Total Tokens Used"
    x-axis ["Before", "After"]
    y-axis "Tokens" 0 --> 950000
    bar [329625, 893151]
```
So the ending is not "the prompt got better and everything got better." The real ending is:
- quality went up a lot
- cost also went up a lot
- the evaluation became more honest
That is a useful outcome. It gives you an actual tradeoff instead of a nice story.
- Small prompt changes can materially improve reliability when they target a specific failure mode.
- Better instructions were not enough by themselves. Codex also had to fix the judge.
- Running the same eval 3 times was important. A single lucky pass would have hidden the instability.
- The final setup is more reliable, but it also burns more time and tokens. Improvement is not free.
This is a tiny example of a much bigger idea: agents can improve other agents when they have a feedback loop, a test harness, and an audit trail.
That is the inspiring part. Prompt engineering stops being "try some words and hope" and starts looking more like real software work:
- observe
- change one thing
- measure
- keep the evidence
Codex also left one more good lesson behind: if your results look strange, check the evaluator before you blame the model. Sometimes the smartest improvement is fixing the experiment itself.