Can one agent help another agent get better, one commit at a time?
This repo is a tiny prompt-engineering lab with a simple setup:
- **Codex** (gpt-5.4, reasoning: high) is the improver.
- **GitHub Copilot CLI** (gpt-5-mini) is the agent under test.
- **GitHub MCP tools** provide live repository issues.
- **Katt** runs the eval three times and records pass/fail, time, and token usage.
The goal is intentionally small: ask Copilot CLI to list the 5 most recent GitHub issues for a repo, then let Codex improve the instructions step by step until the answer becomes reliable.
This makes the experiment easy to audit:
- Git shows every prompt change.
- `results/` keeps the before/after evidence.
- The eval runs 3 times to expose randomness instead of hiding it.
Codex was not trying to be magical. It was doing engineering:
- Read the failure.
- Make one focused change.
- Commit it.
- Run the eval again.
- Keep what helps. Fix what does not.
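The loop above can be sketched in a few lines of JavaScript. Everything here is a hypothetical stand-in, not the real Katt or Codex interface: `runEval`, `applyChange`, and the shape of their results are assumptions made for illustration.

```javascript
// Hypothetical sketch of the improve-measure loop.
// runEval() returns e.g. { passes: 0, runs: 3 }; applyChange() makes
// one focused, committed edit and returns a handle that can revert it.
function improveUntilReliable(runEval, applyChange, maxIterations = 10) {
  let best = runEval(); // baseline measurement
  for (let i = 0; i < maxIterations && best.passes < best.runs; i++) {
    const change = applyChange(); // one focused change, committed
    const result = runEval();     // run the eval again
    if (result.passes > best.passes) {
      best = result;              // keep what helps
    } else {
      change.revert();            // fix (undo) what does not
    }
  }
  return best;
}
```

The point of the sketch is the shape, not the API: a baseline, one change at a time, a measurement after each change, and an explicit keep-or-revert decision.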
```mermaid
flowchart TD
    A["Read self-improvment.prompt.md"] --> B["Inspect results and failures"]
    B --> C["Edit task.prompt.md or task.eval.js"]
    C --> D["Commit the change"]
    D --> E["Run Katt against Copilot CLI"]
    E --> F["Measure pass/fail, time, tokens"]
    F --> G{"Better?"}
    G -->|Yes| H["Keep the change and continue"]
    G -->|No| B
    H --> B
```
In short: one agent coached another, and Git acted like the lab notebook.
The interesting part is that Codex did not just keep piling on more prompt text. It discovered that there were two different problems:
- The tested agent needed clearer instructions.
- The evaluator itself was brittle because it compared live GitHub data against stale snapshots.
That is the real story of the experiment.
```mermaid
gitGraph
    commit id: "bc096d4" tag: "baseline"
    commit id: "df9761c" tag: "format"
    commit id: "5ca4ce4" tag: "ranking"
    commit id: "422c325" tag: "live-eval"
    commit id: "93df064" tag: "live-data"
    commit id: "fca11b8" tag: "evidence"
```
| Commit | What changed | Why it mattered |
|---|---|---|
| `bc096d4` | Baseline experiment setup | A minimal prompt created a clean starting point. |
| `df9761c` | Forced exact JSON output, exact fields, exact count, and no extra text | This attacked the first failure mode: Copilot answered conversationally instead of matching the expected schema. |
| `5ca4ce4` | Added explicit fetch depth, sorting, and tie-breaking rules | This reduced ranking mistakes when issues had very similar timestamps. |
| `422c325` | Switched the eval from stale snapshots to live GitHub issue validation and pinned the repo source | This was the big insight: sometimes the test is wrong, not just the agent. |
| `93df064` | Forced live MCP issue data, blocked local-file shortcuts, and required unlabeled issues to be included | This stopped the model from falling back to stale examples inside the repo. |
| `fca11b8` | Saved the before/after runs and Codex reasoning log | This preserved the experiment as evidence instead of just a final state. |
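To make the schema and ranking commits concrete, here is what a strict check in their spirit might look like. This is a sketch, not the actual task.eval.js: the field names, the count of five, and the number-based tie-breaker are assumptions drawn from the table above.

```javascript
// Hypothetical strict eval check in the spirit of df9761c and 5ca4ce4.
// output: the agent's raw answer; liveIssues: issues fetched live.
function checkIssueList(output, liveIssues) {
  let parsed;
  try {
    parsed = JSON.parse(output); // no extra text allowed around the JSON
  } catch {
    return false;                // conversational answers fail immediately
  }
  if (!Array.isArray(parsed) || parsed.length !== 5) return false;
  // Exact fields, nothing more, nothing less (assumed field set).
  const fields = ["created_at", "number", "title"];
  for (const issue of parsed) {
    if (Object.keys(issue).sort().join() !== fields.join()) return false;
  }
  // Compare against live data: newest first, with the issue number as a
  // tie-breaker when timestamps match.
  const expected = [...liveIssues]
    .sort((a, b) =>
      b.created_at.localeCompare(a.created_at) || b.number - a.number)
    .slice(0, 5);
  return expected.every((e, i) => parsed[i].number === e.number);
}
```

Validating against freshly fetched issues rather than a stored snapshot is exactly the shift the `422c325` commit describes: the judge checks reality, not a stale copy of it.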
The final result was better reliability, but not cheaper execution.
| Metric | Before | After | Change |
|---|---|---|---|
| Passing evals | 0 / 3 | 3 / 3 | +3 |
| Total runtime | 286,552 ms | 364,460 ms | +27% |
| Average runtime per eval | 95.5 s | 121.5 s | +26.0 s |
| Total tokens used | 329,625 | 893,151 | +171% |
| Average tokens per eval | 109,875 | 297,717 | +171% |
Bonus stat: the improver was not free either. According to `results/reasoning-improver.md`, Codex spent about 58k tokens and 18m 48s to reach the final passing setup.
```mermaid
flowchart LR
    A["Before<br/>0 / 3 passing<br/>286,552 ms<br/>329,625 tokens"] --> B["After<br/>3 / 3 passing<br/>364,460 ms<br/>893,151 tokens"]
```
```mermaid
xychart-beta
    title "Total Runtime"
    x-axis ["Before", "After"]
    y-axis "Milliseconds" 0 --> 400000
    bar [286552, 364460]
```
```mermaid
xychart-beta
    title "Total Tokens Used"
    x-axis ["Before", "After"]
    y-axis "Tokens" 0 --> 950000
    bar [329625, 893151]
```
So the ending is not "the prompt got better and everything got better." The real ending is:
- quality went up a lot
- cost also went up a lot
- the evaluation became more honest
That is a useful outcome. It gives you an actual tradeoff instead of a nice story.
- Small prompt changes can materially improve reliability when they target a specific failure mode.
- Better instructions were not enough by themselves. Codex also had to fix the judge.
- Running the same eval 3 times was important. A single lucky pass would have hidden the instability.
- The final setup is more reliable, but it also burns more time and tokens. Improvement is not free.
This is a tiny example of a much bigger idea: agents can improve other agents when they have a feedback loop, a test harness, and an audit trail.
That is the inspiring part. Prompt engineering stops being "try some words and hope" and starts looking more like real software work:
- observe
- change one thing
- measure
- keep the evidence
Codex also left one more good lesson behind: if your results look strange, check the evaluator before you blame the model. Sometimes the smartest improvement is fixing the experiment itself.