
raphaelpor/self-improving-agentic-workflow


Self-Improving Agent Experiment


Can one agent help another agent get better, one commit at a time?

This repo is a tiny prompt-engineering lab with a simple setup:

  • Codex (gpt-5.4, reasoning xhigh) is the improver.
  • GitHub Copilot CLI (gpt-5-mini) is the agent under test.
  • GitHub MCP tools provide live repository issues.
  • Katt runs the eval three times and records pass/fail, time, and token usage.

The goal is intentionally small: ask Copilot CLI to list the 5 most recent GitHub issues for a repo, then let Codex improve the instructions step by step until the answer becomes reliable.
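For concreteness, here is a minimal Node.js sketch of what a correct answer has to compute. The `number`, `title`, and `created_at` fields are GitHub's standard issue fields; the tie-break on issue number is an assumption for illustration, and the actual rules live in task.prompt.md.

```javascript
// Pick the n most recent issues, newest first.
// Assumes each issue carries GitHub's standard `number`, `title`,
// and `created_at` (ISO 8601) fields.
// Tie-break on higher issue number is an assumption, not the repo's rule.
function mostRecentIssues(issues, n = 5) {
  return [...issues]
    .sort(
      (a, b) =>
        new Date(b.created_at) - new Date(a.created_at) || b.number - a.number
    )
    .slice(0, n)
    .map(({ number, title, created_at }) => ({ number, title, created_at }));
}
```

The deterministic tie-break matters: without it, two issues created in the same second can swap order between runs, which is exactly the kind of instability a three-run eval surfaces.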

This makes the experiment easy to audit:

  • Git shows every prompt change.
  • results/ keeps the before/after evidence.
  • The eval runs 3 times to expose randomness instead of hiding it.


The Loop

Codex was not trying to be magical. It was doing engineering:

  1. Read the failure.
  2. Make one focused change.
  3. Commit it.
  4. Run the eval again.
  5. Keep what helps. Fix what does not.

```mermaid
flowchart TD
  A["Read self-improvment.prompt.md"] --> B["Inspect results and failures"]
  B --> C["Edit task.prompt.md or task.eval.js"]
  C --> D["Commit the change"]
  D --> E["Run Katt against Copilot CLI"]
  E --> F["Measure pass/fail, time, tokens"]
  F --> G{"Better?"}
  G -->|Yes| H["Keep the change and continue"]
  G -->|No| B
  H --> B
```
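The loop can be sketched as a tiny hill climb. In the real experiment the eval is Katt running Copilot CLI and the edits are Codex commits; here both are stubbed as callbacks, so the function names (`improvementLoop`, `proposeChange`, `revertChange`) are illustrative only.

```javascript
// Toy version of the improve -> eval -> keep/revert loop.
// runEval returns e.g. { passed: 0, total: 3 }; proposeChange makes
// one focused edit and commits it; revertChange undoes it.
function improvementLoop({ runEval, proposeChange, revertChange, maxRounds = 10 }) {
  let best = runEval(); // baseline measurement
  const history = [{ ...best, kept: true }];
  for (let round = 0; round < maxRounds && best.passed < best.total; round++) {
    proposeChange(); // one focused change, committed
    const result = runEval(); // run the eval suite again
    const better = result.passed > best.passed;
    if (better) best = result; // keep what helps
    else revertChange(); // fix what does not
    history.push({ ...result, kept: better });
  }
  return { best, history };
}
```

The `history` array plays the role Git plays in the real experiment: every attempt is recorded, including the ones that were rolled back.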

In short: one agent coached another, and Git acted like the lab notebook.

What Codex Changed

The interesting part is that Codex did not just keep piling on more prompt text. It discovered that there were two different problems:

  • The tested agent needed clearer instructions.
  • The evaluator itself was brittle because it compared live GitHub data against stale snapshots.

That is the real story of the experiment.

```mermaid
gitGraph
  commit id: "bc096d4" tag: "baseline"
  commit id: "df9761c" tag: "format"
  commit id: "5ca4ce4" tag: "ranking"
  commit id: "422c325" tag: "live-eval"
  commit id: "93df064" tag: "live-data"
  commit id: "fca11b8" tag: "evidence"
```
| Commit | What changed | Why it mattered |
| --- | --- | --- |
| `bc096d4` | Baseline experiment setup | A minimal prompt created a clean starting point. |
| `df9761c` | Forced exact JSON output: exact fields, exact count, no extra text | Attacked the first failure mode: Copilot answered conversationally instead of matching the expected schema. |
| `5ca4ce4` | Added explicit fetch depth, sorting, and tie-breaking rules | Reduced ranking mistakes when issues had very similar timestamps. |
| `422c325` | Switched the eval from stale snapshots to live GitHub issue validation and pinned the repo source | The big insight: sometimes the test is wrong, not just the agent. |
| `93df064` | Forced live MCP issue data, blocked local-file shortcuts, and required unlabeled issues to be included | Stopped the model from falling back to stale examples inside the repo. |
| `fca11b8` | Saved the before/after runs and the Codex reasoning log | Preserved the experiment as evidence rather than just a final state. |
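The `422c325` fix can be sketched roughly as follows: derive the expected answer from live issue data at eval time instead of comparing against a checked-in snapshot, and reject any output that is not pure JSON. The function name `validateAgainstLive` and its return shape are illustrative, not the actual task.eval.js code.

```javascript
// Snapshot-free check: compute the expected answer from live issue
// data fetched at eval time, then compare the agent's JSON output.
// Names and return shape are illustrative, not task.eval.js.
function validateAgainstLive(agentOutput, liveIssues, n = 5) {
  const expected = [...liveIssues]
    .sort((a, b) => new Date(b.created_at) - new Date(a.created_at))
    .slice(0, n)
    .map((i) => i.number);
  let parsed;
  try {
    parsed = JSON.parse(agentOutput); // fails if the agent added extra prose
  } catch {
    return { pass: false, reason: "output is not pure JSON" };
  }
  const got = parsed.map((i) => i.number);
  const pass =
    got.length === expected.length && got.every((num, i) => num === expected[i]);
  return { pass, reason: pass ? "match" : "issue list mismatch" };
}
```

Because `expected` is recomputed on every run, the eval can never go stale, which is precisely what the snapshot-based version could not guarantee.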

Before vs After

The final result was better reliability, but not cheaper execution.

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Passing evals | 0 / 3 | 3 / 3 | +3 |
| Total runtime | 286,552 ms | 364,460 ms | +27% |
| Average runtime per eval | 95.5 s | 121.5 s | +26.0 s |
| Total tokens used | 329,625 | 893,151 | +171% |
| Average tokens per eval | 109,875 | 297,717 | +171% |

Bonus stat: the improver was not free either. According to results/reasoning-improver.md, Codex spent about 58k tokens and 18m 48s to reach the final passing setup.
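The change column above can be re-derived from the raw totals, which is a quick sanity check on the table:

```javascript
// Recompute the before/after deltas from the raw numbers in the table.
const before = { runtimeMs: 286552, tokens: 329625, runs: 3 };
const after = { runtimeMs: 364460, tokens: 893151, runs: 3 };

const pctChange = (a, b) => Math.round(((b - a) / a) * 100);

const deltas = {
  runtimePct: pctChange(before.runtimeMs, after.runtimeMs), // +27 (%)
  tokensPct: pctChange(before.tokens, after.tokens), // +171 (%)
  // Extra runtime per eval, in seconds, rounded to one decimal: +26.0 s
  avgRuntimeDeltaS:
    Math.round(((after.runtimeMs - before.runtimeMs) / after.runs / 1000) * 10) / 10,
};
```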

```mermaid
flowchart LR
  A["Before<br/>0 / 3 passing<br/>286,552 ms<br/>329,625 tokens"] --> B["After<br/>3 / 3 passing<br/>364,460 ms<br/>893,151 tokens"]
```
```mermaid
xychart-beta
  title "Total Runtime"
  x-axis ["Before", "After"]
  y-axis "Milliseconds" 0 --> 400000
  bar [286552, 364460]
```
```mermaid
xychart-beta
  title "Total Tokens Used"
  x-axis ["Before", "After"]
  y-axis "Tokens" 0 --> 950000
  bar [329625, 893151]
```

So the ending is not "the prompt got better and everything got better." The real ending is:

  • quality went up a lot
  • cost also went up a lot
  • the evaluation became more honest

That is a useful outcome. It gives you an actual tradeoff instead of a nice story.

Key Takeaways

  • Small prompt changes can materially improve reliability when they target a specific failure mode.
  • Better instructions were not enough by themselves. Codex also had to fix the judge.
  • Running the same eval 3 times was important. A single lucky pass would have hidden the instability.
  • The final setup is more reliable, but it also burns more time and tokens. Improvement is not free.

Why This Experiment Matters

This is a tiny example of a much bigger idea: agents can improve other agents when they have a feedback loop, a test harness, and an audit trail.

That is the inspiring part. Prompt engineering stops being "try some words and hope" and starts looking more like real software work:

  • observe
  • change one thing
  • measure
  • keep the evidence

Codex also left one more good lesson behind: if your results look strange, check the evaluator before you blame the model. Sometimes the smartest improvement is fixing the experiment itself.

About

Experiment for self-improving agents using Katt.
