Standalone autonomous agent/prompt optimisation loop. No Claude Code dependency.
Inspired by Karpathy's autoresearch pattern. Model-agnostic via OpenRouter. MIT-licensed. One file. One metric. One loop.
Set the GOAL → Script runs the LOOP → You wake up to better agents
LOOP (forever or N times):
1. Read current agent config (agent.yaml)
2. Ask optimiser LLM: "Analyse failures, propose ONE change"
3. Validate the proposed YAML + check constraints
4. Apply change → run eval suite → measure pass rate
5. If improved → git commit, update baseline
6. If worse → git reset, try something else
7. Repeat
The three primitives:
- Editable asset →
agent.yaml(the system prompt / config the optimiser modifies) - Scalar metric → eval pass rate from
evals.yaml(binary assertions) - Git as memory → every improvement is a commit, every regression is a revert
# 1. Create workspace
mkdir my-agent-lab && cd my-agent-lab && git init
# 2. Copy the script
cp /path/to/autoresearch_agents.py .
# 3. Scaffold default files (creates agent.yaml, evals.yaml, program.md)
python autoresearch_agents.py --loops 0
# 4. Edit your files (see examples/ below)
# 5. Set API key and run
export OPENROUTER_API_KEY=sk-or-...
python autoresearch_agents.py --loops 50| Provider | Env Variable | Default Model | Notes |
|---|---|---|---|
openrouter |
OPENROUTER_API_KEY |
anthropic/claude-sonnet-4 |
Recommended — access to all models |
anthropic |
ANTHROPIC_API_KEY |
claude-sonnet-4-20250514 |
Direct Anthropic API |
openai |
OPENAI_API_KEY |
gpt-4o |
Direct OpenAI API |
Pro tip: Use a stronger model as the optimiser and a cheaper model as the agent-under-test:
python autoresearch_agents.py \
--model anthropic/claude-opus-4 \
--agent-model anthropic/claude-haiku-4 \
--loops 50my-agent-lab/
├── agent.yaml ← THE EDITABLE ASSET (optimiser modifies this)
├── evals.yaml ← Binary eval assertions (read-only)
├── program.md ← Research instructions & constraints (read-only)
├── autoresearch_agents.py ← This script
└── results.jsonl ← Generated: append-only iteration log
Assertion types:
| Type | Description | Example |
|---|---|---|
contains |
Output must contain value | {type: contains, value: "Paris"} |
not_contains |
Output must NOT contain value | {type: not_contains, value: "I don't know"} |
contains_any |
Output must contain at least one | {type: contains_any, values: ["yes", "correct"]} |
contains_all |
Output must contain all values | {type: contains_all, values: ["CO2", "ocean"]} |
max_length |
Output must be under N chars | {type: max_length, value: 500} |
min_length |
Output must be over N chars | {type: min_length, value: 100} |
starts_with |
Output must start with value | {type: starts_with, value: "Here"} |
regex |
Output must match regex pattern | {type: regex, value: "\\d{4}"} |
- Lines marked
# CONSTRAINT:in agent.yaml are never removed - evals.yaml is never modified by the optimiser
- program.md defines boundaries the optimiser must respect
- Git revert on any regression — your best config is always recoverable
- Stagnation detection stops the loop if no progress after N iterations
pip install pyyaml httpx
That's it. No frameworks. No lock-in.
MIT