Status: Experimental / Alpha v0.1 — research artifact, not a production system. Expect breaking changes, synthetic-data-only validation, and known rough edges. See Known Limitations before building on it.
PDM is a behavioral prediction oracle — a small model that takes a stream of structured events (emails, CI failures, calendar reminders, PR reviews, code commits...) and predicts what the user will do next, how an assistant should respond, and when to proactively intervene.
It is not a chat model. It is not a language model. It consumes structured event dicts, not text, and its output is a set of predictions over small fixed vocabularies (50 actions, 5 routing lanes, etc.).
The interesting thing about this repo isn't the model — it's the autoresearch journey that shaped it. A tiny Karpathy-style autoresearch loop, run against an immutable evaluation harness, led us to some counterintuitive conclusions:
- A 10 MB encoder beats a 4 GB Gemma-backbone model at this task by about 30% on the composite score (0.422 vs 0.325).
- Data quality matters more than model size.
- Pooling choice (mean vs causal + last-token) is an architecture decision hiding in what looks like an implementation detail.
- An immutable `prepare.py` eval harness is the single most important design decision for keeping the research honest.
Seven rounds of autoresearch and deliberate architecture changes:
| Round | Change | Composite | Checkpoint | Notes |
|---|---|---|---|---|
| 1 | Backbone (Gemma 3 1B + LoRA), 22 experiments | 0.325 | 4 GB | batch_16 winner |
| 2 | Encoder-only ablation | 0.395 | 10 MB | +21.5% composite, 400× smaller |
| 3 | Coherent synthetic data (event_type ↔ action) | 0.410 | 10 MB | Fixed a generator bug |
| 4 | 730-day dataset (2× data) | 0.420 | 10 MB | ECE 0.053 → 0.021 |
| 5 | Causal attention + last-token pooling | 0.423 | 10 MB | ECE 0.021 → 0.011 |
| 6 | Curriculum windows [5, 16, 32, 64, 128] | 0.408 | 10 MB | Short-scenario generalization |
| 7 | Proactive labels + DelayHead + class-weighted routing | 0.422 | 10 MB | All three previously-dead metrics revived |
The full story is in docs/AUTORESEARCH_JOURNEY.md. It's written to be read — not just a log, but a narrative of what we tried, what failed, and what the loop taught us about our own assumptions.
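The ECE figures in the Notes column refer to expected calibration error. How `prepare.py` bins predictions is not stated in this README, so the following is only a minimal sketch of the standard equal-width-bin definition. The function name and the 10-bin default are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|.

    Assumption: equal-width confidence bins; prepare.py may bin differently.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A model that says "80% confident" and is right half the time contributes a gap of 0.3 for that bin; a perfectly calibrated model scores 0.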
Six specialized heads, all scored against a frozen prepare.py harness:
| Head | Task | Output |
|---|---|---|
| `next_action` | Top-k next actions | Softmax over 50 actions (reply_email, check_ci, fix_bug, ...) |
| `routing` | Response lane | Softmax over 5 lanes: ask / suggest / draft / act / escalate |
| `proactive` | Should intervene? | Binary + calibrated probability |
| `delay` | Time until next action | Non-negative minutes (softplus regression) |
| `relevance` | Memory artifact ranking | Pairwise scoring |
| `calibration` | Confidence calibration | Temperature per head |
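How raw head outputs might be turned into the values listed in the table can be sketched in plain Python. This is an illustration under assumptions, not the code in `models/prediction_heads.py`: the helper names and the example logits are hypothetical, and only the transforms named in the table (softmax, sigmoid, softplus, a per-head temperature) are taken from it.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a temperature above 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return (index, probability) pairs for the k most likely classes."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

def sigmoid(x):
    """Logistic function for a binary head, written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softplus(x):
    """log(1 + exp(x)) in a stable form; the result is never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Made-up logits for the 5 routing lanes, purely for illustration.
routing_logits = [2.0, 1.0, 0.5, 0.1, -1.0]
routing_probs = softmax(routing_logits, temperature=1.5)
best_two_lanes = top_k(routing_probs, k=2)
delay_minutes = softplus(-3.0)  # small but still non-negative
```

Softplus is a natural fit for a "minutes until next action" output because it maps any real number to a non-negative value while staying differentiable.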
A weighted composite score combines the five evaluated heads:
```
composite = 0.30·next_action_top1 + 0.25·routing_accuracy
          + 0.20·relevance_precision + 0.15·proactive_acceptance
          + 0.10·(1 - calibration_error)
```
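The formula above translates directly into a few lines of Python. The sketch below assumes each metric arrives as a float in [0, 1] in a dict keyed by the names used in the formula; how `prepare.py` actually computes and stores those metrics is not shown here, and the example inputs are illustrative, not measured results.

```python
# Weights copied from the composite formula; they sum to 1.0.
COMPOSITE_WEIGHTS = {
    "next_action_top1": 0.30,
    "routing_accuracy": 0.25,
    "relevance_precision": 0.20,
    "proactive_acceptance": 0.15,
}
CALIBRATION_WEIGHT = 0.10

def composite_score(metrics):
    """Weighted composite; every metric is expected to lie in [0, 1]."""
    score = sum(w * metrics[name] for name, w in COMPOSITE_WEIGHTS.items())
    # Calibration error is "lower is better", so it enters as (1 - error).
    score += CALIBRATION_WEIGHT * (1.0 - metrics["calibration_error"])
    return score

# Illustrative inputs only (not results from this repo).
example = {
    "next_action_top1": 0.15,
    "routing_accuracy": 0.35,
    "relevance_precision": 0.50,
    "proactive_acceptance": 0.50,
    "calibration_error": 0.011,
}
print(round(composite_score(example), 4))
```

Because calibration contributes only through `0.10·(1 - error)`, even a large relative drop in calibration error moves the composite by a few thousandths, which is worth keeping in mind when reading the round-by-round table.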
```bash
# 1. Clone and install
git clone https://github.com/YOUR_ORG/pdm
cd pdm
pip install -e ".[dev]"

# 2. Generate synthetic training data (deterministic, seed=42)
mkdir -p datasets  # datasets/ is gitignored, so it may not exist in a fresh clone
python -c "
from training.synthetic_data import KnowledgeWorkerPersona
import json
for days, name in [(30,'30d'),(90,'90d'),(365,'365d'),(730,'730d')]:
    events = KnowledgeWorkerPersona(seed=42).generate(days=days)
    with open(f'datasets/synthetic_{name}.jsonl', 'w') as f:
        for e in events:
            f.write(json.dumps(e) + '\n')
"

# 3. Train (encoder-only + curriculum; ~10 min on MPS/CUDA, ~30 min on CPU)
python train.py

# 4. Start the playground + sidecar
uvicorn pdm_service.main:app --port 8787
# → http://127.0.0.1:8787/playground — interactive side-by-side model comparison
# → http://127.0.0.1:8787/docs — OpenAPI docs for /predict, /rank, /route

# 5. Run the autoresearch loop (~3 hours for 20 experiments)
python autoresearch.py
```

PDM ships with a web UI at http://127.0.0.1:8787/playground for interactively
comparing any two model checkpoints on predefined scenarios:
- 6 predefined scenarios (ci_failure, morning_email, deep_coding, afternoon_meetings, pr_review, end_of_day) — each anchored to a plausible time-of-day
- Side-by-side next-action, routing, and proactive predictions
- Dropdown to pick any two checkpoints: try `best_model_round1_backbone` vs `best_model_round7` to see the 30% composite improvement in practice
```
pdm/
├── README.md — You are here
├── LICENSE — Apache 2.0
├── CHANGELOG.md — Version history
├── CONTRIBUTING.md — How to contribute, autoresearch workflow
├── program.md — Autoresearch rules (what can/can't change)
├── pyproject.toml
│
├── train.py — Consolidated training script (encoder-only default)
├── autoresearch.py — Karpathy-style hyperparameter sweep
├── prepare.py — IMMUTABLE evaluation harness
├── compare.py — Model comparison CLI + SCENARIOS dict
├── results.tsv — Experiment leaderboard
│
├── models/
│   ├── event_encoder.py — 2-layer transformer, causal + last-token pooling
│   ├── prediction_heads.py — 6 heads: next_action, routing, proactive, delay, relevance, calibration
│   ├── pdm_model.py — Wires everything together
│   └── backbone.py — Optional Gemma 3 1B + LoRA (not recommended)
│
├── training/
│   ├── synthetic_data.py — KnowledgeWorkerPersona + label rules
│   ├── dataset.py — PdmDataset, window encoding
│   └── trainer.py — compute_loss, training loop utilities
│
├── pdm_service/
│   ├── main.py — FastAPI app
│   ├── api/
│   │   ├── routes_predict.py — /predict, /rank, /route
│   │   ├── routes_train.py — /train async background job
│   │   ├── routes_evaluate.py — /evaluate
│   │   └── routes_playground.py — /playground HTML + /api/compare
│   └── inference/
│       ├── model_registry.py — Model loading + A/B comparison cache
│       └── predictor.py — Inference wrapper
│
├── tests/ — 61 tests covering model, data, API, training
├── docs/
│   ├── AUTORESEARCH_JOURNEY.md — THE narrative — read this
│   └── ML_CONCEPTS.md — LoRA, epochs, composite score explained
├── datasets/ — Synthetic JSONL (gitignored, regenerated)
├── checkpoints/ — Model weights (gitignored)
└── eval_sets/ — Held-out evaluation data
```
```
128 events → [Event Encoder]   2-layer transformer, causal attention
                   │           last-token pooling
                   ↓
          [6 Prediction Heads]
                   │
                   ├─→ next_action  (50-way softmax)
                   ├─→ routing      (5-way softmax, class-weighted)
                   ├─→ proactive    (binary sigmoid)
                   ├─→ delay        (softplus regression, minutes)
                   ├─→ relevance    (pairwise scoring)
                   └─→ calibration  (temperature per head)
```
Total parameters: ~2.7M trainable, checkpoint size: ~10 MB, load time: instant.
An optional Gemma 3 1B backbone path exists (`USE_BACKBONE = True` in
`train.py`) but is not recommended: the encoder-only path scores about 30%
higher on the composite (0.422 vs 0.325), while the backbone uses 4 GB of RAM.
It's kept for reproducibility of the Round 1 experiments.
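Why the pooling choice interacts with causal attention can be shown with a small dependency-free sketch. It is not the code in `models/event_encoder.py`; the function names are hypothetical and plain lists of floats stand in for tensors.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def mean_pool(hidden_states):
    """Average all per-event vectors into one sequence vector."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(vec[d] for vec in hidden_states) / n for d in range(dim)]

def last_token_pool(hidden_states):
    """Use only the final event's vector as the sequence summary."""
    return list(hidden_states[-1])

mask = causal_mask(4)
# Number of positions each event can see under the causal mask: 1, 2, 3, 4.
visible = [sum(row) for row in mask]

# Toy per-event hidden states (4 events, 2 dimensions), for illustration only.
states = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [6.0, 5.0]]
pooled_mean = mean_pool(states)
pooled_last = last_token_pool(states)
```

Under a causal mask only the final position has attended to the whole window, so last-token pooling reads the one state that has seen every event. Mean pooling also averages in early positions that saw little context, which is one plausible reading of why the two choices behave differently.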
PDM uses a Karpathy-style research loop:
1. `prepare.py` defines the metric and is immutable
2. `autoresearch.py` iterates over experiments (defined as `Experiment` dataclass instances)
3. For each experiment: rewrite `train.py` config from a template, run `python train.py`, parse result
4. Log to `results.tsv`; keep the checkpoint if composite improved
5. Restore `train.py`; move to next experiment
See program.md for the formal rules of what can and cannot
be modified inside an autoresearch run.
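The loop described above can be sketched as follows. This is a simplified illustration, not the contents of `autoresearch.py`: the config template, the `learning_rate` field, and the `run_training` callback are hypothetical, and "keep the checkpoint" is reduced to remembering the best experiment. In the real loop the callback would run `python train.py` and parse its output, whose format is not documented here.

```python
import csv
from dataclasses import asdict, dataclass
from pathlib import Path

@dataclass(frozen=True)
class Experiment:
    name: str
    use_backbone: bool = False
    use_curriculum: bool = True
    learning_rate: float = 1e-3  # hypothetical axis, for illustration

# Hypothetical template; the real train.py template is not shown in this README.
CONFIG_TEMPLATE = (
    "USE_BACKBONE = {use_backbone}\n"
    "USE_CURRICULUM = {use_curriculum}\n"
    "LEARNING_RATE = {learning_rate}\n"
)

def run_sweep(experiments, workdir, run_training):
    """Run each experiment in turn.

    run_training(train_path) must train with the rewritten config and
    return the composite score as a float.
    """
    train_path = Path(workdir) / "train.py"
    results_path = Path(workdir) / "results.tsv"
    original = train_path.read_text()
    best_name, best_score = None, float("-inf")
    with results_path.open("a", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for exp in experiments:
            # Rewrite the config from the template (unused fields are ignored).
            train_path.write_text(CONFIG_TEMPLATE.format(**asdict(exp)))
            try:
                score = run_training(train_path)
            finally:
                # Always restore train.py, even if training crashed.
                train_path.write_text(original)
            writer.writerow([exp.name, f"{score:.4f}"])
            if score > best_score:
                best_name, best_score = exp.name, score
    return best_name, best_score
```

Restoring `train.py` in a `finally` block is a design choice of this sketch: a crashed experiment should not leave the next one running against a half-rewritten config.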
Important: the loop only explores axes you put in it. Round 1 missed the
entire USE_BACKBONE = False branch because the template hardcoded it to
True. We only discovered encoder-only was better when we ablated the axis
manually. The current template in Round 2+ has use_backbone and
use_curriculum as first-class Experiment fields. Lesson: design your
search space to include the architectural assumptions you'd otherwise miss.
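One way to guard against a hardcoded axis is to check the experiment grid itself before launching a sweep. The sketch below is an assumption-laden illustration: `use_backbone` and `use_curriculum` come from the text above, while `batch_size`, `build_grid`, and `unexplored_axes` are hypothetical names that do not necessarily exist in `autoresearch.py`.

```python
from dataclasses import dataclass, fields
from itertools import product

@dataclass(frozen=True)
class Experiment:
    use_backbone: bool
    use_curriculum: bool
    batch_size: int  # hypothetical extra axis

def build_grid(backbone_options=(True, False),
               curriculum_options=(True, False),
               batch_sizes=(16, 32)):
    """Cartesian product of every axis, so no option is silently fixed."""
    return [
        Experiment(b, c, bs)
        for b, c, bs in product(backbone_options, curriculum_options, batch_sizes)
    ]

def unexplored_axes(experiments):
    """Fields that take a single value across the grid.

    The sweep cannot learn anything about these axes.
    """
    frozen = []
    for f in fields(Experiment):
        values = {getattr(exp, f.name) for exp in experiments}
        if len(values) <= 1:
            frozen.append(f.name)
    return frozen

# A Round 1 style grid with the backbone hardcoded on:
round1_like = build_grid(backbone_options=(True,))
print(unexplored_axes(round1_like))
```

Running a check like this before each sweep surfaces the architectural assumptions that the template has baked in, so leaving an axis fixed becomes a visible decision.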
- Synthetic data only. All 7 rounds are evaluated on events generated by `KnowledgeWorkerPersona`, a deterministic persona simulator. No real user data has been tested. How the patterns transfer to real event streams is an open question.
- Next-action top-1 is ~15%. For a 50-class problem that's 7× random (~2%) but still leaves a lot of headroom. A bigger encoder, longer training, or retrieval-augmented variant could probably push this higher.
- Routing accuracy is ~0.35. Class-weighted cross-entropy (sqrt-inverse frequency) improved this from near-majority-class collapse, but rare classes (`escalate`, `ask`) still have limited training signal.
- Timing head is new in Round 7. It learns the regression but the absolute delay predictions are noisy. Treat the `timing_accuracy` metric as a directional signal, not a ground truth.
- Proactive head is new in Round 7. Labels are synthesized from event importance; real acceptance patterns are unknown.
- `prepare.py` amendment in Round 7. We made one deliberate amendment to the immutable harness (read path changed to use `DelayHead` output). Metric formulas and composite weights stayed identical. See the journey doc for rationale.
- No CI/CD. Tests exist but there's no GitHub Actions workflow yet.
- No PyPI release. `pyproject.toml` is set up for local dev installs.
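The sqrt-inverse-frequency weighting mentioned for the routing head can be sketched in plain Python. This is not the code in `training/trainer.py`: the function names are hypothetical, the normalisation to a mean weight of 1 is an assumption, and the lane counts are made up for illustration.

```python
import math
from collections import Counter

def sqrt_inverse_frequency_weights(labels, classes):
    """Weight each class by 1 / sqrt(relative frequency), normalised to mean 1."""
    counts = Counter(labels)
    total = len(labels)
    raw = {}
    for cls in classes:
        # Treat unseen classes as seen once to avoid dividing by zero (assumption).
        freq = max(counts[cls], 1) / total
        raw[cls] = 1.0 / math.sqrt(freq)
    mean_weight = sum(raw.values()) / len(raw)
    return {cls: w / mean_weight for cls, w in raw.items()}

def weighted_cross_entropy(probs, target, weights):
    """Loss for one example: -w[target] * log p(target)."""
    return -weights[target] * math.log(probs[target])

# Made-up lane distribution: "suggest" dominates, "escalate" and "ask" are rare.
lanes = ["ask", "suggest", "draft", "act", "escalate"]
labels = ["suggest"] * 64 + ["draft"] * 16 + ["act"] * 16 + ["ask"] * 4 + ["escalate"] * 1
weights = sqrt_inverse_frequency_weights(labels, lanes)
```

Taking the square root softens plain inverse-frequency weighting: here `escalate` is 64 times rarer than `suggest` but is weighted only 8 times more heavily, which limits how much a handful of rare examples can dominate the loss.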
```bash
# Run all tests (61 tests, ~3 seconds)
pytest tests/ -v

# Train a single experiment
python train.py

# Run the full autoresearch sweep (~3 hours, 20 experiments)
python autoresearch.py

# Start the sidecar with auto-reload for development
uvicorn pdm_service.main:app --reload --port 8787

# Compare models via CLI
python compare.py --scenario ci_failure
python compare.py --all-scenarios
```

See CONTRIBUTING.md for the full contributor guide.
Apache 2.0. See LICENSE.
- Gemma 3 1B (Google): used as a feature-extraction backbone in the Round 1 experiments. The Round 2 encoder-only ablation showed it was not the right choice for this task, and that ablation was the most valuable outcome of the whole project.
- The Karpathy-style autoresearch pattern: iterating on training recipes against a frozen evaluation harness. This repo is an existence proof that the pattern works on small-scale problems.