PDM — Personal Dynamics Model

Status: Experimental / Alpha v0.1 — research artifact, not a production system. Expect breaking changes, synthetic-data-only validation, and known rough edges. See Known Limitations before building on it.

PDM is a behavioral prediction oracle — a small model that takes a stream of structured events (emails, CI failures, calendar reminders, PR reviews, code commits...) and predicts what the user will do next, how an assistant should respond, and when to proactively intervene.

It is not a chat model. It is not a language model. It consumes structured event dicts, not text, and its output is a set of predictions over small fixed vocabularies (50 actions, 5 routing lanes, etc.).
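
To make "structured event dicts" concrete, here is a hypothetical sketch of what one event and one prediction might look like. Only `event_type`, the action names, and the lane names appear in this README; every other field name and value is an assumption made for illustration, not the repo's actual schema.

```python
import json

# Hypothetical event. Only `event_type` is named in this README;
# `timestamp`, `source`, and `importance` are illustrative assumptions.
event = {
    "event_type": "ci_failure",
    "timestamp": "2025-01-15T09:30:00",
    "source": "ci",
    "importance": 0.8,
}

# Events are plain dicts, so each one serializes to a single JSON line
# (the Quick Start writes datasets in this JSONL form).
line = json.dumps(event)
restored = json.loads(line)

# Hypothetical shape of a prediction over the small fixed vocabularies.
prediction = {
    "next_action": ["check_ci", "fix_bug", "reply_email"],  # top-k action names
    "routing": "suggest",  # one of ask / suggest / draft / act / escalate
}
```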

The interesting thing about this repo isn't the model — it's the autoresearch journey that shaped it. A tiny Karpathy-style autoresearch loop, run against an immutable evaluation harness, led us to some counterintuitive conclusions:

A 10 MB encoder beats a 4 GB Gemma-backbone model at this task by about 30% on the composite score (0.422 vs 0.325).
Data quality matters more than model size.
Pooling choice (mean vs causal + last-token) is an architecture decision hiding in what looks like an implementation detail.
An immutable prepare.py eval harness is the single most important design decision for keeping the research honest.
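
The pooling distinction in the list above can be sketched in a few lines of plain Python. This is a toy illustration on nested lists, not the code in models/event_encoder.py; the function names are assumptions.

```python
def mean_pool(states):
    """Average the per-event vectors: every position contributes equally."""
    n = len(states)
    dim = len(states[0])
    return [sum(vec[d] for vec in states) / n for d in range(dim)]


def last_token_pool(states):
    """Return the final position's vector.

    Under causal attention, position i only attends to positions <= i,
    so the last position is the only one that has seen the whole window.
    """
    return list(states[-1])


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Toy encoder output: 3 events, 2-dimensional hidden states.
states = [[1.0, 0.0], [0.0, 1.0], [4.0, 2.0]]
pooled_mean = mean_pool(states)        # blends all three events
pooled_last = last_token_pool(states)  # only the most recent position
mask = causal_mask(3)
```

Mean pooling weights the oldest and newest events equally, while last-token pooling reads a summary from the most recent position, which is why the choice changes what the heads actually see.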

The Scoreboard

Seven rounds of autoresearch and deliberate architecture changes:

Round	Change	Composite	Checkpoint	Notes
1	Backbone (Gemma 3 1B + LoRA), 22 experiments	0.325	4 GB	`batch_16` winner
2	Encoder-only ablation	0.395	10 MB	+21.5% composite, 400× smaller
3	Coherent synthetic data (event_type ↔ action)	0.410	10 MB	Fixed a generator bug
4	730-day dataset (2× data)	0.420	10 MB	ECE 0.053 → 0.021
5	Causal attention + last-token pooling	0.423	10 MB	ECE 0.021 → 0.011
6	Curriculum windows [5, 16, 32, 64, 128]	0.408	10 MB	Short-scenario generalization
7	Proactive labels + DelayHead + class-weighted routing	0.422	10 MB	All three previously-dead metrics revived
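
The ECE figures in the Notes column refer to expected calibration error. The exact definition used by prepare.py is not reproduced in this README, so the following is a minimal sketch of the common equal-width binned variant; the function name and the 10-bin default are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, h in bucket if h) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece


# Four predictions made at 90% confidence, three of them correct:
# the model is overconfident by 0.15 in that bin.
overconfident = expected_calibration_error([0.9] * 4, [True, True, True, False])
```

Lower is better: a value near zero means the stated confidence matches the observed hit rate.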

The full story is in docs/AUTORESEARCH_JOURNEY.md. It's written to be read — not just a log, but a narrative of what we tried, what failed, and what the loop taught us about our own assumptions.

What PDM Predicts

Six specialized heads, all scored against a frozen prepare.py harness:

Head	Task	Output
`next_action`	Top-k next actions	Softmax over 50 actions (reply_email, check_ci, fix_bug, ...)
`routing`	Response lane	Softmax over 5 lanes: `ask / suggest / draft / act / escalate`
`proactive`	Should intervene?	Binary + calibrated probability
`delay`	Time until next action	Non-negative minutes (softplus regression)
`relevance`	Memory artifact ranking	Pairwise scoring
`calibration`	Confidence calibration	Temperature per head
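
Two of the output types in the table are easy to illustrate in isolation: a softmax with a temperature (the calibration head's role) and a softplus that keeps the delay prediction non-negative. This is a stdlib-only sketch with assumed function names, not the code in models/prediction_heads.py.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; higher temperature flattens the output."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softplus(x):
    """log(1 + exp(x)), computed stably; the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=4.0)  # less confident

# A raw regression output of any sign maps to a valid delay in minutes.
delay_minutes = softplus(-3.0)
```

A learned temperature above 1 lowers the top probability without changing the ranking, which is what makes it useful for fixing overconfidence after training.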

A weighted composite score combines five of the six heads (the `delay` head does not appear in the formula):

composite = 0.30·next_action_top1 + 0.25·routing_accuracy
          + 0.20·relevance_precision + 0.15·proactive_acceptance
          + 0.10·(1 - calibration_error)
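
The formula translates directly into code. The sketch below uses the weights and metric names from the formula above; the function name and the input values are hypothetical, and prepare.py remains the authoritative implementation.

```python
# Weights copied from the composite formula; they sum to 1.0.
WEIGHTS = {
    "next_action_top1": 0.30,
    "routing_accuracy": 0.25,
    "relevance_precision": 0.20,
    "proactive_acceptance": 0.15,
}
CALIBRATION_WEIGHT = 0.10


def composite_score(metrics):
    """Weighted sum of the four accuracy-style metrics plus (1 - calibration_error)."""
    score = sum(weight * metrics[name] for name, weight in WEIGHTS.items())
    score += CALIBRATION_WEIGHT * (1.0 - metrics["calibration_error"])
    return score


# Hypothetical inputs chosen for illustration, not measured results.
example = composite_score({
    "next_action_top1": 0.5,
    "routing_accuracy": 0.5,
    "relevance_precision": 0.5,
    "proactive_acceptance": 0.5,
    "calibration_error": 0.0,
})
```

Because calibration enters as `1 - calibration_error`, a perfectly calibrated model earns the full 0.10 even if every other metric is zero, so composite values should be compared against each other rather than against zero.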

Quick Start

# 1. Clone and install
git clone https://github.com/YOUR_ORG/pdm
cd pdm
pip install -e ".[dev]"

# 2. Generate synthetic training data (deterministic, seed=42)
python -c "
from training.synthetic_data import KnowledgeWorkerPersona
import json
for days, name in [(30,'30d'),(90,'90d'),(365,'365d'),(730,'730d')]:
    events = KnowledgeWorkerPersona(seed=42).generate(days=days)
    with open(f'datasets/synthetic_{name}.jsonl', 'w') as f:
        for e in events: f.write(json.dumps(e) + '\n')
"

# 3. Train (encoder-only + curriculum; ~10 min on MPS/CUDA, ~30 min on CPU)
python train.py

# 4. Start the playground + sidecar
uvicorn pdm_service.main:app --port 8787
# → http://127.0.0.1:8787/playground — interactive side-by-side model comparison
# → http://127.0.0.1:8787/docs       — OpenAPI docs for /predict, /rank, /route

# 5. Run the autoresearch loop (~3 hours for 20 experiments)
python autoresearch.py

The Playground

PDM ships with a web UI at http://127.0.0.1:8787/playground for interactively comparing any two model checkpoints on predefined scenarios:

6 predefined scenarios (ci_failure, morning_email, deep_coding, afternoon_meetings, pr_review, end_of_day) — each anchored to a plausible time-of-day
Side-by-side next-action, routing, and proactive predictions
Dropdown to pick any two checkpoints — try best_model_round1_backbone vs best_model_round7 to see the 30% composite improvement in practice

Project Layout

pdm/
├── README.md                    — You are here
├── LICENSE                      — Apache 2.0
├── CHANGELOG.md                 — Version history
├── CONTRIBUTING.md              — How to contribute, autoresearch workflow
├── program.md                   — Autoresearch rules (what can/can't change)
├── pyproject.toml
│
├── train.py                     — Consolidated training script (encoder-only default)
├── autoresearch.py              — Karpathy-style hyperparameter sweep
├── prepare.py                   — IMMUTABLE evaluation harness
├── compare.py                   — Model comparison CLI + SCENARIOS dict
├── results.tsv                  — Experiment leaderboard
│
├── models/
│   ├── event_encoder.py         — 2-layer transformer, causal + last-token pooling
│   ├── prediction_heads.py      — 6 heads: next_action, routing, proactive, delay, relevance, calibration
│   ├── pdm_model.py             — Wires everything together
│   └── backbone.py              — Optional Gemma 3 1B + LoRA (not recommended)
│
├── training/
│   ├── synthetic_data.py        — KnowledgeWorkerPersona + label rules
│   ├── dataset.py               — PdmDataset, window encoding
│   └── trainer.py               — compute_loss, training loop utilities
│
├── pdm_service/
│   ├── main.py                  — FastAPI app
│   ├── api/
│   │   ├── routes_predict.py    — /predict, /rank, /route
│   │   ├── routes_train.py      — /train async background job
│   │   ├── routes_evaluate.py   — /evaluate
│   │   └── routes_playground.py — /playground HTML + /api/compare
│   └── inference/
│       ├── model_registry.py    — Model loading + A/B comparison cache
│       └── predictor.py         — Inference wrapper
│
├── tests/                       — 61 tests covering model, data, API, training
├── docs/
│   ├── AUTORESEARCH_JOURNEY.md  — THE narrative — read this
│   └── ML_CONCEPTS.md           — LoRA, epochs, composite score explained
├── datasets/                    — Synthetic JSONL (gitignored, regenerated)
├── checkpoints/                 — Model weights (gitignored)
└── eval_sets/                   — Held-out evaluation data

Architecture

128 events → [Event Encoder]           2-layer transformer, causal attention
                │                      last-token pooling
                ↓
          [6 Prediction Heads]
                │
                ├─→ next_action (50-way softmax)
                ├─→ routing     (5-way softmax, class-weighted)
                ├─→ proactive   (binary sigmoid)
                ├─→ delay       (softplus regression, minutes)
                ├─→ relevance   (pairwise scoring)
                └─→ calibration (temperature per head)

Total parameters: ~2.7M trainable, checkpoint size: ~10 MB, load time: instant.
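
As a sanity check on those numbers, the arithmetic below assumes weights are stored as 32-bit floats (4 bytes per parameter). That storage format is an assumption, not something this README states.

```python
def checkpoint_megabytes(n_params, bytes_per_param=4):
    """Approximate checkpoint size in MB, ignoring metadata and optimizer state."""
    return n_params * bytes_per_param / 1e6


encoder_mb = checkpoint_megabytes(2_700_000)  # consistent with the ~10 MB figure
size_ratio = 4_000 / 10                       # 4 GB backbone vs 10 MB encoder
```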

An optional Gemma 3 1B backbone path exists (USE_BACKBONE = True in train.py) but is not recommended. Its best composite was 0.325 against 0.422 for the encoder-only path (the encoder scores about 30% higher), and it needs 4 GB of RAM. It's kept for reproducibility of the Round 1 experiments.

Autoresearch

PDM uses a Karpathy-style research loop:

1. prepare.py defines the metric and is immutable
2. autoresearch.py iterates over experiments (defined as Experiment dataclass instances)
3. For each experiment: rewrite train.py config from a template, run python train.py, parse result
4. Log to results.tsv; keep the checkpoint if composite improved
5. Restore train.py; move to next experiment
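
The steps above can be sketched as a small keep-if-improved loop. This is a simplified stand-in, not the contents of autoresearch.py: the `run_loop` name is an assumption, the training step is replaced by a lookup of made-up scores, and only `use_backbone` and `use_curriculum` are field names taken from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    name: str
    use_backbone: bool = False
    use_curriculum: bool = False


def run_loop(experiments, train_and_eval):
    """Run each experiment, log it, and keep it only if the composite improved."""
    best, best_score = None, float("-inf")
    log = []  # stands in for results.tsv
    for exp in experiments:
        score = train_and_eval(exp)  # stands in for: rewrite config, train, parse
        kept = score > best_score
        log.append((exp.name, score, kept))
        if kept:
            best, best_score = exp, score
    return best, log


# Made-up scores so the sketch runs without training anything.
FAKE_SCORES = {"baseline": 0.10, "curriculum": 0.30, "backbone": 0.20}
experiments = [
    Experiment("baseline"),
    Experiment("curriculum", use_curriculum=True),
    Experiment("backbone", use_backbone=True),
]
best, log = run_loop(experiments, lambda exp: FAKE_SCORES[exp.name])
```

The scoring function is passed in rather than imported so that the loop itself never touches the metric definition, mirroring the rule that prepare.py stays fixed.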

See program.md for the formal rules of what can and cannot be modified inside an autoresearch run.

Important: the loop only explores axes you put in it. Round 1 missed the entire USE_BACKBONE = False branch because the template hardcoded it to True. We only discovered that encoder-only was better when we ablated the axis manually. From Round 2 onward, the template exposes use_backbone and use_curriculum as first-class Experiment fields. Lesson: make every architectural assumption an explicit axis of the search space, or the loop will never test it.
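
One way to see the failure mode is to build the search grid from a dict of axes. In this sketch, `build_grid` and the `batch_size` axis are illustrative assumptions; `use_backbone` and `use_curriculum` are the field names mentioned above.

```python
from itertools import product


def build_grid(axes):
    """Cartesian product of all axis values, one config dict per combination."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*(axes[n] for n in names))]


# Round 1 style: the backbone choice is pinned, so it is never varied.
pinned = build_grid({
    "use_backbone": [True],
    "batch_size": [16, 32],
})

# Round 2+ style: architectural switches are ordinary axes.
explicit = build_grid({
    "use_backbone": [True, False],
    "use_curriculum": [True, False],
    "batch_size": [16, 32],
})

encoder_only_tested = any(not cfg["use_backbone"] for cfg in explicit)
```

No amount of sweeping over the pinned grid can surface the encoder-only result, because no configuration in it ever sets `use_backbone` to False.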

Known Limitations

Synthetic data only. All 7 rounds are evaluated on events generated by KnowledgeWorkerPersona, a deterministic persona simulator. No real user data has been tested. How the patterns transfer to real event streams is an open question.
Next-action top-1 is ~15%. For a 50-class problem that's roughly 7× the ~2% chance rate, but it still leaves a lot of headroom. A bigger encoder, longer training, or a retrieval-augmented variant could probably push this higher.
Routing accuracy is ~0.35. Class-weighted cross-entropy (sqrt-inverse frequency) improved this from near-majority-class collapse, but rare classes (escalate, ask) still have limited training signal.
The delay head (DelayHead) is new in Round 7. It learns the regression, but the absolute delay predictions are noisy. Treat the timing_accuracy metric as a directional signal, not ground truth.
Proactive head is new in Round 7. Labels are synthesized from event importance; real acceptance patterns are unknown.
prepare.py amendment in Round 7. We made one deliberate amendment to the immutable harness (read path changed to use DelayHead output). Metric formulas and composite weights stayed identical. See the journey doc for rationale.
No CI/CD. Tests exist but there's no GitHub Actions workflow yet.
No PyPI release. pyproject.toml is set up for local dev installs.
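
The sqrt-inverse-frequency weighting mentioned in the routing limitation above can be sketched as follows. The normalization to mean 1, the function names, and the label counts are assumptions for illustration; training/trainer.py may differ.

```python
import math
from collections import Counter


def sqrt_inverse_frequency_weights(labels):
    """Weight each class by 1 / sqrt(relative frequency), normalized to mean 1."""
    counts = Counter(labels)
    total = len(labels)
    raw = {cls: 1.0 / math.sqrt(n / total) for cls, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {cls: w / mean for cls, w in raw.items()}


def weighted_cross_entropy(probs, target, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    return -weights[target] * math.log(probs[target])


# Made-up, deliberately imbalanced lane labels.
labels = ["suggest"] * 64 + ["act"] * 32 + ["escalate"] * 4
weights = sqrt_inverse_frequency_weights(labels)

# The same predicted probability costs more when the true lane is rare.
loss_rare = weighted_cross_entropy({"escalate": 0.5}, "escalate", weights)
loss_common = weighted_cross_entropy({"suggest": 0.5}, "suggest", weights)
```

Taking the square root softens the correction: a class that is 16× rarer gets 4× the weight rather than 16×, which limits how much a handful of rare examples can dominate the gradient.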

Development

# Run all tests (61 tests, ~3 seconds)
pytest tests/ -v

# Train a single experiment
python train.py

# Run the full autoresearch sweep (~3 hours, 20 experiments)
python autoresearch.py

# Start the sidecar with auto-reload for development
uvicorn pdm_service.main:app --reload --port 8787

# Compare models via CLI
python compare.py --scenario ci_failure
python compare.py --all-scenarios

See CONTRIBUTING.md for the full contributor guide.

License

Apache 2.0. See LICENSE.

Acknowledgments

Gemma 3 1B (Google): used as a feature-extraction backbone in the Round 1 experiments. The Round 2 encoder-only ablation showed it was not the right choice for this task, and that ablation was the most valuable outcome of the whole project.
The Karpathy-style autoresearch pattern: iterating on training recipes against a frozen evaluation harness. This repo is an existence proof that the pattern works on small-scale problems.
