janaraj/pdm

PDM — Personal Dynamics Model

Status: Experimental / Alpha v0.1 — research artifact, not a production system. Expect breaking changes, synthetic-data-only validation, and known rough edges. See Known Limitations before building on it.

PDM is a behavioral prediction oracle — a small model that takes a stream of structured events (emails, CI failures, calendar reminders, PR reviews, code commits...) and predicts what the user will do next, how an assistant should respond, and when to proactively intervene.

It is not a chat model. It is not a language model. It consumes structured event dicts, not text, and its output is a set of predictions over small fixed vocabularies (50 actions, 5 routing lanes, etc.).

The interesting thing about this repo isn't the model — it's the autoresearch journey that shaped it. A tiny Karpathy-style autoresearch loop, run against an immutable evaluation harness, led us to some counterintuitive conclusions:

  • A 10 MB encoder beats a 4 GB Gemma-backbone model at this task by about 30% on the composite score (0.422 in Round 7 vs 0.325 in Round 1).
  • Data quality matters more than model size.
  • Pooling choice (mean vs causal + last-token) is an architecture decision hiding in what looks like an implementation detail.
  • An immutable prepare.py eval harness is the single most important design decision for keeping the research honest.

The Scoreboard

Seven rounds of autoresearch and deliberate architecture changes:

| Round | Change | Composite | Checkpoint size | Notes |
|---|---|---|---|---|
| 1 | Backbone (Gemma 3 1B + LoRA), 22 experiments | 0.325 | 4 GB | batch_16 winner |
| 2 | Encoder-only ablation | 0.395 | 10 MB | +21.5% composite, 400× smaller |
| 3 | Coherent synthetic data (event_type ↔ action) | 0.410 | 10 MB | Fixed a generator bug |
| 4 | 730-day dataset (2× data) | 0.420 | 10 MB | ECE 0.053 → 0.021 |
| 5 | Causal attention + last-token pooling | 0.423 | 10 MB | ECE 0.021 → 0.011 |
| 6 | Curriculum windows [5, 16, 32, 64, 128] | 0.408 | 10 MB | Short-scenario generalization |
| 7 | Proactive labels + DelayHead + class-weighted routing | 0.422 | 10 MB | All three previously-dead metrics revived |
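
The ECE figures in the notes column refer to expected calibration error. As a rough illustration only, a common equal-width-bin formulation is sketched below. The function name `expected_calibration_error` and the default of 10 bins are assumptions made for this sketch; the computation in prepare.py may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: predicted top-1 probabilities in [0, 1]
    correct:     booleans, True where the top-1 prediction was right
    n_bins=10 is an assumption, not a value taken from prepare.py.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    # Per-bin accumulators: [sum of confidence, number correct, count]
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp the index so conf == 1.0 lands in the last bin.
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[idx][0] += conf
        bins[idx][1] += int(ok)
        bins[idx][2] += 1
    ece = 0.0
    for conf_sum, n_correct, count in bins:
        if count == 0:
            continue
        avg_conf = conf_sum / count
        accuracy = n_correct / count
        ece += (count / n) * abs(accuracy - avg_conf)
    return ece
```

A lower value means predicted confidence tracks observed accuracy more closely, which is the direction the Round 4 and Round 5 notes report.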

The full story is in docs/AUTORESEARCH_JOURNEY.md. It's written to be read — not just a log, but a narrative of what we tried, what failed, and what the loop taught us about our own assumptions.


What PDM Predicts

Six specialized heads, all scored against a frozen prepare.py harness:

| Head | Task | Output |
|---|---|---|
| next_action | Top-k next actions | Softmax over 50 actions (reply_email, check_ci, fix_bug, ...) |
| routing | Response lane | Softmax over 5 lanes: ask / suggest / draft / act / escalate |
| proactive | Should intervene? | Binary + calibrated probability |
| delay | Time until next action | Non-negative minutes (softplus regression) |
| relevance | Memory artifact ranking | Pairwise scoring |
| calibration | Confidence calibration | Temperature per head |
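
To make the Output column more concrete, the sketch below shows the output transforms the table names (temperature-scaled softmax, sigmoid, softplus) in plain Python. The function names are illustrative. The real heads in models/prediction_heads.py presumably operate on tensors and may differ in detail.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with optional temperature; temperature > 1 flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function, e.g. for a binary 'intervene?' probability."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x):
    """log(1 + exp(x)) in a stable form; the result is never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

Dividing logits by a temperature leaves the arg-max unchanged and only rescales confidence, so a per-head temperature can adjust calibration without changing top-1 predictions.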

A weighted composite score combines metrics from five of the six heads (the delay head does not appear in the formula):

composite = 0.30·next_action_top1 + 0.25·routing_accuracy
          + 0.20·relevance_precision + 0.15·proactive_acceptance
          + 0.10·(1 - calibration_error)
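
The formula above translates directly into code. This is a minimal sketch of the arithmetic only, not a copy of prepare.py; the names `COMPOSITE_WEIGHTS` and `composite_score` are assumptions, and all inputs are taken to lie in [0, 1].

```python
# Weights copied from the composite formula in this README.
COMPOSITE_WEIGHTS = {
    "next_action_top1": 0.30,
    "routing_accuracy": 0.25,
    "relevance_precision": 0.20,
    "proactive_acceptance": 0.15,
    "calibration": 0.10,  # applied to (1 - calibration_error)
}


def composite_score(next_action_top1, routing_accuracy, relevance_precision,
                    proactive_acceptance, calibration_error):
    """Weighted sum of the five metrics; higher is better."""
    w = COMPOSITE_WEIGHTS
    return (
        w["next_action_top1"] * next_action_top1
        + w["routing_accuracy"] * routing_accuracy
        + w["relevance_precision"] * relevance_precision
        + w["proactive_acceptance"] * proactive_acceptance
        + w["calibration"] * (1.0 - calibration_error)
    )
```

Because the weights sum to 1, a model that is perfect on every metric scores 1.0. A model that gets everything wrong but has zero calibration error still scores 0.10, so the calibration term acts as a floor.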

Quick Start

# 1. Clone and install
git clone https://github.com/YOUR_ORG/pdm
cd pdm
pip install -e ".[dev]"

# 2. Generate synthetic training data (deterministic, seed=42)
python -c "
import json
import os
from training.synthetic_data import KnowledgeWorkerPersona
# datasets/ is gitignored, so create it on a fresh clone
os.makedirs('datasets', exist_ok=True)
for days, name in [(30,'30d'),(90,'90d'),(365,'365d'),(730,'730d')]:
    events = KnowledgeWorkerPersona(seed=42).generate(days=days)
    with open(f'datasets/synthetic_{name}.jsonl', 'w') as f:
        for e in events:
            f.write(json.dumps(e) + '\n')
"

# 3. Train (encoder-only + curriculum; ~10 min on MPS/CUDA, ~30 min on CPU)
python train.py

# 4. Start the playground + sidecar
uvicorn pdm_service.main:app --port 8787
# → http://127.0.0.1:8787/playground — interactive side-by-side model comparison
# → http://127.0.0.1:8787/docs       — OpenAPI docs for /predict, /rank, /route

# 5. Run the autoresearch loop (~3 hours for 20 experiments)
python autoresearch.py

The Playground

PDM ships with a web UI at http://127.0.0.1:8787/playground for interactively comparing any two model checkpoints on predefined scenarios:

  • 6 predefined scenarios (ci_failure, morning_email, deep_coding, afternoon_meetings, pr_review, end_of_day) — each anchored to a plausible time-of-day
  • Side-by-side next-action, routing, and proactive predictions
  • Dropdown to pick any two checkpoints — try best_model_round1_backbone vs best_model_round7 to see the 30% composite improvement in practice

Project Layout

pdm/
├── README.md                    — You are here
├── LICENSE                      — Apache 2.0
├── CHANGELOG.md                 — Version history
├── CONTRIBUTING.md              — How to contribute, autoresearch workflow
├── program.md                   — Autoresearch rules (what can/can't change)
├── pyproject.toml
│
├── train.py                     — Consolidated training script (encoder-only default)
├── autoresearch.py              — Karpathy-style hyperparameter sweep
├── prepare.py                   — IMMUTABLE evaluation harness
├── compare.py                   — Model comparison CLI + SCENARIOS dict
├── results.tsv                  — Experiment leaderboard
│
├── models/
│   ├── event_encoder.py         — 2-layer transformer, causal + last-token pooling
│   ├── prediction_heads.py      — 6 heads: next_action, routing, proactive, delay, relevance, calibration
│   ├── pdm_model.py             — Wires everything together
│   └── backbone.py              — Optional Gemma 3 1B + LoRA (not recommended)
│
├── training/
│   ├── synthetic_data.py        — KnowledgeWorkerPersona + label rules
│   ├── dataset.py               — PdmDataset, window encoding
│   └── trainer.py               — compute_loss, training loop utilities
│
├── pdm_service/
│   ├── main.py                  — FastAPI app
│   ├── api/
│   │   ├── routes_predict.py    — /predict, /rank, /route
│   │   ├── routes_train.py      — /train async background job
│   │   ├── routes_evaluate.py   — /evaluate
│   │   └── routes_playground.py — /playground HTML + /api/compare
│   └── inference/
│       ├── model_registry.py    — Model loading + A/B comparison cache
│       └── predictor.py         — Inference wrapper
│
├── tests/                       — 61 tests covering model, data, API, training
├── docs/
│   ├── AUTORESEARCH_JOURNEY.md  — THE narrative — read this
│   └── ML_CONCEPTS.md           — LoRA, epochs, composite score explained
├── datasets/                    — Synthetic JSONL (gitignored, regenerated)
├── checkpoints/                 — Model weights (gitignored)
└── eval_sets/                   — Held-out evaluation data

Architecture

128 events → [Event Encoder]           2-layer transformer, causal attention
                │                      last-token pooling
                ↓
          [6 Prediction Heads]
                │
                ├─→ next_action (50-way softmax)
                ├─→ routing     (5-way softmax, class-weighted)
                ├─→ proactive   (binary sigmoid)
                ├─→ delay       (softplus regression, minutes)
                ├─→ relevance   (pairwise scoring)
                └─→ calibration (temperature per head)

Total parameters: ~2.7M trainable, checkpoint size: ~10 MB, load time: instant.

An optional Gemma 3 1B backbone path exists (USE_BACKBONE = True in train.py) but is not recommended: its best composite was 0.325, against 0.422 for the encoder-only path (about 30% higher), and it uses 4 GB of RAM. It is kept so the Round 1 experiments stay reproducible.
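
The encoder's last-token pooling, and the mean pooling it is contrasted with earlier in this README, can be compared in a few lines. This is a toy illustration on plain Python lists, not the code in models/event_encoder.py. The names `mean_pool`, `last_token_pool`, and `causal_mask` are hypothetical, and the sketch assumes right-padded windows with a known count of real events.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def _check(hidden, length):
    if not 0 < length <= len(hidden):
        raise ValueError("length must be between 1 and the window size")


def mean_pool(hidden, length):
    """Average the first `length` event vectors, ignoring right padding."""
    _check(hidden, length)
    dim = len(hidden[0])
    return [sum(hidden[t][d] for t in range(length)) / length for d in range(dim)]


def last_token_pool(hidden, length):
    """Return the vector at the last real (non-padding) position."""
    _check(hidden, length)
    return list(hidden[length - 1])


# Example: a window of 3 slots holding 2 real events and 1 padding row.
window = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
summary_mean = mean_pool(window, length=2)        # [2.0, 3.0]
summary_last = last_token_pool(window, length=2)  # [3.0, 4.0]
```

Under a causal mask, the last real position is the only one that has attended to every earlier event, which is why causal attention and last-token pooling are paired. Mean pooling would instead blend in early positions that saw only a prefix of the window.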


Autoresearch

PDM uses a Karpathy-style research loop:

  1. prepare.py defines the metric and is immutable
  2. autoresearch.py iterates over experiments (defined as Experiment dataclass instances)
  3. For each experiment: rewrite train.py config from a template, run python train.py, parse result
  4. Log to results.tsv; keep the checkpoint if composite improved
  5. Restore train.py; move to next experiment

See program.md for the formal rules of what can and cannot be modified inside an autoresearch run.

Important: the loop only explores the axes you put in it. Round 1 missed the entire USE_BACKBONE = False branch because the template hardcoded it to True, and we only discovered that encoder-only was better when we ablated that axis manually. From Round 2 onward, the template exposes use_backbone and use_curriculum as first-class Experiment fields. Lesson: design your search space to include the architectural assumptions you would otherwise miss.
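
The loop and the search-space lesson above can be sketched as follows. This is a simplified illustration under stated assumptions, not the contents of autoresearch.py. Only `use_backbone` and `use_curriculum` are field names taken from this README; `learning_rate`, `default_grid`, `run_sweep`, and the `train_and_eval` callback (a stand-in for "rewrite the train.py config, run it, parse the composite") are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Experiment:
    name: str
    use_backbone: bool = False    # field named in this README
    use_curriculum: bool = True   # field named in this README
    learning_rate: float = 1e-3   # hypothetical extra axis, for illustration only


def default_grid() -> List[Experiment]:
    """Enumerate both values of each architectural switch so no branch is skipped."""
    return [
        Experiment(name=f"backbone={b}_curriculum={c}",
                   use_backbone=b, use_curriculum=c)
        for b in (False, True)
        for c in (False, True)
    ]


def run_sweep(
    experiments: Iterable[Experiment],
    train_and_eval: Callable[[Experiment], float],
) -> Tuple[Optional[Experiment], List[Tuple[str, float, bool]]]:
    """Run every experiment; return the best one plus leaderboard rows.

    Each row is (name, composite, improved), where `improved` marks runs
    whose checkpoint would be kept because the composite went up.
    """
    best: Optional[Experiment] = None
    best_score = float("-inf")
    rows: List[Tuple[str, float, bool]] = []
    for exp in experiments:
        score = train_and_eval(exp)
        improved = score > best_score
        rows.append((exp.name, score, improved))
        if improved:
            best, best_score = exp, score
    return best, rows
```

Keeping the architectural switches as explicit dataclass fields means a grid like `default_grid()` cannot silently pin one of them, which is the failure mode described for Round 1.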


Known Limitations

  • Synthetic data only. All 7 rounds are evaluated on events generated by KnowledgeWorkerPersona, a deterministic persona simulator. No real user data has been tested. How the patterns transfer to real event streams is an open question.
  • Next-action top-1 accuracy is ~15%. For a 50-class problem that is roughly 7× the random baseline (~2%), but it still leaves a lot of headroom. A bigger encoder, longer training, or a retrieval-augmented variant could probably push this higher.
  • Routing accuracy is ~35%. Class-weighted cross-entropy (sqrt-inverse frequency) pulled it out of near-majority-class collapse, but rare classes (escalate, ask) still have limited training signal.
  • The delay head (DelayHead) is new in Round 7. It learns the regression, but its absolute delay predictions are noisy. Treat the timing_accuracy metric as a directional signal, not as ground truth.
  • Proactive head is new in Round 7. Labels are synthesized from event importance; real acceptance patterns are unknown.
  • prepare.py amendment in Round 7. We made one deliberate amendment to the immutable harness (read path changed to use DelayHead output). Metric formulas and composite weights stayed identical. See the journey doc for rationale.
  • No CI/CD. Tests exist but there's no GitHub Actions workflow yet.
  • No PyPI release. pyproject.toml is set up for local dev installs.
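
The sqrt-inverse-frequency class weighting mentioned in the routing limitation above could be computed along these lines. This is a sketch under assumptions: the function name is hypothetical, and both the normalisation to a mean weight of 1 and the handling of unseen classes are choices made for this example, not details confirmed by the repository.

```python
import math
from collections import Counter


def sqrt_inverse_frequency_weights(labels, classes):
    """Per-class loss weights proportional to sqrt(N / count_c).

    The square root softens plain inverse-frequency weighting, so rare
    classes are boosted without dominating the loss.
    """
    if not labels or not classes:
        raise ValueError("labels and classes must be non-empty")
    counts = Counter(labels)
    n = len(labels)
    raw = {}
    for c in classes:
        # Treat unseen classes as count 1 to avoid division by zero (assumption).
        count = max(counts.get(c, 0), 1)
        raw[c] = math.sqrt(n / count)
    mean = sum(raw.values()) / len(raw)
    # Normalise so the average weight is 1 (assumption, keeps the loss scale stable).
    return {c: w / mean for c, w in raw.items()}
```

With a class that is 16 times rarer than another, plain inverse frequency would weight it 16× higher, while the square-root variant weights it 4× higher.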

Development

# Run all tests (61 tests, ~3 seconds)
pytest tests/ -v

# Train a single experiment
python train.py

# Run the full autoresearch sweep (~3 hours, 20 experiments)
python autoresearch.py

# Start the sidecar with auto-reload for development
uvicorn pdm_service.main:app --reload --port 8787

# Compare models via CLI
python compare.py --scenario ci_failure
python compare.py --all-scenarios

See CONTRIBUTING.md for the full contributor guide.


License

Apache 2.0. See LICENSE.

Acknowledgments

  • Gemma 3 1B (Google) — used as a feature-extraction backbone in the Round 1 experiments. The autoresearch loop ultimately found it was not the right choice for this task, but the ablation was the most valuable outcome of the whole project.
  • The Karpathy-style autoresearch pattern (iterating on training recipes against a frozen evaluation harness). This repo is an existence proof that the pattern works on small-scale problems.
