imodels-evolve

Autonomous agentic discovery of interpretable scikit-learn regressors.

Quick start • Repo layout • Discovered models • How the loop works • Paper

A coding agent (Claude Code or OpenAI Codex) is given a fixed evaluation harness and a single Python file. It then iteratively rewrites that file to jointly optimize:

Predictive performance — mean RMSE rank across 65 tabular regression datasets (TabArena + PMLB).
Interpretability — fraction of LLM-graded tests passed, covering feature attribution, point simulation, sensitivity, counterfactuals, structural questions, and complex-function probes against the model's __str__ output.

The result is a library of scikit-learn-compatible regressors whose string representations are explicitly optimized to be read by another LLM — interpretable by agents, not just by humans. From ~700 evolved candidates we curate 10 Pareto models in agentic-imodels.

Quick start

Requirements: Python 3.10+ and uv.

git clone https://github.com/csinva/imodels-evolve
cd imodels-evolve
uv sync

Use the curated discovered models

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from agentic_imodels import HingeEBMRegressor

X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = HingeEBMRegressor()
model.fit(X_tr, y_tr)

print(model)               # human/LLM-readable equation
preds = model.predict(X_te)

Every estimator follows the standard BaseEstimator + RegressorMixin contract, so it drops into Pipeline, cross_val_score, GridSearchCV, etc.

Run the agentic loop yourself

cd evolve
uv run run_baselines.py            # establish baseline scores
uv run interpretable_regressor.py  # one experiment iteration

Then point a coding agent at evolve/program.md:

Read and follow the instructions in evolve/program.md.

The agent edits evolve/interpretable_regressor.py in a loop, commits each attempt, and logs to evolve/results/overall_results.csv. See evolve/readme.md for the full protocol. A parallel setup for Codex lives in evolve_codex/.

Repo layout

Folder	Purpose
`evolve/`	The Claude-driven agentic loop — fixed harness (`run_baselines.py`, `src/`), agent-edited model file (`interpretable_regressor.py`), agent prompt (`program.md`).
`evolve_codex/`	Same loop, OpenAI Codex agent.
`result_libs/`	Raw per-run output: every regressor the agent wrote during each loop, grouped by date / agent / effort. Includes `combined_results.csv` and `pareto_evolved.csv` aggregating all runs.
`result_libs_processed/agentic-imodels/`	Curated, installable Python package of 10 Pareto-frontier models drawn from `result_libs/`.
`generalization_experiments/`	Re-evaluate evolved models on new OpenML regression suites and a new 157-test interpretability suite to check generalization.
`e2e_experiments/`	Downstream end-to-end evaluation: equip Claude Code, Codex, and Copilot CLI with the evolved models and measure their performance on the BLADE benchmark.
`paper-imodels-agentic/`	NeurIPS 2026 paper source (`main.tex`, `figures/`, `tables/`).

Discovered models

Curated highlights from agentic-imodels. Rank is mean global RMSE rank across 65 dev datasets (lower is better). Test interp is fraction passed on the held-out 157-test generalization suite. Reference points: TabPFN baseline rank 94.5 / test interp 0.17; OLS baseline rank 354.5 / test interp 0.69.

Class	Rank ↓	Dev interp ↑	Test interp ↑	Idea
`HingeEBMRegressor`	108.2	0.65	0.71	Lasso on hinge basis + hidden EBM on residuals; sparse linear display.
`DistilledTreeBlendAtlasRegressor`	139.7	1.00	0.71	Ridge student distilled from GBM+RF teachers, shown with a probe-answer "atlas" card.
`DualPathSparseSymbolicRegressor`	163.5	0.70	0.71	GBM/RF/Ridge blend for `predict`, sparse symbolic equation for display.
`HybridGAM`	163.8	0.72	0.68	SmartAdditiveGAM display + hidden RF residual corrector.
`TeacherStudentRuleSplineRegressor`	204.0	0.61	0.80	GBM teacher + sparse symbolic student over hinge/step/interaction terms.
`SparseSignedBasisPursuitRegressor`	272.7	0.67	0.76	Forward-selected signed basis (linear/hinge/square/interaction) + ridge refit.
`HingeGAMRegressor`	280.2	0.56	0.78	Pure Lasso on a 10-breakpoint hinge basis; predict = display.
`WinsorizedSparseOLSRegressor`	326.9	0.65	0.73	Clip features to `[p1, p99]`, LassoCV select top-8, OLS refit.
`TinyDTDepth2Regressor`	334.0	0.67	0.71	Depth-2 decision tree (4 leaves).
`SmartAdditiveRegressor`	354.3	0.74	0.73	Adaptive-linearization GAM — Laplacian-smoothed boosted stumps per feature.

Two stylistic patterns emerge:

Display-predict decoupled (HingeEBM, HybridGAM, DistilledTreeBlendAtlas, DualPathSparseSymbolic, TeacherStudentRuleSpline) — a hidden corrector improves prediction while __str__ stays a clean formula. Pick these for the lowest predictive rank.
Honest (SmoothGAM, HingeGAM, WinsorizedSparseOLS, SparseSignedBasisPursuit, TinyDT) — predict and __str__ agree, no silent corrector. Pick these when the printed formula must actually be what runs.

How the loop works

        +-----------------------------+
        | program.md (agent prompt)   |
        +-----------------------------+
                       |
                       v
   +----------------------------------------+
   | edit interpretable_regressor.py        |  <-- only file the agent touches
   +----------------------------------------+
                       |
                       v
   +----------------------------------------+
   | run_baselines.py / src/performance_eval|  predictive performance (rank)
   | src/interp_eval.py                     |  43 LLM-graded interp tests
   +----------------------------------------+
                       |
                       v
   +----------------------------------------+
   | results/overall_results.csv            |  keep / discard / crash
   +----------------------------------------+
                       |
                       └──> next iteration

Each iteration is a single git commit. Both metrics matter — neither is a hard constraint. The agent is asked to find Pareto improvements over a strong baseline panel (OLS, Lasso, RidgeCV, EBM, RandomForest, GBM, TabPFN, …). See evolve/program.md for the exact protocol the agent follows.

Generalization & end-to-end results

Generalization (generalization_experiments/): the evolved models retain their Pareto advantage on new OpenML regression suites and on a new 157-test interpretability suite written from scratch (separate from the 43 dev tests).
End-to-end ADS (e2e_experiments/): plugging the evolved models into Claude Code, Codex, and Copilot CLI improves their scores on the BLADE end-to-end data-science benchmark by 8%–47% vs. standard interpretability tools.

License

MIT.

Name		Name	Last commit message	Last commit date
Latest commit History 171 Commits
e2e_experiments		e2e_experiments
evolve		evolve
evolve_codex		evolve_codex
generalization_experiments		generalization_experiments
result_libs		result_libs
result_libs_processed		result_libs_processed
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
claude.md		claude.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

imodels-evolve

Quick start

Use the curated discovered models

Run the agentic loop yourself

Repo layout

Discovered models

How the loop works

Generalization & end-to-end results

Related

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

imodels-evolve

Quick start

Use the curated discovered models

Run the agentic loop yourself

Repo layout

Discovered models

How the loop works

Generalization & end-to-end results

Related

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages