SMO is a research repository for memory-efficient optimizer experiments in PyTorch.
The core idea is to reduce optimizer-state memory by compressing first- and second-order moments, with current work spanning:
- Spatial optimizer-state compression (SMO-Spatial, SMO-8bit)
- Spectral optimizer-state compression (SMO-Walsh, SMO-DCT) — experimental
- Activation-memory compression — early experiments
- Triton kernels for NVIDIA GPUs — future work
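The spatial idea in one picture: store each 2-D moment at reduced resolution (average pooling) and upsample it back when the parameter update needs full resolution. A minimal pure-Python sketch of that round trip; the actual optimizers operate on PyTorch tensors, and `downsample`/`upsample` are illustrative names, not repository APIs:

```python
def downsample(m, f):
    """Average-pool a 2-D list `m` by factor `f` per side (spatial compression)."""
    rows, cols = len(m), len(m[0])
    return [[sum(m[i * f + di][j * f + dj] for di in range(f) for dj in range(f)) / (f * f)
             for j in range(cols // f)]
            for i in range(rows // f)]

def upsample(s, f):
    """Nearest-neighbour upsample: each stored value covers an f x f patch."""
    return [[s[i // f][j // f] for j in range(len(s[0]) * f)]
            for i in range(len(s) * f)]

# A 4x4 "moment" stored at quarter size (factor 2 per side -> 1/4 the elements).
moment = [[1.0, 1.0, 3.0, 3.0],
          [1.0, 1.0, 3.0, 3.0],
          [5.0, 5.0, 7.0, 7.0],
          [5.0, 5.0, 7.0, 7.0]]
small = downsample(moment, 2)    # 2x2: [[1.0, 3.0], [5.0, 7.0]]
restored = upsample(small, 2)    # back to 4x4 for the parameter update
```

Only `small` lives in optimizer state between steps, which is where the memory savings come from; the reconstruction is lossy in general (exact here only because the example is block-constant).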
| Dataset | Model | Adam (baseline) | SMO k=0.25 | SMO-8bit k=0.25 |
|---|---|---|---|---|
| MNIST | SimpleCNN | 99.04% / 3.22 MB | 98.90% / 0.35 MB (89.1% savings) | 98.97% / 0.21 MB (93.5%) |
| CIFAR-10 | CIFAR_CNN | 66.91% / 4.74 MB | 63.59% / 0.99 MB (79.1%) | — |
| MiniGPT | 4-layer Transformer | PPL 65.31 / 6.21 MB | — | PPL 66.58 / 0.43 MB (93.1%) |
Spectral Variants (CIFAR-10, 3 epochs):
- Walsh Pure k=0.5: 64.47% / 1.74 MB → 63.3% savings (+1.44% vs Adam)
- Walsh Hybrid k=0.5: 61.24% / 1.74 MB (−1.79%)
- DCT Hybrid k=0.5: 62.74% / 1.74 MB (−0.29%)
- DCT Pure k=0.5: 63.10% / 1.74 MB → 63.3% savings (−0.07% vs Adam, after fix)
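The Walsh variants build on the fast Walsh–Hadamard transform, which uses only additions and subtractions and is its own inverse up to a factor of 1/n, so a moment can be moved to the transform domain, truncated to its largest coefficients (the k ratio), and cheaply reconstructed. A minimal sketch under those assumptions (power-of-two length; not the repository's implementation):

```python
def fwht(x):
    """In-place fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    return x

v = [4.0, 2.0, 2.0, 0.0]
coeffs = fwht(v[:])                                   # [8.0, 4.0, 4.0, 0.0]
# Keep the largest-magnitude coefficients, zero the rest (k=0.5 -> keep 2 of 4).
kept = sorted(range(4), key=lambda i: -abs(coeffs[i]))[:2]
sparse = [c if i in kept else 0.0 for i, c in enumerate(coeffs)]
approx = [c / 4 for c in fwht(sparse[:])]             # inverse = forward / n
```

Storing only the kept coefficients gives the memory savings; `approx` is the lossy reconstruction used at update time.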
Takeaway: Spatial+8bit compression achieves >90% memory reduction on simple and medium tasks with minimal accuracy loss. On harder tasks (CIFAR-10) the trade-off is larger (−3% at k=0.25) but improves with less aggressive compression (−2% at k=0.5). Surprisingly, Walsh Pure outperforms Adam on CIFAR-10 — worth deeper investigation.
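The 8-bit variants add a second axis of compression on top of pooling: surviving moment entries are stored as int8 with a shared scale instead of fp32, cutting the remaining state by roughly 4x. A minimal absmax quantize/dequantize sketch (illustrative only; the repository's exact scheme may differ, e.g. block-wise scales):

```python
def quantize(values):
    """Absmax int8 quantization: map [-max|v|, max|v|] onto [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [round(v / scale) for v in values]     # int8 codes
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

moment = [0.02, -0.5, 0.13, 0.004]
q, scale = quantize(moment)        # four int8 codes + one fp32 scale
approx = dequantize(q, scale)      # each entry within scale/2 of the original
```

Per-element error is bounded by half the scale, which is why accuracy holds up well when moment magnitudes within a tensor are reasonably uniform.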
supermario_optimizer/
├── smo/
│ ├── optimizers/ # Stable: SMO-Spatial, SMO-Spatial-8bit
│ ├── activations/ # Experimental: activation compression
│ ├── experimental/ # Research: spectral (Walsh, DCT), Triton
│ └── utils/ # Shared utilities
├── benchmarks/
│ ├── suites/ # Canonical benchmark entrypoints
│ │ ├── optimizer_step/ # Microbench: step time, memory
│ │ ├── training/ # End-to-end: MNIST, CIFAR-10, MiniGPT
│ │ ├── activations/ # Activation compression tests
│ │ ├── spectral/ # Spectral variant baselines
│ │ └── comparison/ # Multi-seed, multi-hardware
│ ├── runners/ # Hardware-specific launchers (Modal, DirectML)
│ ├── results/ # JSON outputs (aggregate + per-run)
│ ├── METHODOLOGY.md # Benchmarking standards
│ └── CATALOG.md # Inventory of all suites
├── profiles/ # Profiling scripts (torch.profiler)
├── docs/
│ ├── PROJECT_FOUNDATION.md # Naming, taxonomy, layering
│ └── ROADMAP.md # Phase-by-phase progress
├── tests/ # Unit tests
├── CHANGELOG.md # Version history
└── TODO.md # Current task tracking
- Project Foundation — naming decisions, variant taxonomy, repository organization → docs/PROJECT_FOUNDATION.md
- Roadmap — Phase 1–6 progress, current status, next actions → docs/ROADMAP.md
- Benchmark Methodology — seeding policy, hardware matrix, result format → benchmarks/METHODOLOGY.md
- Benchmark Catalog — canonical suites, categories, legacy wrappers → benchmarks/CATALOG.md
- Changelog — versioned releases and unreleased changes → CHANGELOG.md
- Task Tracking — in-progress work and blockers → TODO.md
# Microbenchmark: optimizer step time (multi-shape, multi-seed)
python -m benchmarks.suites.optimizer_step.benchmark_step_time \
--shapes 256,512,1024 --seeds 1234,5678,9012 --device cpu
# End-to-end: MNIST (SMO vs Adam)
python -m benchmarks.suites.training.benchmark_mnist --epochs 5 --seed 1234
# End-to-end: CIFAR-10
python -m benchmarks.suites.training.benchmark_cifar10 --epochs 5 --seed 1234
# End-to-end: 8-bit variant (MNIST)
python -m benchmarks.suites.training.benchmark_8bit --epochs 5 --seed 1234
# End-to-end: MiniGPT smoke (200 iters)
python -m benchmarks.suites.training.benchmark_minillm --max_iters 200 --seed 1234
# Spectral variants baseline (CIFAR-10, 3 epochs)
python -m benchmarks.suites.spectral.benchmark_spectral_cpu --epochs 3 --seed 1234
# Profiling: step breakdown (CPU)
python profiles/profile_smo_step.py --shape 1024,1024 --steps 100 --seed 1234

All results are saved to benchmarks/results/ as JSON bundles.
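Because results land as JSON bundles, downstream analysis needs no repository code. A sketch of the multi-seed mean ± std aggregation the microbenchmarks report; the bundle fields shown here are assumptions, not the repository's actual schema:

```python
import json
from statistics import mean, stdev

# Illustrative bundle shape: one timing entry per seed (field names are assumptions).
bundle = json.loads("""
{"suite": "optimizer_step", "shape": [1024, 1024],
 "runs": [{"seed": 1234, "step_ms": 29.9},
          {"seed": 5678, "step_ms": 30.1},
          {"seed": 9012, "step_ms": 29.6}]}
""")

times = [r["step_ms"] for r in bundle["runs"]]
print(f"{bundle['suite']}: {mean(times):.2f} ± {stdev(times):.2f} ms over {len(times)} seeds")
# → optimizer_step: 29.87 ± 0.25 ms over 3 seeds
```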
Phase 2 — Correctness & Measurement: ✅ COMPLETE (2026-05-06)
- Deterministic seeding across 12+ canonical benchmarks
- Multi-seed aggregation (mean±std) in microbenchmarks
- GPU sync infrastructure (CUDA/DirectML)
- Memory accounting corrected (temp buffers excluded)
- Baseline suite executed: MNIST, CIFAR-10, MiniGPT, Spectral variants
Phase 3 — Bottleneck Analysis: 🔄 AWAITING GPU
- Profiling suite profiles/profile_smo_step.py ready (manual timers + torch.profiler)
- CPU profiling data: compression 41%, upsampling 31%, update 3%
- Blocked: local DirectML GPU >90% utilization
Phase 4 — Optimization Work:
- ✅ 4.1 Buffer reuse — 8–24% CPU speedup, memory accounting fixed
- ⏳ 4.2 Post-profiling — pending GPU data (pooling kernel eval, upsampling bypass)
SMO-Spatial (CPU step time, after buffer reuse):
| Shape | AdamW | SMO (before) | SMO (after) | Improvement |
|---|---|---|---|---|
| 256×256 | 1.43 ms | 3.87 ms | 2.93 ms | −24% |
| 512×512 | 4.67 ms | 9.97 ms | 8.97 ms | −10% |
| 1024×1024 | 20.16 ms | 32.36 ms | 29.86 ms | −8% |
Memory Compression (k_ratio=0.25):
| Task | Adam State | SMO State | Reduction |
|---|---|---|---|
| MNIST (2 linear layers) | 3.22 MB | 0.35 MB | 89.1% |
| CIFAR-10 (2 linear + 3 conv) | 4.74 MB | 0.99 MB | 79.1% |
| MiniGPT (8 linear) | 6.21 MB | 0.43 MB (8bit) | 93.1% |
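As a sanity check, the reduction column is simply 1 − (SMO state / Adam state); recomputing it from the reported sizes:

```python
# State sizes in MB, taken from the table above (MiniGPT uses the 8-bit state).
rows = {
    "MNIST":    (3.22, 0.35),
    "CIFAR-10": (4.74, 0.99),
    "MiniGPT":  (6.21, 0.43),
}
for task, (adam_mb, smo_mb) in rows.items():
    print(f"{task}: {100 * (1 - smo_mb / adam_mb):.1f}% reduction")
```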
- Accuracy gap on complex tasks: −3.32% on CIFAR-10 at k=0.25; consider k=0.5 for better quality (−2.08% gap)
- GPU profiling pending: DirectML machine at high utilization; will run profiling when <50%
MIT — see LICENSE for details.
This is a research project. In-progress work and blockers are tracked in TODO.md; for questions or bug reports, open a GitHub issue.