Extreme AI compression meets perfect memory — locally, for free.
An open experiment combining two bleeding-edge ideas:
- TurboQuant/PolarQuant — Google Research's optimal vector quantization for AI models (paper, ICLR 2026)
- MemPalace — The highest-scoring AI memory system ever benchmarked (96.6% LongMemEval, repo)
The goal: run powerful AI models on cheap hardware with perfect long-term memory. No cloud. No subscription. Your machine, your data.
Right now, local AI has two problems:
- Models are too big — A good model needs 16-48GB VRAM. Most people don't have that.
- AI forgets everything — Every session starts from zero. Months of context, gone.
TurboQuant solves the first. MemPalace solves the second. Nobody has combined them yet.
```
turbomem/
├── README.md                ← You are here
├── research/
│   ├── turboquant.md        ← Paper analysis + implementation notes
│   └── mempalace.md         ← Architecture analysis + integration plan
├── src/
│   ├── polarquant/          ← PolarQuant implementation (experimental)
│   │   ├── __init__.py
│   │   ├── rotation.py      ← Random rotation matrices
│   │   ├── quantizer.py     ← Optimal scalar quantization
│   │   └── polar.py         ← Full PolarQuant pipeline
│   ├── qjl/                 ← QJL 1-bit error correction
│   │   ├── __init__.py
│   │   └── qjl.py           ← Johnson-Lindenstrauss transform
│   └── integration/         ← MemPalace + TurboQuant glue
│       └── __init__.py
├── benchmarks/              ← Reproducible benchmark scripts
│   └── README.md
├── examples/                ← Quick-start examples
│   └── basic_usage.py
├── requirements.txt
└── LICENSE
```
| Component | Status | Notes |
|---|---|---|
| PolarQuant (random rotation + quantization) | 🟡 Experimental | Based on paper, not Google's impl |
| QJL (1-bit error correction) | 🟡 Experimental | Core algorithm implemented |
| MemPalace integration | 🔴 Planned | Waiting for stable MCP interface |
| Ollama/llama.cpp bridge | 🔴 Planned | KV-cache compression hook |
| Benchmarks | 🔴 Planned | Need LongMemEval + perplexity tests |
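To make the "random rotation + quantization" row concrete, here is a minimal sketch of the PolarQuant idea as we understand it from the paper: rotate a vector by a random orthogonal matrix (which spreads energy evenly across coordinates), then apply uniform scalar quantization in the rotated basis. This is our own illustration, not Google's implementation; the function names `random_rotation`, `quantize`, and `dequantize` are ours, and the sign-correction trick for Haar-uniform sampling is a standard QR-based recipe.

```python
import numpy as np

def random_rotation(d, seed=0):
    """Sample a random orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Multiply column j by sign(r_jj) so the result is Haar-uniform.
    return q * np.sign(np.diag(r))

def quantize(x, bits=4):
    """Uniform scalar quantization of a vector to `bits` bits per entry."""
    levels = (1 << bits) - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels or 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes * scale + lo

d = 64
R = random_rotation(d)
v = np.random.default_rng(1).standard_normal(d)
codes, lo, scale = quantize(R @ v)          # 4-bit codes in the rotated basis
v_hat = R.T @ dequantize(codes, lo, scale)  # rotate back to reconstruct
err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

How close this is to "optimal" quantization depends on the level placement, which is exactly the part the paper works out and this sketch leaves naive.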
This is an open challenge. We're looking for people who can:
- Implement — Turn the TurboQuant paper into working code
- Benchmark — Test compression vs accuracy on real models
- Integrate — Wire MemPalace's ChromaDB into the quantized pipeline
- Break it — Find edge cases, prove us wrong, make it better
- 🔥 PolarQuant rotation matrix — Implement the random orthogonal rotation that makes vectors quantization-friendly. The paper describes it, but optimal rotation selection is non-trivial.
- 🔥 QJL estimator — The 1-bit error correction needs a careful estimator that balances high-precision queries with low-precision data. Math-heavy, high impact.
- 🔥 KV-cache hook — Where do you intercept llama.cpp's KV-cache to inject TurboQuant compression? This is the integration challenge.
- MemPalace MCP bridge — Connect MemPalace's memory retrieval to a quantized local model so the model remembers everything while running on minimal hardware.
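For the QJL estimator challenge, the core trick in 1-bit Johnson-Lindenstrauss sketches is that keys keep only the sign bit of each projected coordinate (plus their norm), while queries keep full precision, and the identity E[sign(Sk)ᵢ·(Sq)ᵢ] = √(2/π)·⟨q,k⟩/‖k‖ for Gaussian S gives an unbiased inner-product estimate. The sketch below is our illustration of that identity, not the repo's `qjl.py`; `encode_key` and `inner_estimate` are hypothetical names, and `m` is set large for a tight estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 8192                    # key/query dimension, sketch dimension
S = rng.standard_normal((m, d))    # shared Gaussian projection

def encode_key(k):
    """Store a key as 1 sign bit per sketch coordinate plus its norm."""
    return np.signbit(S @ k), float(np.linalg.norm(k))

def inner_estimate(q, key_bits, key_norm):
    """Unbiased <q, k> estimate from the 1-bit key sketch and a
    full-precision query sketch, using sqrt(pi/2) to undo the sign bias."""
    signs = np.where(key_bits, -1.0, 1.0)
    return np.sqrt(np.pi / 2) * key_norm / m * float((S @ q) @ signs)

k = rng.standard_normal(d)
q = k + 0.5 * rng.standard_normal(d)   # correlated query, as in attention
bits, norm_k = encode_key(k)
est = inner_estimate(q, bits, norm_k)  # ≈ q @ k, from 1 bit per coordinate
```

The estimation error shrinks like 1/√m, so the open question is how small `m` can be before attention-score quality degrades — that is the "careful estimator" trade-off.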
```
git clone https://github.com/Snakkaz/turbomem.git
cd turbomem
pip install -r requirements.txt
python examples/basic_usage.py
```

- TurboQuant: Optimal Vector Quantization — Meng et al., ICLR 2026
- PolarQuant: Quantization via Random Rotation — AISTATS 2026
- QJL: 1-bit Quantized Johnson-Lindenstrauss — AAAI 2025
- MemPalace — Jovovich & Sigman, 2026
- llama.cpp — Local LLM inference
TurboQuant compression + MemPalace memory = TurboMem. Fast models that never forget.
MIT — use it, fork it, improve it.
Built by Petersen Digital Consulting 🇳🇴 — Making local AI accessible.