Sandermage/sndr_core_engine

SNDR Core Engine

Genesis vLLM Patches — runtime patch-overlay for vLLM on consumer NVIDIA Ampere / Ada / Blackwell.

Runtime patches for vLLM — Qwen3.6-class inference on consumer NVIDIA Ampere / Ada / Blackwell with TurboQuant k8v4 KV cache, MTP K=5 spec-decode, tool-calling, and 256K-class context. 321 patches across ~23 families. Apache 2.0.


🚀 New here? → docs/GETTING_STARTED.md: who it's for, what you get, and the one install line.
🧠 New to local AI? → docs/LOCAL_AI_PRIMER.md: GPUs, engines, MoE, and quants in plain English.
📖 Hit an unfamiliar term (TPS · KV · MTP · TurboQuant · GDN)? → docs/GLOSSARY.md.
💸 Self-host or cloud? → docs/COMPARISONS.md: the cost-crossover trade.

What it is

A drop-in runtime patcher for vLLM. It pins to a specific vLLM nightly commit and applies 321 small, surgical changes (text edits at known anchors, class-rebind wrappers, and FastAPI middleware) that together turn an out-of-the-box vLLM into a production-grade Qwen3.6 inference server on consumer NVIDIA hardware (3090, 4090, 5090, A5000, A6000, …), whereas upstream vLLM mostly targets datacenter SKUs.

It is not a fork of vLLM, a quantizer, a new inference engine, or a training framework. Patches retire automatically when upstream merges the underlying fix.

How it works

The overlay / apply model. Genesis never edits vLLM on disk. At every process start the plugin registers via vLLM's vllm.general_plugins entry point (loaded in the main process, the engine, and every worker rank) and the dispatcher walks PATCH_REGISTRY. Each patch declares an applies_to version range and an apply method — a text edit at a unique source anchor, a class-rebind wrapper, or FastAPI middleware. Patches whose anchors match and whose range covers the live pin apply; the rest print [SKIP — applies_to mismatch] and no-op. The result is an in-memory overlay: the same wheel, transformed at boot, with a structured apply summary (applied=N skipped=M failed=0) and an audit trail. Nothing is written to the vLLM package tree.
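The apply loop described above can be sketched roughly as follows. This is a minimal illustration, not the Genesis implementation: the `Patch` fields, the integer dev-build encoding of `applies_to`, and the function names are all assumptions made for the example, and only the text-anchor patch kind is modelled.

```python
from dataclasses import dataclass


@dataclass
class Patch:
    """Hypothetical patch record; field names are illustrative only."""
    name: str
    applies_to: tuple[int, int]  # inclusive dev-build range (assumed encoding)
    anchor: str                  # source text expected to occur exactly once
    replacement: str


def apply_text_patch(source: str, patch: Patch, live_dev: int) -> tuple[str, str]:
    """Return (possibly edited source, status) without touching any file."""
    lo, hi = patch.applies_to
    if not (lo <= live_dev <= hi):
        return source, "skipped"  # applies_to mismatch: no-op
    if source.count(patch.anchor) != 1:
        return source, "skipped"  # anchor missing or ambiguous: no-op
    return source.replace(patch.anchor, patch.replacement), "applied"


def run_dispatcher(source: str, registry: list[Patch], live_dev: int):
    """Walk the registry in order and build an apply summary."""
    summary = {"applied": 0, "skipped": 0, "failed": 0}
    for patch in registry:
        source, status = apply_text_patch(source, patch, live_dev)
        summary[status] += 1
    return source, summary


# Usage: one patch covers the live build, the other does not.
registry = [
    Patch("p_in_range", (400, 450), "x = 1", "x = 2"),
    Patch("p_out_of_range", (100, 200), "y = 1", "y = 9"),
]
patched, summary = run_dispatcher("x = 1\ny = 1\n", registry, live_dev=424)
print(summary)
```

The input string stands in for module source held in memory; nothing is written back, which mirrors the overlay idea of transforming the same wheel at boot.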

Patch families. The 321 entries group into ~23 canonical families. The largest: attention.turboquant (k8v4 KV-cache quant), spec_decode (MTP / ngram speculative decoding), attention.gdn (hybrid Gated-DeltaNet linear attention), gemma4 (Gemma-4 enablement), kv_cache, compile_safety, worker, serving, tool_parsing, and moe. The full table is docs/PATCHES.md (curated) + docs/PATCHES_AUTO.md (generated from the registry).

Pin lifecycle. Genesis pins to one canonical vLLM nightly at a time, plus an optional previous pin held for rollback during validation — at most two ("≤2-pin policy"). A bump happens only on an explicit instruction naming the target pin; there are no proactive pulls. The candidate is validated before promotion (anchor-drift resolved, the bump-preflight gate clean, boot-smoke + tokenizer-fingerprint + canonical bench), then the old 2-back pin is dropped. Current: dev424 (3f5a1e173); rollback: dev301 (04c2a8dea). See docs/PIN_BUMP_PLAYBOOK.md (canonical) + docs/ANCHOR_SOT.md.
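The ≤2-pin rotation can be modelled with a very small data structure. The sketch below is an assumption-laden illustration (the `PinSet` class and its `promote` method are hypothetical names, and the validation gate is reduced to a boolean), not the project's bump tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PinSet:
    """Hypothetical holder for the current pin plus one rollback pin."""
    current: str
    rollback: Optional[str] = None

    def promote(self, candidate: str, validated: bool) -> list[str]:
        """Promote a validated candidate and return the pins that were dropped."""
        if not validated:
            raise ValueError("candidate must pass validation before promotion")
        dropped = [self.rollback] if self.rollback else []
        # The old current pin becomes the rollback; the old rollback falls off.
        self.rollback, self.current = self.current, candidate
        return dropped


# Usage, mirroring the dev301 -> dev424 bump described in this README.
pins = PinSet(current="dev301", rollback="dev148")
dropped = pins.promote("dev424", validated=True)
print(pins.current, pins.rollback, dropped)
```

Keeping exactly one rollback slot means a promotion always has a known-good pin to fall back to while never holding more than two pins at once.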

Model catalog (current registry).

| Model | Quant | KV cache | Spec-decode | Status |
|---|---|---|---|---|
| Qwen3.6-35B-A3B-FP8 | FP8 dense MoE | TurboQuant k8v4 | MTP K=5 | ✅ PROD (default) |
| Qwen3.6-27B-int4-AutoRound | INT4 AutoRound (hybrid GDN+Mamba) | TurboQuant k8v4 | MTP K=5 | ✅ PROD |
| Gemma-4-31B | INT4 / kv-auto | TurboQuant or uniform fp16 | MTP K=3 (separate drafter) | ⚙️ boots + patches apply; serving needs MM-budget config |
| DiffusionGemma-26B-A4B-FP8 | FP8-dynamic | block-diffusion MoE | TP=2 | ✅ serving at TP=2 |

Per-model deep-dives + the V2 layered config system: docs/MODELS.md. Hardware envelope: docs/HARDWARE.md.

Launching. Boot any model through a preset — the launcher resolves the preset, runs preflight, and renders the docker run (or podman / bare-metal / k8s) command for you with the correct pin, mounts, and env:

sndr launch prod-qwen3.6-35b-balanced            # boot a preset
sndr launch prod-qwen3.6-35b-balanced --dry-run  # inspect the rendered command, no boot

Full operator manual: docs/USAGE.md.
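To make the "resolve a preset, then render a command" idea concrete, here is a minimal sketch of rendering a `docker run` line from a preset dictionary. It is not how `sndr launch` is implemented: the preset schema, the `render_docker_run` helper, the image name, and the environment variable are all hypothetical placeholders. Only standard `docker run` options (`--rm`, `--gpus`, `-e`, `-v`) and vLLM's `--tensor-parallel-size` flag are used.

```python
import shlex


def render_docker_run(preset: dict) -> str:
    """Render a shell-safe docker run command from a (hypothetical) preset."""
    cmd = ["docker", "run", "--rm", "--gpus", "all"]
    # Environment variables, sorted so the rendered command is deterministic.
    for key, value in sorted(preset.get("env", {}).items()):
        cmd += ["-e", f"{key}={value}"]
    # Host-to-container bind mounts.
    for host_path, container_path in preset.get("mounts", []):
        cmd += ["-v", f"{host_path}:{container_path}"]
    cmd.append(preset["image"])
    # Everything after the image is passed to the container entrypoint.
    cmd += preset.get("args", [])
    return shlex.join(cmd)


# Placeholder preset: none of these values come from the real preset files.
example_preset = {
    "image": "example/vllm:pinned-tag",
    "env": {"EXAMPLE_FLAG": "1"},
    "mounts": [("/models", "/models")],
    "args": ["--tensor-parallel-size", "2"],
}
rendered = render_docker_run(example_preset)
print(rendered)
```

Rendering to a string first and executing it separately is what makes a dry-run mode cheap: the same function serves both inspection and boot.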

Headline numbers (v12.0.0 current registry)

Reference rig: 2× RTX A5000 24 GB (Ampere SM 8.6), driver 580.142, CUDA 13.0.2, MTP K=5 + TurboQuant k8v4, TP=2.

| Model | Stock vLLM | Genesis (v12.0.0) | Δ |
|---|---|---|---|
| Qwen3.6-35B-A3B-FP8 (single-conc, K=5) | ~157 t/s | 239.7 t/s | +53 % |
| Qwen3.6-35B-A3B-FP8 (8-way multi-conc, K=3) | n/a | ~675 t/s agg | 3.21× scaling |
| Qwen3.6-27B-int4-AutoRound (single-conc, K=5) | ~87 t/s | 127.4 t/s | +46 % |
| Tool-call clean rate (35B / 27B) | 2–6 / 10 | 7/7 · 8/8 | qualitative |

256K context hardware-verified on both models. Full methodology, historical comparisons, and per-rig reproduction recipes: docs/BENCHMARKS.md.
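The Δ column for the single-stream rows is plain relative gain over the stock figure. A short check, assuming the approximate stock values quoted above (~157 and ~87 t/s) are taken at face value:

```python
def percent_gain(baseline_tps: float, patched_tps: float) -> float:
    """Relative throughput gain of patched over baseline, in percent."""
    return (patched_tps / baseline_tps - 1.0) * 100.0


# (stock t/s, Genesis t/s) as quoted in the headline table.
rows = {
    "Qwen3.6-35B-A3B-FP8": (157.0, 239.7),
    "Qwen3.6-27B-int4-AutoRound": (87.0, 127.4),
}
gains = {model: round(percent_gain(stock, genesis))
         for model, (stock, genesis) in rows.items()}
print(gains)  # 53 and 46, matching the +53 % / +46 % column
```

Because the stock numbers are themselves rounded estimates, the percentages should be read as approximate.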

Sustained TPS — Genesis vs stock

Current pin (2026-06-25): the vLLM pin is 0.23.1rc1.dev424+g3f5a1e173 (image vllm/vllm-openai:nightly-3f5a1e173…, commit 3f5a1e173, +123 commits over dev301). dev301 (0.23.1rc1.dev301+g04c2a8dea, commit 04c2a8dea) is retained as the previous / rollback pin per the ≤2-pin policy; dev148 is dropped. The dev301→dev424 bump was the first to dogfood the anchor-SOT bump tooling (make bump-preflight + retire_impact) — see docs/PIN_BUMP_PLAYBOOK.md (canonical) and docs/ANCHOR_SOT.md. The per-model bench table below is the validated dev148 K=5 re-tune cycle (still the canonical sustained-bench evidence; decode carried forward across the dev301 and dev424 bumps with no regression).

Validated rig baseline — 2026-06-19 (measured on pin 0.23.1rc1.dev148+gb4c80ec0f)

Full model-cycle re-test on the reference 2× A5000 rig after the MTP K=3→K=5 re-tune. These are the canonical sustained-bench numbers; the pin has since bumped dev148 → dev301 → dev424 (current) with the decode results carried forward (no regression — anchor regen confirmed at each bump). Each model boots the Genesis apply pipeline, applies its patch set, and is benchmarked / smoke-tested live (tools/genesis_bench_suite.py, single-stream warm sweep). The 35B / 27B single-stream rows are the K=5 re-tune numbers; Gemma stays K=3 (its separate drafter is optimal at K=3).

| Model | Quant / KV | Patches | Decode TPS | Tool-call | Status |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B-FP8 | FP8 dense · TQ k8v4 · MTP K=5 | 95 | 239.7 (CV 4.9 %) | 7/7 | ✅ serving, +15.8 % vs K=3 |
| Qwen3.6-27B-int4-AutoRound | INT4 AutoRound · TQ k8v4 · MTP K=5 | 93 | 127.4 (CV 8.3 %) | 7/7 | ✅ serving, +8.2 % vs K=3 |
| Gemma-4-31B | INT4 · TQ k8v4 · MTP K=3 | 81 | | | ⚙️ boots + patches apply; serving needs MM-budget config (multimodal-bidirectional × spec-decode) |
| DiffusionGemma-26B-A4B-FP8 | FP8-dynamic · block-diffusion · TP=2 | 45 | | | coherent serving at TP=2; PN-FP8MOE-KPAD (Marlin N=352) + G4_26 (TP-vocab soft-embed); enforce-eager · max-num-seqs 2 · gpu-util 0.80 |

The 35B and 27B clear their historical peak band — the K=5 re-tune lifts single-stream decode to 239.7 / 127.4 t/s (+15.8 % / +8.2 % vs K=3) within CV → the v12 platform carries no decode regression. PN-FP8MOE-KPAD (backport of open vLLM PR #45703, model-agnostic Marlin-MoE intermediate-pad) plus G4_26 (backport of #45774, DiffusionGemma TP>1 vocab-sharded soft-embed all-gather) make DiffusionGemma the first block-diffusion FP8-MoE checkpoint to boot AND serve coherently at TP=2 on consumer Ampere without a kernel rebuild — validated 2026-06-17 (clears the Marlin N=352 thread-tile crash, then the probs @ embed_weight [131072,2816] TP-vocab shape mismatch; the coherent generation confirms the soft-embed all-gather yields correct TP=2 output).
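For readers unfamiliar with the "CV" figure and the "vs K=3" deltas, the sketch below shows one common way to compute them. It is an assumption that the bench suite uses the sample standard deviation; the per-run samples here are invented purely to exercise the function and are not bench data.

```python
import statistics


def coefficient_of_variation(samples: list[float]) -> float:
    """Sample standard deviation divided by the mean, in percent."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100.0


def implied_baseline(measured_tps: float, gain_pct: float) -> float:
    """Back out the baseline throughput implied by a quoted percentage gain."""
    return measured_tps / (1.0 + gain_pct / 100.0)


# Invented samples, only to show the CV calculation (mean 100, stdev 10).
hypothetical_runs = [100.0, 110.0, 90.0]
cv = coefficient_of_variation(hypothetical_runs)

# K=3 throughput implied by the quoted K=5 results and gains.
k3_35b = implied_baseline(239.7, 15.8)  # about 207.0 t/s
k3_27b = implied_baseline(127.4, 8.2)   # about 117.7 t/s
print(round(cv, 1), round(k3_35b, 1), round(k3_27b, 1))
```

A gain is only meaningful when it exceeds run-to-run noise, which is why each decode figure is quoted alongside its CV.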

Pick your path

| You have | Start here |
|---|---|
| 1× consumer card (3090 / 4090 / 5090 / A5000) | docs/SINGLE_CARD.md |
| 2× cards (TP=2, the reference topology) | docs/HARDWARE.md + docs/MODELS.md |
| A model not in the catalog | docs/MODELS.md (add-a-model + the V2 config system) |
| Brand-new / weighing self-host vs cloud | docs/GETTING_STARTED.md · docs/COMPARISONS.md |

Quick install

curl -sSL https://raw.githubusercontent.com/Sandermage/sndr_core_engine/main/install.sh | bash

The installer detects your OS / Python / GPU / vLLM presence, clones into ~/.sndr/, installs the plugin, writes a tailored launch script, and runs a 60-second smoke test. Five-minute walk-through and Day-1 acceptance steps: docs/QUICKSTART.md.

To pick a different vLLM pin, workload, or non-interactive flag set: docs/INSTALL.md.

Documentation map

| If you want to... | Read |
|---|---|
| One-page operator manual (installer → launcher → configs → patches) | docs/USAGE.md |
| Install + first boot | docs/INSTALL.md → docs/QUICKSTART.md |
| Browse sndr commands | docs/CLI_REFERENCE.md |
| Pick a model + hardware combo | docs/MODELS.md + docs/HARDWARE.md |
| Tune an env-var flag | docs/CONFIGURATION.md |
| Browse the patch catalogue + compatibility matrix | docs/PATCHES.md |
| Diagnose an OOM, cliff, or boot failure | docs/TROUBLESHOOTING.md |
| Roll a broken release back | docs/TROUBLESHOOTING.md |
| See current bench numbers + reproduce | docs/BENCHMARKS.md |
| Author a patch or community plugin | docs/CONTRIBUTING.md |
| Sponsorship / hardware loan / business invoicing | docs/SPONSORS.md |
| Disclose a security issue | SECURITY.md |

Full docs index: docs/README.md.

Repository structure

The layout separates the shippable engine from the maintainer tooling and vendored third-party code, so the published wheel stays small and the apply pipeline stays auditable.

| Path | What it is |
|---|---|
| sndr/ | The engine. The PATCH_REGISTRY + dispatcher, the apply pipeline (text-anchor / class-rebind / middleware patchers), per-engine patch sets (sndr/engines/vllm/...), the V2 layered model-config system, the universal launcher, the CLI (sndr/genesis), and the read-only product API the GUI consumes. This is the only tree the Apache wheel ships. |
| gui/ | The control center: a desktop/web front-end (gui/web, gui/desktop) that drives the sndr product API to launch presets, inspect the live apply summary, browse the patch catalogue, run benches, and manage remote hosts. Built static assets are served by the product API. |
| tests/ | The pytest suite (13k+ collected). Unit tests per subsystem under tests/unit/..., contract/bundle/proof tests, and the load-bearing CI gate. Excluded from the wheel. |
| docs/ | All public documentation (USAGE, INSTALL, MODELS, HARDWARE, PATCHES, BENCHMARKS, the pin-bump playbook, anchor SOT, …). docs/README.md is the index. |
| scripts/ + tools/ | Maintainer tooling: the audit gates (make gates), doc-sync / link / attribution / drift checkers, anchor-SOT regeneration, bench harnesses, and pin-bump preflight. Not shipped in the wheel. |
| third_party/ | Vendored upstream kernel source (a curated subset of TurboMind's int4 grouped-MoE GEMM, used by the experimental G4_85 MoE kernel patch). See third_party/tm_int4_moe/README.md for provenance + license. |
| compose/ | Reference docker-compose files for the canonical prod presets (35B / 27B, single- and multi-concurrency, long-context). |
| benchmarks/ + evidence/ | Bench harness/data and per-patch proof artefacts (evidence/patch_proof/) plus the A/B validation evidence the registry cites for default-on/off decisions. |
| schemas/ + plugins/ + assets/ + release/ | JSON schemas (patch-entry, config), community plugin samples, README/chart/logo assets, and release artefacts (SBOM, constraints). |
| pyproject.toml | Single source of truth for packaging and all tool config: [tool.pytest.ini_options], [tool.ruff], [tool.mypy], and the setuptools package layout. |
| Makefile | The maintainer entry point: make gates (CI gates), make test, make docs, make gui-build, pin-bump preflight, audits. |

Contributing

Bug reports, new patches with empirical evidence, new model recipes, and cross-rig bench reports are all welcome. The full workflow (anchor conventions, lifecycle ratchet, pin-bump playbook, PR template) is in docs/CONTRIBUTING.md. Security disclosures go through SECURITY.md.

Ecosystem / Related

  • vLLM — the upstream engine SNDR Core patches. Genesis is an overlay, not a fork; patches retire as upstream merges the underlying fix.
  • club-3090 — community, multi-engine (vLLM · llama.cpp · ik_llama) serving recipes for consumer GPUs. Complementary to this repo and cross-references Genesis in its TQ3_MTP_GENESIS.md. Where SNDR Core fits: it is the deep, single-stack vLLM patch engine; club-3090 is the broad multi-engine recipe hub. If you want the widest engine/model menu, start there; if you want the fastest, most-patched vLLM path on Ampere/Ada/Blackwell, you're in the right place.

Credits + license

Apache-2.0 (see LICENSE). Per-patch attribution and upstream PR linkage in docs/CREDITS.md.

Author: Sandermage (Aleksandr Barzov), Odessa, Ukraine. Sponsorship channels (voluntary, no obligations) and hardware-loan contact: docs/SPONSORS.md.

About

SNDR Core Engine (Genesis) — vLLM runtime patch-overlay for Qwen3.6 + Gemma4 on consumer NVIDIA (Ampere sm_86, 2× A5000/3090). Qwen3.6-35B-A3B FP8 ~240 tok/s, 27B-int4 hybrid GDN+Mamba, Gemma4 26B/31B AWQ, 256K ctx. 321 patches: TurboQuant k8v4 KV, MTP/DFlash spec-decode, FULL cudagraph, hybrid GDN. vLLM pin dev424 + Control Center GUI.
