Building Portable, High-Performance Inference Engines in C/C++
A technical manuscript on the ggml tensor library and the llama.cpp inference engine. Twenty chapters and four appendices, ~95,000 words, written in O'Reilly-style prose.
A complete book covering ggml from first principles to production deployment:
- The tensor model, memory planning, and the GGUF file format.
- Quantization arithmetic, the K-quant and I-quant zoos, and writing a new quantization type from scratch.
- The CPU backend, SIMD intrinsics across AVX-512 / AVX2 / NEON / SVE / RVV, and where BLAS fits in.
- The backend abstraction and the heterogeneous backends — CUDA, Metal, Vulkan, SYCL, ROCm, CANN.
- FlashAttention, KV-cache quantization, and speculative decoding.
- Profiling and optimization workflow with an eight-bottleneck catalog.
- A deep dive into llama.cpp itself: the model graph, tokenizers, KV cache, sampler chain, server, multimodal, LoRA, and how to add a new architecture.
- The broader ecosystem (Whisper, Stable Diffusion, Ollama, LM Studio, bindings).
- Honest comparison against PyTorch, vLLM, TensorRT-LLM, MLX, ONNX Runtime, ExecuTorch, and the rest.
- Production lessons: quantization decisions, latency budgets, observability, safety, licensing.
- The road ahead and four appendices (C/C++ idioms, linear algebra cheatsheet, exotic build targets, reading list).
preface.md
ch01_why_ggml_exists.md ... ch20_road_ahead.md
appendix_a_c_idioms.md ... appendix_d_reading_list.md
BOOK_PLAN.md ← the working outline
build_pdf.py ← single-file → PDF builder
ggml_in_practice.pdf ← the rendered book (244 pages)
The PDF is rebuilt from the markdown sources via headless Chrome:
pip install markdown
python3 build_pdf.py
Output: ggml_in_practice.pdf (~3.7 MB, 244 pages, A4).
The renderer uses Python's markdown library to produce HTML, then headless Chrome to print it to PDF with print-quality CSS (Georgia for body, Helvetica Neue for headings, monospace for code, pageable sidebars and tables).
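The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the contents of the actual build_pdf.py: the helper names (`ordered_sources`, `render_html`, `chrome_pdf_command`, `build`), the CSS, the `google-chrome` binary name, and the intermediate `book.html` file are all assumptions made for the example. It relies only on `markdown.markdown()` from the `markdown` package and Chrome's `--headless` and `--print-to-pdf` flags.

```python
import re
import subprocess
from pathlib import Path

# Assumed print stylesheet; the real one is presumably more elaborate.
PRINT_CSS = """
@page { size: A4; margin: 2cm; }
body { font-family: Georgia, serif; }
h1, h2, h3 { font-family: "Helvetica Neue", Helvetica, sans-serif; }
pre, code { font-family: monospace; }
"""

def ordered_sources(names):
    """Return manuscript files in reading order: preface, chapters, appendices.

    Files that match none of the three patterns (e.g. BOOK_PLAN.md) are skipped.
    """
    def key(name):
        if name.startswith("preface"):
            return (0, 0)
        m = re.match(r"ch(\d+)_", name)
        if m:
            return (1, int(m.group(1)))
        m = re.match(r"appendix_([a-z])_", name)
        if m:
            return (2, ord(m.group(1)))
        return (3, 0)  # not part of the book body

    book_files = [n for n in names if n.endswith(".md") and key(n)[0] < 3]
    return sorted(book_files, key=key)

def render_html(markdown_text, css=PRINT_CSS):
    """Convert markdown to a standalone HTML page with print CSS."""
    # Imported lazily so the other helpers work without the dependency.
    import markdown  # third-party: pip install markdown
    body = markdown.markdown(markdown_text, extensions=["tables", "fenced_code"])
    return (
        "<!doctype html><html><head><meta charset='utf-8'>"
        f"<style>{css}</style></head><body>{body}</body></html>"
    )

def chrome_pdf_command(chrome, html_path, pdf_path):
    """Build the headless Chrome invocation that prints an HTML file to PDF."""
    return [
        chrome,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={pdf_path}",
        Path(html_path).resolve().as_uri(),
    ]

def build(src_dir=".", chrome="google-chrome", out="ggml_in_practice.pdf"):
    """Concatenate the sources, render HTML, and print it to PDF."""
    src = Path(src_dir)
    names = ordered_sources(p.name for p in src.iterdir())
    text = "\n\n".join((src / n).read_text(encoding="utf-8") for n in names)
    html_path = src / "book.html"  # assumed intermediate file name
    html_path.write_text(render_html(text), encoding="utf-8")
    subprocess.run(chrome_pdf_command(chrome, html_path, out), check=True)
```

Calling `build()` from the repository root would run the whole pipeline, assuming a Chrome or Chromium binary is on the PATH under the name given. Keeping the ordering and command construction in separate functions makes them easy to check without launching a browser.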
The chapters are designed to be read in order, but several shorter reading paths also work:
- Minimum viable subset for shipping: Chapters 2, 5, 7, 16. (Setup, GGUF, choosing a quant, llama.cpp.)
- Performance engineer: Chapters 4, 9, 10, 14, 15.
- Backend porting: Chapters 3, 12, 13.
- Comparing against alternatives: Chapters 17, 18.
- Production checklist: Chapter 19.
The manuscript is released under CC-BY-4.0. The accompanying code (build_pdf.py) is released under the MIT License.
ggml and llama.cpp are moving targets. The principles in this book change slowly; specific kernel names, file paths, and quant tables may have shifted by the time you read it. Pull requests for corrections are welcome.