GGML in Practice

Building Portable, High-Performance Inference Engines in C/C++

A technical manuscript on the ggml tensor library and the llama.cpp inference engine. Twenty chapters and four appendices, ~95,000 words, written in O'Reilly-style prose.

What this is

A complete book covering ggml from first principles to production deployment:

The tensor model, memory planning, and the GGUF file format.
Quantization arithmetic, the K-quant and I-quant zoos, and writing a new quantization type from scratch.
The CPU backend, SIMD intrinsics across AVX-512 / AVX2 / NEON / SVE / RVV, and where BLAS fits in.
The backend abstraction and the heterogeneous backends — CUDA, Metal, Vulkan, SYCL, ROCm, CANN.
FlashAttention, KV-cache quantization, and speculative decoding.
Profiling and optimization workflow with an eight-bottleneck catalog.
A deep dive into llama.cpp itself — the model graph, tokenizers, KV cache, sampler chain, server, multimodal, LoRA, and how to add a new architecture.
The broader ecosystem (Whisper, Stable Diffusion, Ollama, LM Studio, bindings).
Honest comparison against PyTorch, vLLM, TensorRT-LLM, MLX, ONNX Runtime, ExecuTorch, and the rest.
Production lessons: quantization decisions, latency budgets, observability, safety, licensing.
The road ahead and four appendices (C/C++ idioms, linear algebra cheatsheet, exotic build targets, reading list).

Layout

preface.md
ch01_why_ggml_exists.md ... ch20_road_ahead.md
appendix_a_c_idioms.md ... appendix_d_reading_list.md
BOOK_PLAN.md                ← the working outline
build_pdf.py                ← single-file → PDF builder
ggml_in_practice.pdf        ← the rendered book (244 pages)

Building the PDF

The PDF is rebuilt from the markdown sources with headless Chrome, so a Chrome installation must be available on the build machine:

pip install markdown
python3 build_pdf.py

Output: ggml_in_practice.pdf (~3.7 MB, 244 pages, A4).

The renderer uses Python's markdown library to produce HTML, then headless Chrome to print it to PDF with print-quality CSS (Georgia for body, Helvetica Neue for headings, monospace for code, pageable sidebars and tables).
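The two-stage pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the contents of build_pdf.py: the helper names (ordered_sources, render_html, chrome_command, build), the intermediate book.html file, the book.pdf output name, the choice of markdown extensions, and the google-chrome binary name are all assumptions made for the example.

```python
import re
import subprocess
from pathlib import Path


def ordered_sources(names):
    """Sort manuscript files as preface, chapters, then appendices.

    Other files (README.md, BOOK_PLAN.md, non-markdown files) are dropped.
    """
    def key(name):
        if name.startswith("preface"):
            return (0, name)
        if re.match(r"ch\d+_", name):
            return (1, name)  # zero-padded chapter numbers sort as text
        if name.startswith("appendix_"):
            return (2, name)
        return (3, name)

    wanted = [n for n in names if n.endswith(".md") and key(n)[0] < 3]
    return sorted(wanted, key=key)


def render_html(markdown_texts, css=""):
    """Convert an iterable of markdown strings into one HTML document."""
    # Imported here so the helpers above and below work without the
    # third-party dependency (pip install markdown).
    import markdown

    body = "\n".join(
        markdown.markdown(text, extensions=["tables", "fenced_code"])
        for text in markdown_texts
    )
    return (
        "<!doctype html><html><head><meta charset='utf-8'>"
        f"<style>{css}</style></head><body>{body}</body></html>"
    )


def chrome_command(chrome, html_path, pdf_path):
    """Build the argument list for a headless Chrome print-to-PDF run."""
    return [
        chrome,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={pdf_path}",
        Path(html_path).resolve().as_uri(),
    ]


def build(src_dir=".", chrome="google-chrome", out="book.pdf"):
    """Hypothetical end-to-end run: markdown -> HTML -> PDF."""
    src = Path(src_dir)
    files = ordered_sources(p.name for p in src.glob("*.md"))
    html = render_html((src / f).read_text(encoding="utf-8") for f in files)
    html_path = src / "book.html"
    html_path.write_text(html, encoding="utf-8")
    pdf_path = (src / out).resolve()
    subprocess.run(chrome_command(chrome, html_path, pdf_path), check=True)
```

Calling build() from the repository root would, under these assumptions, write book.html and book.pdf next to the sources. Page size, fonts, and sidebar handling would live in the CSS string passed to render_html; the real script may structure this differently.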

Reading order

The chapters are designed to be read in order, but readers with a specific goal can start with one of these subsets:

Minimum viable subset for shipping: Chapters 2, 5, 7, and 16 (setup, GGUF, choosing a quant, llama.cpp).
Performance engineer: Chapters 4, 9, 10, 14, 15.
Backend porting: Chapters 3, 12, 13.
Comparing against alternatives: Chapters 17, 18.
Production checklist: Chapter 19.

License

The manuscript is released under CC-BY-4.0. The accompanying code (build_pdf.py) is released under the MIT license.

Errata and updates

ggml and llama.cpp are moving targets. The principles in this book change slowly; specific kernel names, file paths, and quant tables may have shifted by the time you read it. Pull requests for corrections are welcome.
