olyasir/ggml_book

GGML in Practice

Building Portable, High-Performance Inference Engines in C/C++

A technical manuscript on the ggml tensor library and the llama.cpp inference engine. Twenty chapters and four appendices, ~95,000 words, written in O'Reilly-style prose.

What this is

A complete book covering ggml from first principles to production deployment:

  • The tensor model, memory planning, and the GGUF file format.
  • Quantization arithmetic, the K-quant and I-quant zoos, and writing a new quantization type from scratch.
  • The CPU backend, SIMD intrinsics across AVX-512 / AVX2 / NEON / SVE / RVV, and where BLAS fits in.
  • The backend abstraction and the heterogeneous backends — CUDA, Metal, Vulkan, SYCL, ROCm, CANN.
  • FlashAttention, KV-cache quantization, and speculative decoding.
  • Profiling and optimization workflow with an eight-bottleneck catalog.
  • A deep dive into llama.cpp itself — the model graph, tokenizers, KV cache, sampler chain, server, multimodal, LoRA, and how to add a new architecture.
  • The broader ecosystem (Whisper, Stable Diffusion, Ollama, LM Studio, bindings).
  • Honest comparison against PyTorch, vLLM, TensorRT-LLM, MLX, ONNX Runtime, ExecuTorch, and other alternatives.
  • Production lessons: quantization decisions, latency budgets, observability, safety, licensing.
  • The road ahead and four appendices (C/C++ idioms, linear algebra cheatsheet, exotic build targets, reading list).

Layout

preface.md
ch01_why_ggml_exists.md ... ch20_road_ahead.md
appendix_a_c_idioms.md ... appendix_d_reading_list.md
BOOK_PLAN.md                ← the working outline
build_pdf.py                ← single-file markdown-to-PDF builder
ggml_in_practice.pdf        ← the rendered book (244 pages)

Building the PDF

The PDF is rebuilt from the markdown sources via headless Chrome:

pip install markdown
python3 build_pdf.py

Output: ggml_in_practice.pdf (~3.7 MB, 244 pages, A4).

The renderer uses Python's markdown library to produce HTML, then headless Chrome to print it to PDF with print-quality CSS (Georgia for body, Helvetica Neue for headings, monospace for code, pageable sidebars and tables).
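The two-stage pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the contents of build_pdf.py: the helper names (ordered_sources, chrome_command, build), the intermediate file book_sketch.html, the output name book_sketch.pdf, and the list of Chrome binary names searched on PATH are all assumptions made for the example, and the print CSS is omitted.

```python
import shutil
import subprocess
from pathlib import Path


def ordered_sources(names):
    """Keep manuscript files only, ordered preface, chapters, appendices.

    Hypothetical helper: relies on the zero-padded chNN_ prefix so that a
    plain lexicographic sort yields chapter order.
    """
    def rank(name):
        if name.startswith("preface"):
            return (0, name)
        if name.startswith("ch"):
            return (1, name)
        if name.startswith("appendix_"):
            return (2, name)
        return (3, name)  # e.g. BOOK_PLAN.md, not part of the book body

    return sorted((n for n in names if rank(n)[0] < 3), key=rank)


def chrome_command(chrome, html_path, pdf_path):
    """Build the headless-Chrome argument list that prints HTML to PDF."""
    return [
        chrome,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={pdf_path}",
        Path(html_path).resolve().as_uri(),  # Chrome expects a URL
    ]


def build(src_dir=".", out_pdf="book_sketch.pdf"):
    """Concatenate the markdown sources, render HTML, print to PDF."""
    import markdown  # third-party: pip install markdown

    src = Path(src_dir)
    names = ordered_sources(p.name for p in src.glob("*.md"))
    text = "\n\n".join((src / n).read_text(encoding="utf-8") for n in names)
    body = markdown.markdown(text, extensions=["tables", "fenced_code"])

    html_path = src / "book_sketch.html"
    html_path.write_text(
        f"<!doctype html><meta charset='utf-8'><body>{body}</body>",
        encoding="utf-8",
    )

    # Assumed binary names; adjust for the local installation.
    chrome = (
        shutil.which("google-chrome")
        or shutil.which("chromium")
        or shutil.which("chrome")
    )
    if chrome is None:
        raise RuntimeError("no Chrome/Chromium binary found on PATH")
    subprocess.run(chrome_command(chrome, html_path, out_pdf), check=True)
```

Calling build() from the repository root would write book_sketch.pdf next to the sources, provided the markdown package and a Chrome or Chromium binary are installed. Keeping the command construction in its own function makes the Chrome invocation easy to inspect without launching a browser.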

Reading order

The chapters are designed to be read in order, but several legitimate subsets exist:

  • Minimum viable subset for shipping: Chapters 2, 5, 7, 16 (setup, GGUF, choosing a quant, llama.cpp).
  • Performance engineer: Chapters 4, 9, 10, 14, 15.
  • Backend porting: Chapters 3, 12, 13.
  • Comparing against alternatives: Chapters 17, 18.
  • Production checklist: Chapter 19.

License

The manuscript is released under CC-BY-4.0. The accompanying code (build_pdf.py) is MIT.

Errata and updates

ggml and llama.cpp are moving targets. The principles in this book change slowly; specific kernel names, file paths, and quant tables may have shifted by the time you read it. Pull requests for corrections are welcome.
