TriLite.jl

Pure-Julia CPU inference engine for BitNet b1.58 ternary LLMs.

TriLite lets you run BitNet b1.58 models — large language models with {-1, 0, +1} ternary weights — on any CPU, with zero C/C++ dependencies. Built entirely in Julia. Targets Llama-class architectures (GQA, RoPE, SwiGLU, RMSNorm).

What is BitNet b1.58?

BitNet b1.58 is a ternary LLM architecture by Microsoft Research where every weight is constrained to one of three values: -1, 0, or +1 (1.58 bits per parameter on average). This replaces the standard FP16/BF16 weights used in models like Llama or GPT.

Why it matters:

Memory: A 2B-parameter model fits in ~400 MB (vs ~4 GB for FP16)
Compute: Multiply-accumulate becomes integer addition — no floating-point multipliers
Energy: ASIC estimates show 10–100× power reduction over FP16 matmul
Quality: Matches full-precision perplexity at the same parameter count

Ternary quantization is not just for inference — the model is trained from scratch at b1.58 precision, unlike post-training quantization approaches.

The Julia Implementation

Most BitNet inference engines (microsoft/BitNet, llama.cpp) are written in C++. TriLite is a pure-Julia reimplementation with three kernel backends:

Kernel	Mechanism	When to use
Reference	Scalar `Int8 × Float32`	Correctness baseline
MAD	`vpmaddubsw` SIMD pairs	Prefill (compute-bound)
LUT	`pshufb` table lookup	Decode (memory-bound)

The engine is dispatch-based: a single matmul!(DEFAULT_KERNEL, ...) call selects the optimal kernel at compile time via Julia's trait dispatch. No runtime branching.

Other implementation details:

GGUF v3 model loader with tensor_map for O(1) tensor lookup
GPT-2 BPE tokenizer extracted directly from GGUF metadata (no separate file needed)
RoPE, RMSNorm, ReLU² activation, group-query attention — all in pure Julia
Platform dispatch via Sys.ARCH — x86_64 (AVX2) and aarch64 (NEON) intrinsics

Features

Pure Julia: Zero C/C++ FFI — compiles with the Julia JIT
CPU-only: No GPU required. x86-64 (AVX2) and ARM64 (NEON)
Ternary weights: Uses 2-bit GGUF format (GGML dtype 36)
Chat interface: chat(model) for interactive generation
GGUF v3: Loads standard BitNet GGUF files from HuggingFace
Internal tokenizer: GPT-2 BPE extracted from GGUF metadata

Quick Start

using TriLite

# Load model
model = load_model("path/to/bitnet-2B.gguf")

# Generate text
response = generate(model, "What is the capital of France?")
println(response)

# Interactive chat
chat(model)

Installation

using Pkg
Pkg.add("TriLite")

Or from source:

cd TriLite.jl
julia --project -e 'using Pkg; Pkg.instantiate()'

Requirements

Julia 1.12+ (recommended)
x86-64 CPU with AVX2 support, or ARM64 CPU with NEON
4GB+ RAM for 2B model, 16GB+ for 8B model

Model Download

Download BitNet models from HuggingFace:

# 2B model (~400MB)
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf

# 8B model (~2GB)
huggingface-cli download microsoft/BitNet-b1.58-8B-4T-gguf

Architecture

TriLite.jl/
├── src/
│   ├── TriLite.jl          # Main module
│   ├── types.jl           # Core data types
│   ├── intrinsics/
│   │   ├── detection.jl   # CPU feature detection
│   │   ├── x86_64.jl      # AVX2 intrinsics
│   │   ├── aarch64.jl     # NEON intrinsics
│   │   └── fallback.jl    # Scalar fallback
│   ├── kernels/
│   │   ├── utils.jl       # Kernel utilities
│   │   ├── dispatch.jl    # Trait-based kernel dispatch
│   │   ├── matmul_ref.jl  # Reference kernel
│   │   ├── matmul_mad.jl  # MAD kernel
│   │   └── matmul_lut.jl  # LUT kernel
│   ├── gguf/
│   │   ├── loader.jl      # GGUF file parser
│   │   ├── repack.jl      # Weight repacking
│   │   └── tokenizer.jl   # BPE tokenizer
│   ├── transformer/
│   │   ├── attention.jl   # Multi-head attention
│   │   ├── ffn.jl         # Feed-forward network
│   │   ├── layer.jl       # Transformer layer
│   │   └── model.jl       # Full model
│   ├── sampling.jl        # Token sampling
│   └── io.jl              # Chat interface
├── test/
│   ├── runtests.jl        # Test runner
│   ├── test_intrinsics.jl # Intrinsic tests
│   ├── test_kernels.jl    # Kernel tests
│   └── test_model.jl      # Model tests
└── Project.toml

Kernels

Reference Kernel

Scalar implementation for testing and validation.

MAD Kernel

Uses vpmaddubsw for 2× throughput on x86-64. Best for: Prefill phase (compute-bound).

LUT Kernel

Uses pshufb for table lookup. Best for: Decode phase (memory-bound).

Performance

Currently uses the reference (scalar) kernel by default — ~0.015 tok/s on a 30-layer model (~1 min per token). The MAD and LUT SIMD kernels are implemented and tested but need activation quantization wired into the model pipeline for a ~50–100× speedup. Contributions welcome.

Testing

cd TriLite.jl
julia --project=. test/runtests.jl

License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
examples		examples
src		src
test		test
.gitignore		.gitignore
AGENTS.md		AGENTS.md
Project.toml		Project.toml
README.md		README.md
chat.jl		chat.jl
check_meta2.jl		check_meta2.jl
check_metadata.jl		check_metadata.jl
check_shapes.jl		check_shapes.jl
debug_test.jl		debug_test.jl
probe_tensors.jl		probe_tensors.jl
run_kernels.jl		run_kernels.jl
run_tests.jl		run_tests.jl
test_gen.jl		test_gen.jl
test_gguf_probe.jl		test_gguf_probe.jl
test_i2s.jl		test_i2s.jl
test_model_load.jl		test_model_load.jl
test_tok.jl		test_tok.jl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TriLite.jl

What is BitNet b1.58?

The Julia Implementation

Features

Quick Start

Installation

Requirements

Model Download

Architecture

Kernels

Reference Kernel

MAD Kernel

LUT Kernel

Performance

Testing

License

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TriLite.jl

What is BitNet b1.58?

The Julia Implementation

Features

Quick Start

Installation

Requirements

Model Download

Architecture

Kernels

Reference Kernel

MAD Kernel

LUT Kernel

Performance

Testing

License

References

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages