Adaptive Speculative Decoding for LLM Inference Optimization
An implementation of speculative decoding with adaptive lookahead (γ), demonstrating 1.5-2.5x inference speedups on transformer models.
Autoregressive LLM decoding is memory-bandwidth bound, not compute-bound: for every generated token, most of the time goes to moving model weights from VRAM to the compute units, even for trivial continuations like "the" or "and".
Speculative decoding uses a small, fast "draft" model to propose γ tokens, then verifies all of them in a single parallel forward pass through the large "target" model.
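Leviathan et al. (2023) quantify the payoff: assuming each drafted token is accepted independently with rate α, the expected number of tokens produced per verification pass is

$$
\mathbb{E}[\text{tokens per pass}] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha},
$$

so the acceptance rate α, not raw draft-model speed, determines how large a lookahead γ is worth using.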
This implementation adds adaptive γ (lookahead length), adjusted on the fly (a minimal controller sketch follows this list):
- High acceptance rate → increase γ (be aggressive)
- Low acceptance rate → decrease γ (be conservative)
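One way to implement that rule is a small controller that tracks a smoothed acceptance rate and clamps γ to a fixed range. The sketch below is illustrative only: the class name `AdaptiveGamma`, the thresholds, and the EMA factor are assumptions, not necessarily what the notebook uses.

```python
class AdaptiveGamma:
    """Hypothetical controller: track an exponential moving average of the
    acceptance rate and nudge the lookahead gamma up or down."""

    def __init__(self, gamma=4, gamma_min=1, gamma_max=8, ema=0.9,
                 raise_at=0.8, lower_at=0.4):
        self.gamma = gamma
        self.gamma_min, self.gamma_max = gamma_min, gamma_max
        self.ema = ema                     # smoothing factor for the running rate
        self.raise_at, self.lower_at = raise_at, lower_at
        self.acceptance = 0.5              # running estimate of the acceptance rate

    def update(self, n_accepted: int, n_drafted: int) -> int:
        """Call once per draft/verify iteration; returns the gamma to use next."""
        rate = n_accepted / max(n_drafted, 1)
        self.acceptance = self.ema * self.acceptance + (1 - self.ema) * rate
        if self.acceptance > self.raise_at:      # draft is doing well -> be aggressive
            self.gamma = min(self.gamma + 1, self.gamma_max)
        elif self.acceptance < self.lower_at:    # many rejections -> be conservative
            self.gamma = max(self.gamma - 1, self.gamma_min)
        return self.gamma
```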
┌─────────────────────────────────────────────────────────────┐
│ Speculative Decoding Loop │
├─────────────────────────────────────────────────────────────┤
│ 1. DRAFT: Small model generates γ tokens (fast) │
│ 2. VERIFY: Large model scores all γ tokens in ONE pass │
│ 3. ACCEPT/REJECT: Rejection sampling per Leviathan et al. │
│ 4. ROLLBACK: Rewind KV cache to valid state │
│ 5. ADAPT: Adjust γ based on acceptance rate │
└─────────────────────────────────────────────────────────────┘
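Steps 2 and 3 are the speculative sampling rule from Leviathan et al.: each drafted token is accepted with probability min(1, p_target/p_draft), and the first rejection is replaced by a sample from the renormalized residual max(0, p_target - p_draft). Below is a minimal sketch of that rule; the function name and tensor layout are assumptions, and the notebook's implementation may differ.

```python
import torch

def verify_draft(p_target, q_draft, draft_tokens):
    """Accept/reject drafted tokens per Leviathan et al. (2023).

    p_target:     (gamma + 1, vocab) target probabilities; the extra row is the
                  position right after the last drafted token.
    q_draft:      (gamma, vocab) draft probabilities at the drafted positions.
    draft_tokens: (gamma,) token ids proposed by the draft model.
    Returns the accepted tokens plus one corrective or bonus token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i, tok], q_draft[i, tok]
        # Accept token i with probability min(1, p/q); this keeps the overall
        # output distributed exactly as the target model alone would sample.
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            out.append(int(tok))
            continue
        # First rejection: resample from the residual max(0, p - q), renormalized,
        # discard the remaining draft tokens, and stop.
        residual = torch.clamp(p_target[i] - q_draft[i], min=0.0)
        out.append(int(torch.multinomial(residual / residual.sum(), 1)))
        return out
    # Every draft token accepted: one free "bonus" token comes from the target's
    # distribution at the next position.
    out.append(int(torch.multinomial(p_target[-1], 1)))
    return out
```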
- KVCacheManager: Pre-allocated key-value cache with O(1) rollback (sketched after this list)
- Rejection Sampling: Mathematically guaranteed to match target distribution
- Adaptive γ: Dynamic lookahead based on real-time acceptance rates
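The O(1) rollback claim comes from pre-allocating the cache once and treating a length pointer as the source of truth: rejected positions are simply abandoned, never copied or freed. A sketch of that idea follows; the tensor layout, method names, and defaults are illustrative assumptions rather than the notebook's exact code.

```python
import torch

class KVCacheManager:
    """Sketch: pre-allocated KV cache where rollback is just moving a pointer."""

    def __init__(self, n_layers, n_heads, head_dim, max_len,
                 dtype=torch.float16, device="cuda"):
        # Dimension 2 holds keys and values; batch size is fixed at 1 for simplicity.
        shape = (n_layers, 2, 1, n_heads, max_len, head_dim)
        self.cache = torch.zeros(shape, dtype=dtype, device=device)
        self.length = 0  # number of valid positions currently in the cache

    def append(self, keys, values):
        """Write new K/V of shape (n_layers, n_heads, t, head_dim) after the valid region."""
        t = keys.shape[-2]
        self.cache[:, 0, 0, :, self.length:self.length + t] = keys
        self.cache[:, 1, 0, :, self.length:self.length + t] = values
        self.length += t

    def rollback(self, n_rejected):
        """O(1) rollback: mark rejected positions invalid by moving the pointer;
        nothing is copied, reallocated, or freed."""
        self.length = max(self.length - n_rejected, 0)

    def valid(self):
        """Return the valid slice to feed into the next forward pass."""
        return self.cache[:, :, :, :, :self.length]
```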
Open velocity_demo.ipynb in Google Colab (free T4 GPU) and run all cells.
# Local setup
pip install -r requirements.txt
jupyter notebook velocity_demo.ipynb

| Role | Model | Parameters |
|---|---|---|
| Draft | distilgpt2 | 82M |
| Target | gpt2-medium | 355M |
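Both models share the GPT-2 tokenizer and vocabulary, which is what makes verifying draft tokens against target logits straightforward. If you want to experiment with the pair outside the notebook, a typical Hugging Face loading snippet looks like the one below (the notebook's dtype, device placement, and sampling settings may differ).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Both checkpoints use the GPT-2 vocabulary, so one tokenizer serves both models.
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2").to(device).eval()
target = AutoModelForCausalLM.from_pretrained("gpt2-medium").to(device).eval()
```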
- Leviathan et al. (2023) - "Fast Inference from Transformers via Speculative Decoding"
- Chen et al. (DeepMind, 2023) - "Accelerating Large Language Model Decoding with Speculative Sampling"
- SpecDec++ (2024) - Adaptive candidate lengths
Matt McManus