In this repository I have implemented speculative decoding, a technique to accelerate large language model (LLM) inference by combining a fast draft model with a larger target model.
The project explores speculative decoding as a drop-in replacement for traditional greedy / autoregressive decoding, focusing on:
- Latency reduction
- Token acceptance efficiency
- Minimal quality degradation
Standard autoregressive decoding in LLMs is slow because:
- Tokens are generated sequentially
- Each step requires a full forward pass of a large model
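To make the cost concrete, here is a minimal sketch of standard greedy autoregressive decoding. The `target_logits` function is a hypothetical stand-in for a full forward pass of the large target model; note that every generated token pays for one such pass.

```python
# Minimal sketch of greedy autoregressive decoding.
# `target_logits` is a hypothetical stand-in for a full forward pass
# of the large target model, returning one logit per vocabulary entry.

def greedy_decode(target_logits, prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = target_logits(tokens)  # expensive: one full forward pass per token
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens
```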
Speculative decoding addresses this by:
- Using a smaller, faster model to propose multiple tokens
- Verifying those tokens in parallel using the larger model
- Accepting valid tokens and rejecting incorrect ones
This significantly reduces the number of expensive forward passes.
## How Speculative Decoding Works
- Draft model proposes a block of tokens
- Target model evaluates the proposed tokens in parallel
- Each proposed token is checked against an acceptance criterion based on the target model's probabilities (in the standard scheme, a draft token is accepted with probability min(1, p_target/p_draft))
- On the first rejection, the remaining draft tokens are discarded and the next token is taken from the target model instead
This preserves correctness, since the accepted output matches what the target model alone would produce, while improving throughput.
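The steps above can be sketched as a single speculative decoding step. This is a simplified greedy-verification variant, not the repository's implementation: `draft_next` and `target_verify` are hypothetical stand-ins for the draft model's next-token call and for one batched forward pass of the target model that scores all proposed positions at once.

```python
# One speculative decoding step (greedy-verification sketch, assumed API).
# draft_next(ctx)              -> draft model's next token for context `ctx` (cheap)
# target_verify(tokens, props) -> target model's greedy token at each proposed
#                                 position, computed in ONE batched forward pass

def speculative_step(draft_next, target_verify, tokens, k):
    # 1. Draft model proposes a block of k tokens sequentially (cheap calls)
    proposed, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Target model evaluates all k proposed positions in parallel
    target_preds = target_verify(tokens, proposed)

    # 3. Accept draft tokens while they match the target's choice;
    #    on the first mismatch, fall back to the target's token and stop
    accepted = []
    for draft_tok, tgt_tok in zip(proposed, target_preds):
        if draft_tok == tgt_tok:
            accepted.append(draft_tok)
        else:
            accepted.append(tgt_tok)  # rejection: take the target's token
            break
    return tokens + accepted
```

With a well-matched draft model, most blocks are accepted in full, so one expensive target pass yields up to k + 1 tokens instead of one.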