A high-performance, transformer-based language model implemented from scratch. This project focuses on the granular implementation of modern LLM architectures, moving away from high-level PyTorch abstractions to explicitly demonstrate the underlying mechanics of large language models, while remaining fully trainable.
This repository contains a complete implementation of a Transformer-based LLM. The core philosophy of this project is zero reliance on high-level torch.nn layers (like nn.Linear, nn.LayerNorm, or nn.Transformer).
- Model Architecture:
- Pre-normalization: Improves training stability by normalizing inputs before each sub-layer (attention and feed-forward) rather than after.
- RMSNorm: Root Mean Square Layer Normalization for faster computation.
- SwiGLU Activation: Implementation of the Gated Linear Unit variant used in Llama architectures.
- Rotary Positional Embeddings (RoPE): Features an auto-expanding implementation for flexible sequence lengths.
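The SwiGLU feed-forward listed above can be sketched as a minimal function. This is an illustration only: the name `swiglu_ffn` and the plain weight-matrix signature are assumptions, and the repo's `model.py` may organize the projections differently.

```python
import torch

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (SiLU(x W_gate) * (x W_up)) W_down.

    Hypothetical sketch of the Llama-style gated FFN; not the repo's exact code.
    """
    u = x @ w_gate
    gate = u * torch.sigmoid(u)          # SiLU (swish) written out by hand
    return (gate * (x @ w_up)) @ w_down  # gated product, then down-projection

# Example shapes: d_model=4, d_ff=8
x = torch.randn(2, 4)
w_gate = torch.randn(4, 8)
w_up = torch.randn(4, 8)
w_down = torch.randn(8, 4)
y = swiglu_ffn(x, w_gate, w_up, w_down)
```

The gate path and the up path are separate linear maps; only their elementwise product passes through the down-projection, which is what distinguishes SwiGLU from a plain two-layer MLP.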
- Training & Optimization:
- Custom AdamW: Built from the `torch.optim.Optimizer` base class.
- Learning Rate Schedule: Cosine annealing for optimal convergence.
- Gradient Clipping: To prevent exploding gradients during unstable phases of training.
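The cosine-annealing schedule above can be sketched as a small function. The name `cosine_lr` and the linear-warmup term are assumptions (warmup is a common companion to cosine decay); the actual schedule in `train_model.py` may differ.

```python
import math

def cosine_lr(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup followed by cosine decay from max_lr to min_lr.

    Hypothetical sketch of a common cosine-annealing schedule; not
    necessarily the exact schedule used in this repo.
    """
    if step < warmup_steps:
        # Ramp linearly from ~0 up to max_lr over the warmup window.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup budget consumed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: peak 3e-4, floor 3e-5, 100 warmup steps, 1000 total steps.
lr_mid = cosine_lr(550, 3e-4, 3e-5, 100, 1000)
```

At the end of warmup the schedule returns `max_lr` exactly, and at `total_steps` the cosine term reaches -1, so the learning rate lands on `min_lr`.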
The model strictly follows the architectural flow illustrated in the diagrams under `img/`. The repository is laid out as follows:
```
├── main/
│   ├── model.py               # Core Transformer component implementations
│   ├── tokenizer_optimized.py # Custom BPE/Tokenization
│   ├── train_model.py         # Training utils and optimizer
│   ├── run_train_model.py     # Entry point for training
│   └── play_model.ipynb       # Play with the trained model
├── tokenized_data/            # Pre-processed datasets for training
├── trained_tokenizer/         # Saved tokenizer states
├── img/                       # Architectural diagrams
├── run.sh                     # Shell script for one-touch execution
└── generate_tree.py           # Utility for project visualization
```
This implementation is optimized for efficiency and is fully trainable on consumer hardware, including MacBook Pro (M-series) chips.
To begin training on a dataset of your choice:
- Clone the repository.
- Ensure you have PyTorch installed.
- Implement the tokenizer to fill in `trained_tokenizer` and `tokenized_data`.
- Execute the training script (you may need to adjust some parameters before running):

```shell
./run.sh
```
To demonstrate technical rigor, this project intentionally avoids torch.nn high-level definitions. The only components used from torch are:
- `torch.nn.Parameter`: For weight initialization.
- Container classes (`Module`, `ModuleList`, `Sequential`): For organizing the model graph.
- `torch.optim.Optimizer`: Used only as a base class for a ground-up AdamW implementation.
Building a Transformer without the safety nets of torch.nn high-level modules revealed several non-obvious engineering challenges:
Implementing RMSNorm and Softmax from scratch highlighted the importance of numerical stability. Without `torch.nn.LayerNorm` as a safety net, the epsilon must be placed precisely (inside the root-mean-square term) to avoid division by zero or overflow during the reciprocal square root calculation.
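A minimal sketch of the epsilon placement discussed above. The helper name `rms_norm` is hypothetical and this is not the repo's exact code; the point is that `eps` sits inside the `rsqrt`, so an all-zero input never triggers a division by zero.

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: x / sqrt(mean(x^2) + eps) * weight.

    Hypothetical sketch; eps is added INSIDE the root so rsqrt
    never receives an exact zero.
    """
    ms = x.pow(2).mean(dim=-1, keepdim=True)   # mean of squares per row
    return x * torch.rsqrt(ms + eps) * weight

# All-zero input stays finite instead of producing NaNs.
out = rms_norm(torch.zeros(2, 4), torch.ones(4))
```

Placing `eps` outside the root (e.g. `rsqrt(ms) + eps`) would still overflow on a zero row, which is exactly the failure mode the text describes.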
Implementing RoPE required a deep dive into complex-number rotations. The most challenging part was creating an auto-expanding cache for the rotation frequencies, so the model can handle sequence lengths beyond the initial training window without recalculating the rotation matrix from scratch every time.
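One way to sketch such an auto-expanding frequency cache is below. The class name `RotaryCache` and the capacity-doubling policy are assumptions for illustration, not the repo's actual implementation.

```python
import torch

class RotaryCache:
    """Lazily built, auto-expanding RoPE cos/sin tables.

    Hypothetical sketch: when a request exceeds the cached length,
    capacity doubles, so repeated small extensions don't rebuild
    the tables on every call.
    """
    def __init__(self, head_dim, base=10000.0):
        # Inverse frequencies for each even dimension pair.
        self.inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.max_len = 0
        self.cos = None
        self.sin = None

    def get(self, seq_len):
        if seq_len > self.max_len:
            # Grow at least to seq_len, doubling to amortize rebuilds.
            self.max_len = max(seq_len, 2 * self.max_len) if self.max_len else seq_len
            t = torch.arange(self.max_len).float()
            freqs = torch.outer(t, self.inv_freq)   # (max_len, head_dim // 2)
            self.cos, self.sin = freqs.cos(), freqs.sin()
        return self.cos[:seq_len], self.sin[:seq_len]

cache = RotaryCache(head_dim=8)
cos, sin = cache.get(16)   # builds tables for 16 positions
```

Because the frequencies for existing positions are deterministic, expanding the cache leaves previously served rows byte-for-byte identical.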
In `nn.Linear`, the actual multiply in the forward pass is `y = x @ W.T + b`: PyTorch stores the weight with shape `(out_features, in_features)`, so a from-scratch replacement must transpose the weight (or store it pre-transposed) to reproduce the same results.
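A quick check of this convention; `nn.Linear` appears here only to verify the manual multiply against PyTorch's reference behavior, not as part of the model.

```python
import torch

x = torch.randn(3, 5)
W = torch.randn(7, 5)      # (out_features, in_features): PyTorch's storage layout
b = torch.randn(7)

# The multiply nn.Linear performs under the hood.
y_manual = x @ W.T + b

# Load the same weights into a real nn.Linear and compare.
lin = torch.nn.Linear(5, 7)
with torch.no_grad():
    lin.weight.copy_(W)
    lin.bias.copy_(b)
```

Forgetting the transpose either raises a shape error or, for square layers, silently computes the wrong product, which makes this one of the easiest bugs to introduce in a from-scratch linear layer.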
Implementing AdamW from the base `Optimizer` class was a masterclass in state management. I had to manually track the first and second moments (`m` and `v`) per parameter, apply bias correction, and decouple the weight-decay term from the gradient update.
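The moving parts of one AdamW update can be sketched as a standalone function. The name `adamw_step` and the free-function form are illustrative assumptions; the repo instead subclasses `torch.optim.Optimizer` and keeps this state in its `state` dict.

```python
import torch

def adamw_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, wd=0.01):
    """One AdamW update with decoupled weight decay (hypothetical sketch)."""
    if not state:
        state["step"] = 0
        state["m"] = torch.zeros_like(param)   # first moment (mean of grads)
        state["v"] = torch.zeros_like(param)   # second moment (mean of squared grads)
    state["step"] += 1
    b1, b2 = betas
    state["m"].mul_(b1).add_(grad, alpha=1 - b1)
    state["v"].mul_(b2).addcmul_(grad, grad, value=1 - b2)
    # Bias correction compensates for the zero-initialized moments.
    m_hat = state["m"] / (1 - b1 ** state["step"])
    v_hat = state["v"] / (1 - b2 ** state["step"])
    # Decoupled weight decay: shrink the parameter directly,
    # NOT by folding wd into the gradient (that would be plain Adam + L2).
    param.mul_(1 - lr * wd)
    param.sub_(lr * m_hat / (v_hat.sqrt() + eps))
    return param

p = torch.ones(3)
st = {}
adamw_step(p, torch.full((3,), 0.5), st)   # p moves below 1.0
```

The decoupling in the second-to-last line is the defining difference between AdamW and Adam-with-L2, and it is easy to get wrong when working directly against the `Optimizer` base class.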
- Stanford University CS336: A profound thank you to the course instructors and material for the guidance and motivation required to implement these complex systems from the ground up.
- Xuying Li: For the excellent recommendation of the CS336 curriculum.
License: MIT
