A high-performance, transformer-based language model implemented from scratch. This project focuses on the granular implementation of modern LLM architectures, moving away from high-level PyTorch abstractions to explicitly demonstrate the underlying mechanics of large language models, while remaining fully trainable.
This repository contains a complete implementation of a Transformer-based LLM. The core philosophy of this project is zero reliance on high-level torch.nn layers (like nn.Linear, nn.LayerNorm, or nn.Transformer).
- Model Architecture:
- Pre-normalization: Improves training stability by normalizing inputs before each sub-layer (attention and feed-forward) rather than after.
- RMSNorm: Root Mean Square Layer Normalization for faster computation.
- SwiGLU Activation: Implementation of the Gated Linear Unit variant used in Llama architectures.
- Rotary Positional Embeddings (RoPE): Features an auto-expanding implementation for flexible sequence lengths.
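The SwiGLU feed-forward listed above can be sketched as a minimal function. This is an illustration only: the name `swiglu_ffn` and the plain weight-matrix signature are assumptions, and the repo's `model.py` may organize the projections differently.

```python
import torch

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (SiLU(x W_gate) * (x W_up)) W_down.

    Hypothetical sketch of the Llama-style gated FFN; not the repo's exact code.
    """
    u = x @ w_gate
    gate = u * torch.sigmoid(u)          # SiLU (swish) written out by hand
    return (gate * (x @ w_up)) @ w_down  # gated product, then down-projection

# Example shapes: d_model=4, d_ff=8
x = torch.randn(2, 4)
w_gate = torch.randn(4, 8)
w_up = torch.randn(4, 8)
w_down = torch.randn(8, 4)
y = swiglu_ffn(x, w_gate, w_up, w_down)
```

The gate path and the up path are separate linear maps; only their elementwise product passes through the down-projection, which is what distinguishes SwiGLU from a plain two-layer MLP.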
- Training & Optimization:
- Custom AdamW: Built from the `torch.optim.Optimizer` base class.
- Learning Rate Schedule: Cosine annealing for optimal convergence.
- Gradient Clipping: To prevent exploding gradients during unstable phases of training.
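The cosine-annealing schedule above can be sketched as a small function. The name `cosine_lr` and the linear-warmup term are assumptions (warmup is a common companion to cosine decay); the actual schedule in `train_model.py` may differ.

```python
import math

def cosine_lr(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup followed by cosine decay from max_lr to min_lr.

    Hypothetical sketch of a common cosine-annealing schedule; not
    necessarily the exact schedule used in this repo.
    """
    if step < warmup_steps:
        # Ramp linearly from ~0 up to max_lr over the warmup window.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup budget consumed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: peak 3e-4, floor 3e-5, 100 warmup steps, 1000 total steps.
lr_mid = cosine_lr(550, 3e-4, 3e-5, 100, 1000)
```

At the end of warmup the schedule returns `max_lr` exactly, and at `total_steps` the cosine term reaches -1, so the learning rate lands on `min_lr`.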
The model strictly follows the architectural flow illustrated in the diagrams under `img/`. The repository is laid out as follows:
```
├── main/
│   ├── model.py               # Core Transformer component implementations
│   ├── tokenizer_optimized.py # Custom BPE/Tokenization
│   ├── train_model.py         # Training utils and optimizer
│   ├── run_train_model.py     # Entry point for training
│   └── play_model.ipynb       # Play with the trained model
├── tokenized_data/            # Pre-processed datasets for training
├── trained_tokenizer/         # Saved tokenizer states
├── img/                       # Architectural diagrams
├── run.sh                     # Shell script for one-touch execution
└── generate_tree.py           # Utility for project visualization
```
This implementation is optimized for efficiency and is fully trainable on consumer hardware, including MacBook Pro (M-series) chips.
To begin training on a dataset of your choice:
- Clone the repository.
- Ensure you have PyTorch installed.
- Implement the tokenizer to fill in `trained_tokenizer` and `tokenized_data`.
- Execute the training script (you may need to adjust some parameters before running):

```shell
./run.sh
```
To demonstrate technical rigor, this project intentionally avoids torch.nn high-level definitions. The only components used from torch are:
- `torch.nn.Parameter`: For weight initialization.
- Container classes (`Module`, `ModuleList`, `Sequential`): For organizing the model graph.
- `torch.optim.Optimizer`: Used only as a base class for a ground-up AdamW implementation.
Building a Transformer without the safety nets of torch.nn high-level modules revealed several non-obvious engineering challenges:
Implementing RMSNorm and Softmax from scratch highlighted the importance of numerical stability. Without `torch.nn.LayerNorm` as a safety net, the epsilon must be placed precisely (inside the root-mean-square term) to avoid division by zero or overflow during the reciprocal square root calculation.
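A minimal sketch of the epsilon placement discussed above. The helper name `rms_norm` is hypothetical and this is not the repo's exact code; the point is that `eps` sits inside the `rsqrt`, so an all-zero input never triggers a division by zero.

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: x / sqrt(mean(x^2) + eps) * weight.

    Hypothetical sketch; eps is added INSIDE the root so rsqrt
    never receives an exact zero.
    """
    ms = x.pow(2).mean(dim=-1, keepdim=True)   # mean of squares per row
    return x * torch.rsqrt(ms + eps) * weight

# All-zero input stays finite instead of producing NaNs.
out = rms_norm(torch.zeros(2, 4), torch.ones(4))
```

Placing `eps` outside the root (e.g. `rsqrt(ms) + eps`) would still overflow on a zero row, which is exactly the failure mode the text describes.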
Implementing RoPE required a deep dive into complex-number rotations. The most challenging part was creating an auto-expanding cache for the rotation frequencies, so the model can handle sequence lengths beyond the initial training window without recalculating the rotation matrix from scratch every time.
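One way to sketch such an auto-expanding frequency cache is below. The class name `RotaryCache` and the capacity-doubling policy are assumptions for illustration, not the repo's actual implementation.

```python
import torch

class RotaryCache:
    """Lazily built, auto-expanding RoPE cos/sin tables.

    Hypothetical sketch: when a request exceeds the cached length,
    capacity doubles, so repeated small extensions don't rebuild
    the tables on every call.
    """
    def __init__(self, head_dim, base=10000.0):
        # Inverse frequencies for each even dimension pair.
        self.inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.max_len = 0
        self.cos = None
        self.sin = None

    def get(self, seq_len):
        if seq_len > self.max_len:
            # Grow at least to seq_len, doubling to amortize rebuilds.
            self.max_len = max(seq_len, 2 * self.max_len) if self.max_len else seq_len
            t = torch.arange(self.max_len).float()
            freqs = torch.outer(t, self.inv_freq)   # (max_len, head_dim // 2)
            self.cos, self.sin = freqs.cos(), freqs.sin()
        return self.cos[:seq_len], self.sin[:seq_len]

cache = RotaryCache(head_dim=8)
cos, sin = cache.get(16)   # builds tables for 16 positions
```

Because the frequencies for existing positions are deterministic, expanding the cache leaves previously served rows byte-for-byte identical.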
In `nn.Linear`, the actual multiply in the forward pass is `y = x @ W.T + b`: PyTorch stores the weight with shape `(out_features, in_features)`, so a from-scratch replacement must transpose the weight (or store it pre-transposed) to reproduce the same results.
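A quick check of this convention; `nn.Linear` appears here only to verify the manual multiply against PyTorch's reference behavior, not as part of the model.

```python
import torch

x = torch.randn(3, 5)
W = torch.randn(7, 5)      # (out_features, in_features): PyTorch's storage layout
b = torch.randn(7)

# The multiply nn.Linear performs under the hood.
y_manual = x @ W.T + b

# Load the same weights into a real nn.Linear and compare.
lin = torch.nn.Linear(5, 7)
with torch.no_grad():
    lin.weight.copy_(W)
    lin.bias.copy_(b)
```

Forgetting the transpose either raises a shape error or, for square layers, silently computes the wrong product, which makes this one of the easiest bugs to introduce in a from-scratch linear layer.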
Implementing AdamW from the base `Optimizer` class was a masterclass in state management. I had to manually track the first and second moments (`m` and `v`) per parameter, apply bias correction, and decouple the weight-decay term from the gradient update.
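The moving parts of one AdamW update can be sketched as a standalone function. The name `adamw_step` and the free-function form are illustrative assumptions; the repo instead subclasses `torch.optim.Optimizer` and keeps this state in its `state` dict.

```python
import torch

def adamw_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, wd=0.01):
    """One AdamW update with decoupled weight decay (hypothetical sketch)."""
    if not state:
        state["step"] = 0
        state["m"] = torch.zeros_like(param)   # first moment (mean of grads)
        state["v"] = torch.zeros_like(param)   # second moment (mean of squared grads)
    state["step"] += 1
    b1, b2 = betas
    state["m"].mul_(b1).add_(grad, alpha=1 - b1)
    state["v"].mul_(b2).addcmul_(grad, grad, value=1 - b2)
    # Bias correction compensates for the zero-initialized moments.
    m_hat = state["m"] / (1 - b1 ** state["step"])
    v_hat = state["v"] / (1 - b2 ** state["step"])
    # Decoupled weight decay: shrink the parameter directly,
    # NOT by folding wd into the gradient (that would be plain Adam + L2).
    param.mul_(1 - lr * wd)
    param.sub_(lr * m_hat / (v_hat.sqrt() + eps))
    return param

p = torch.ones(3)
st = {}
adamw_step(p, torch.full((3,), 0.5), st)   # p moves below 1.0
```

The decoupling in the second-to-last line is the defining difference between AdamW and Adam-with-L2, and it is easy to get wrong when working directly against the `Optimizer` base class.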
- Stanford University CS336: A profound thank you to the course instructors and material for the guidance and motivation required to implement these complex systems from the ground up.
- Xuying Li: For the excellent recommendation of the CS336 curriculum.
License: MIT
