This folder contains resources and code samples for core tasks involved in pre-training large language models. Below is an overview of the notebooks in this directory:
- `Attention Mechanism.ipynb`
  - Implements a simple variant of self-attention from scratch, without trainable weights.
  - Great for understanding the basic math and logic behind transformer attention layers.
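To give a flavor of the idea, here is a minimal sketch of self-attention without trainable weights, written in NumPy for this overview (it is not code taken from the notebook): attention weights come from dot products between the input vectors themselves, and each output is a weighted sum of the inputs.

```python
import numpy as np

def simple_self_attention(X):
    """Simplified self-attention with no trainable weights:
    the inputs themselves act as queries, keys, and values."""
    scores = X @ X.T  # pairwise dot-product similarities between tokens
    # row-wise softmax (shifted by the row max for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ X  # each context vector is a weighted sum of all inputs

# three toy token embeddings of dimension 4
X = np.array([[0.4, 0.1, 0.8, 0.6],
              [0.5, 0.9, 0.1, 0.3],
              [0.2, 0.8, 0.7, 0.1]])
print(simple_self_attention(X).shape)  # (3, 4): one context vector per token
```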
- `data_cleaning_and_preprocessing.ipynb`
  - Creating tokens: shows step-by-step text preprocessing and basic tokenization using regular expressions.
  - Examples include splitting text, removing whitespace, and handling punctuation for custom tokenizers.
  - Explains design decisions (e.g., when to remove or keep whitespace for applications like code vs. prose).
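As a rough illustration of this style of tokenizer (a sketch for this overview; the exact pattern used in the notebook may differ), a single regular expression can split on punctuation and whitespace while keeping the punctuation as tokens:

```python
import re

text = "Hello, world. Is this-- a test?"
# split on common punctuation, '--', and whitespace; the capturing
# group keeps the delimiters so punctuation survives as tokens
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
# drop empty strings and whitespace-only tokens (fine for prose;
# for code you would usually keep whitespace, since indentation matters)
tokens = [t for t in tokens if t.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```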
- `tokenEmbeddings.ipynb`
  - Demonstrates loading and working with pretrained Word2Vec embeddings to convert tokens into vectors.
  - Useful for connecting tokenizer output to numeric vectors and preparing data for models.
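Which pretrained model the notebook loads isn't spelled out here, so as an assumed example, gensim's downloader can fetch the Google News Word2Vec vectors and map a token to a 300-dimensional vector:

```python
# requires: pip install gensim
import gensim.downloader as api

# downloads the pretrained Google News vectors on first use (a large file)
model = api.load("word2vec-google-news-300")

vector = model["computer"]  # 300-dimensional NumPy array for one token
print(vector.shape)         # (300,)
print(model.most_similar("computer", topn=3))  # nearest neighbors in vector space
```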
A suggested path through the notebooks:

- Start with `data_cleaning_and_preprocessing.ipynb` to learn token creation and text cleaning.
- Move on to `tokenEmbeddings.ipynb` to see how tokens are transformed into embedding vectors using popular pretrained models.
- Dive into `Attention Mechanism.ipynb` for a hands-on implementation of the self-attention building block, which forms the heart of modern transformer architectures.
These notebooks will help you master the pre-training pipeline for LLMs: from raw text to tokens, vectors, and attention mechanisms!