# Leap-0
This repository contains the implementation of **Leap-0**, a lightweight, modified version of the GPT architecture trained from scratch on FineWeb-Edu, an open-source dataset. The project demonstrates the design, training, and optimization of a custom language model on local hardware.
<div align="center">
  <img src="LLM.drawio.png" alt="Leap-0 architecture diagram" width="300">
  <p><strong>Figure 1: Architecture of Leap-0</strong></p>
</div>
## Features
- **Custom GPT Architecture**: A miniaturized version of the GPT model tailored for efficient training on limited hardware.
- **Local Training**: Complete model training executed on local resources, enabling cost-effective development.
- **Open-Source Dataset**: Trained on the publicly available FineWeb-Edu dataset to ensure accessibility and reproducibility.
- **Scalable Design**: Architecture optimized for experimentation and scalability while maintaining resource efficiency.
## Implementation Details
1. **Model Architecture**
   - A streamlined GPT-based architecture designed for reduced complexity and improved training efficiency.
   - Incorporates modifications to parameter scaling to suit resource-constrained environments.
2. **Training**
   - Training executed locally in PyTorch on an NVIDIA RTX 4500 Ada Generation GPU (24 GB).
3. **Testing**
   - A simple Streamlit UI for testing the model's text-generation capability (a minimal sketch follows below).
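The sketch below illustrates what such a test UI could look like. It is an illustration only: the `load_model` and `generate` helpers and the checkpoint name `leap0.pt` are hypothetical stand-ins, not the repository's actual interfaces.

```python
# app.py -- minimal Streamlit sketch for interactive generation.
# `load_model`, `generate`, and "leap0.pt" are hypothetical placeholders.
import streamlit as st

st.title("Leap-0 Playground")

prompt = st.text_area("Prompt", "The water cycle begins when")
max_new_tokens = st.slider("Max new tokens", min_value=16, max_value=256, value=64)

if st.button("Generate"):
    model = load_model("leap0.pt")                  # assumed helper: load the trained checkpoint
    text = generate(model, prompt, max_new_tokens)  # assumed helper: autoregressive sampling
    st.write(text)
```

Launched with `streamlit run app.py`, this gives a prompt box, a token-count slider, and a generate button.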
## Model Architecture
### Configuration
- **Sequence Length:** 512 tokens
- **Vocabulary Size:** 48,951 tokens
  - Includes 50,000 BPE merges, 256 special byte tokens, and 1 `<|endoftext|>` token.
- **Number of Layers:** 4 transformer blocks
- **Attention Heads:** 8 per block
- **Embedding Dimension:** 512
- **Dropout:** 0.1
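As a quick reference, these hyperparameters map onto a configuration object along the following lines. This is a sketch; the class and field names follow common GPT-style conventions and are not necessarily the repository's exact identifiers.

```python
from dataclasses import dataclass

@dataclass
class Leap0Config:
    block_size: int = 512     # maximum sequence length (tokens)
    vocab_size: int = 48_951  # BPE merges + byte tokens + <|endoftext|>
    n_layer: int = 4          # number of transformer blocks
    n_head: int = 8           # attention heads per block
    n_embd: int = 512         # embedding dimension
    dropout: float = 0.1
```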
### Components
1. **Embeddings:**
   - **Word Embeddings (`wte`):** Learnable token embeddings of size `n_embd`.
   - **Position Embeddings (`wpe`):** Learnable positional embeddings for sequences up to `block_size`.
2. **Transformer Blocks:**
   - A stack of 4 transformer blocks, each comprising:
     - Multi-head self-attention mechanisms.
     - Feedforward networks for feature transformation.
3. **Output Head:**
   - **Linear Layer (`lm_head`):** Maps hidden states to logits for token predictions.
   - Implements weight sharing between token embeddings (`wte`) and output projection for parameter efficiency.
4. **Layer Normalization:**
   - Final layer normalization (`ln_f`) ensures stable optimization.
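Put together, the components above correspond to a module along these lines. This is a minimal PyTorch sketch, not the repository's actual implementation: it reuses `Leap0Config` from the configuration sketch, the `wte`/`wpe`/`ln_f`/`lm_head` names follow the identifiers above, and the block is assumed to be a standard pre-LayerNorm attention + feedforward pair built on `nn.MultiheadAttention`.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer block: causal multi-head self-attention + feedforward."""
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = nn.MultiheadAttention(config.n_embd, config.n_head,
                                          dropout=config.dropout, batch_first=True)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
            nn.Dropout(config.dropout),
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: each position attends only to itself and earlier positions.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln_2(x))
        return x

class Leap0(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)  # token embeddings
        self.wpe = nn.Embedding(config.block_size, config.n_embd)  # positional embeddings
        self.drop = nn.Dropout(config.dropout)
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)                    # final layer norm
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                      # weight sharing with wte

    def forward(self, idx):
        # idx: (batch, time) token ids, with time <= block_size
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.drop(self.wte(idx) + self.wpe(pos))
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))                          # logits over the vocabulary
```

A forward pass over token ids of shape `(batch, block_size)` then yields logits of shape `(batch, block_size, vocab_size)`.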
## Current Status
1. Dataset: FineWeb-Edu (18.5 GB), used in its entirety.
2. Training steps: 5,000
3. Training time: ~7 hours
4. Checkpoint format: `.pt`
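Loading such a `.pt` checkpoint for inference could look like the sketch below. The file name `leap0.pt` and the assumption that it stores a plain `state_dict` are illustrative; `Leap0` and `Leap0Config` refer to the sketches above.

```python
import torch

# Assumes the checkpoint was written with torch.save(model.state_dict(), "leap0.pt").
config = Leap0Config()
model = Leap0(config)
state_dict = torch.load("leap0.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()  # disable dropout for generation
```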
## Requirements
- Python 3.8+
- PyTorch 2.0+ or TensorFlow 2.10+
- CUDA-enabled GPU with at least 4 GB of VRAM (recommended)
- Dependencies listed in `requirements.txt`
- **Note**: Different operating systems support different versions of PyTorch/TensorFlow for CUDA (local GPU) use; verify compatibility with your OS before installing.
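One quick way to confirm that the installed PyTorch build can actually see the local GPU before starting a training run:

```python
import torch

print(torch.__version__)                   # expect 2.0 or newer
print(torch.cuda.is_available())           # True only if the CUDA build matches your driver
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # name of the local GPU
```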