# Ursa_Minor_Smashed

A GPT-2 (124M parameters) model trained from scratch on the FineWeb-edu dataset using a modern training recipe. This model reproduces the GPT-2 small architecture with updated training techniques including Flash Attention, mixed precision training, and improved optimization strategies.

## Model Details

### Model Description

- **Developed by:** Kaileh57
- **Model type:** GPT-2 (Decoder-only Transformer)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** Trained from scratch

### Model Architecture

| Parameter | Value |
|-----------|-------|
| Parameters | 124M |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 1024 |
| Vocabulary Size | 50,304 |
| Activation Function | GELU (tanh approximation) |
| Position Embeddings | Learned |
| Layer Norm | Pre-normalization |
| Attention Type | Multi-head with Flash Attention |
| Weight Tying | Token embeddings tied with output projection |

### Training Details

- **Dataset:** FineWeb-edu (10B token sample)
- **Training Regime:** Mixed precision (bfloat16)
- **Optimizer:** AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
- **Learning Rate:** 1.5e-3 with cosine decay
- **Batch Size:** 524,288 tokens
- **Training Steps:** 19,073 (1 epoch)
- **Warmup Steps:** 715
- **Weight Decay:** 0.1
- **Gradient Clipping:** 1.0
- **Hardware:** NVIDIA RTX A6000 (48GB)
- **Training Time:** ~2 days

### Performance

| Benchmark | Score |
|-----------|-------|
| HellaSwag | 32.4% |
| Final Loss | 2.85 |
| Perplexity | ~17.3 |

*Note: Scores are from single epoch training.
Multi-epoch training reaches ~35% on HellaSwag.*

## Uses

### Direct Use

This model can be used for:

- Text generation
- Research on efficient training methods
- Educational purposes to understand GPT architectures
- Fine-tuning for specific downstream tasks

### Out-of-Scope Use

- Production applications requiring high reliability
- Generation of factual information (model may hallucinate)
- Any use case requiring larger context than 1024 tokens

## Quick Start

### Installation

#### CPU Installation (Default)

```bash
# Clone the repository
git clone https://github.com/Kaileh57/Ursa_Minor_Smashed.git
cd Ursa_Minor_Smashed

# Set up CPU environment
pip install -r requirements.txt
# or use the setup script:
./setup.sh
```

#### CUDA Installation (For GPU Acceleration)

If you have a CUDA-capable GPU and want to use GPU acceleration:

**Linux/macOS:**

```bash
# Use the automated CUDA setup script
chmod +x setup-cuda.sh
./setup-cuda.sh
```

**Windows:**

```batch
:: Use the Windows CUDA setup script
setup-cuda.bat
```

**Manual CUDA Installation:**

```bash
# Create separate CUDA environment
python -m venv venv-cuda
source venv-cuda/bin/activate  # On Windows: venv-cuda\Scripts\activate

# Install CUDA requirements
pip install -r requirements-cuda.txt
```

**CUDA Requirements:**

- NVIDIA GPU with CUDA Compute Capability 3.5 or higher
- NVIDIA drivers installed
- CUDA 11.8 or 12.1 toolkit (optional; PyTorch bundles its own CUDA runtime)
- At least 4GB GPU memory recommended

### Basic Usage

#### Command Line Interface

**CUDA Version (GPU):**

```bash
# Basic text generation
python inference_cuda.py --prompt "Hello, I'm a language model" --max-tokens 50

# Creative writing
python inference_cuda.py --prompt "Once upon a time" --max-tokens 200 --temperature 0.9

# More focused output
python inference_cuda.py --prompt "The key to machine learning is" --max-tokens 100 --temperature 0.7 --top-k 50
```

**CPU Version:**

```bash
# Basic text generation
python inference_cpu.py --prompt "Hello, I'm a language model" --max-tokens 50

# Creative writing
python inference_cpu.py --prompt "Once upon a time" --max-tokens 120 --temperature 0.9

# More focused output
python inference_cpu.py --prompt "The key to machine learning is" --max-tokens 80 --temperature 0.7 --top-k 30
```

#### Python Interface

**CUDA Version:**

```python
from inference_cuda import generate_direct, load_model_direct

# Load model once (requires CUDA)
model = load_model_direct("model_optimized.pt")

# Generate text with CUDA optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=100,  # Higher token budget for GPU
    temperature=0.8,
    top_k=50,  # Higher top_k for better quality
)
print(result)
```

**CPU Version:**

```python
from inference_cpu import generate_direct, load_model_direct

# Load model once (CPU only)
model = load_model_direct("model_optimized.pt")

# Generate text with CPU optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=80,  # Lower token budget for CPU efficiency
    temperature=0.8,
    top_k=30,  # Lower top_k for CPU efficiency
)
print(result)
```

#### Chat Interface

**CUDA Version:**

```bash
# Start CUDA-optimized chat
python chat_cuda.py
```

**CPU Version:**

```bash
# Start CPU-optimized chat
python chat_cpu.py
```

## Training Procedure

The model was trained using a modern GPT training recipe including:

- Flash Attention for efficient attention computation
- Mixed precision training with bfloat16
- Gradient accumulation to achieve large batch sizes
- TF32 for faster matrix multiplications
- Optimized vocabulary size (50,304) for better GPU utilization

### Training Hyperparameters

- **Learning rate schedule:** Cosine decay from 1.5e-3 to 1.5e-4
- **Gradient accumulation steps:** Dynamically calculated
- **Mixed precision:** bfloat16 with PyTorch autocast

## Evaluation

### Testing Data

Evaluated on:

- FineWeb-edu validation set
- HellaSwag benchmark (10,042 examples)

### Metrics

- Cross-entropy loss on validation set
- Accuracy on
HellaSwag commonsense reasoning

## Technical Specifications

### Compute Infrastructure

- **Hardware:** 1x NVIDIA RTX A6000 (48GB VRAM)
- **Software:** PyTorch 2.0+, CUDA 12.1, Flash Attention 2

### Model Initialization

- Weights initialized with σ=0.02 (scaled by 1/√(2×n_layers) for residual projections)
- Embeddings initialized with σ=0.02
- Biases initialized to zero
- LayerNorm weights initialized to 1.0, biases to 0.0

## Citation

If you use this model, please cite:

```bibtex
@misc{ursa-minor-smashed,
  author = {Kaileh57},
  title = {Ursa Minor Smashed: Efficient GPT-2 Training},
  year = {2024},
  url = {https://github.com/Kaileh57/Ursa_Minor_Smashed}
}
```

## Available Tools

### Core Scripts

- **`inference_cuda.py`** - CUDA-optimized inference script
- **`inference_cpu.py`** - CPU-optimized inference script
- **`chat_cuda.py`** - CUDA-optimized chat interface
- **`chat_cpu.py`** - CPU-optimized chat interface
- **`benchmark_cuda.py`** - CUDA performance benchmarking tool
- **`benchmark_cpu.py`** - CPU performance benchmarking tool
- **`convert_to_gguf.py`** - Convert to GGUF format for llama.cpp

### Examples

- **`examples/basic_usage_cuda.py`** - CUDA usage examples
- **`examples/basic_usage_cpu.py`** - CPU usage examples

### Parameters

#### Generation Parameters

- **`temperature`** (0.1-1.0): Controls randomness (lower = more focused)
- **`top_k`** (1-100): Limits sampling to the k most likely tokens
- **`top_p`** (0.1-1.0): Nucleus sampling threshold
- **`repetition_penalty`** (1.0-2.0): Reduces repetitive output
- **`max_tokens`**: Maximum number of tokens to generate

#### Recommended Settings

- **Creative writing:** temp=0.8-0.9, top_p=0.9, top_k=50
- **Factual content:** temp=0.3-0.5, top_p=0.8, top_k=20
- **Code generation:** temp=0.2-0.4, top_p=0.7, top_k=10

## Performance

**CUDA Performance (GPU):**

- **Inference Speed:** ~50-150+ tokens/sec (depending on GPU)
- **Memory Usage:** ~2-4GB VRAM
- **Features:** CUDA autocast, torch.compile optimization
- **Latency:** ~10-20ms per token
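The throughput figures in this section can be checked with a simple timing loop. A minimal sketch is below; the `fake_generate` stub is a placeholder standing in for a real call such as `generate_direct(model, prompt, max_new_tokens=n)` from the inference scripts:

```python
import time

def measure_throughput(generate, n_tokens=50):
    """Time one generation call and report tokens/sec and ms/token."""
    start = time.perf_counter()
    generate(n_tokens)  # stand-in for generate_direct(model, prompt, max_new_tokens=n_tokens)
    elapsed = time.perf_counter() - start
    tokens_per_sec = n_tokens / elapsed
    ms_per_token = 1000 * elapsed / n_tokens
    return tokens_per_sec, ms_per_token

# Placeholder "model": sleeps ~1 ms per token to mimic generation cost
def fake_generate(n_tokens):
    for _ in range(n_tokens):
        time.sleep(0.001)

tps, ms = measure_throughput(fake_generate)
print(f"{tps:.0f} tokens/sec, {ms:.2f} ms/token")
```

Note that tokens/sec and ms/token are reciprocals (e.g. ~10-20 ms/token corresponds to ~50-100 tokens/sec), so a large gap between the two reported numbers usually signals a measurement bug.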
**CPU Performance:**

- **Inference Speed:** ~15-25 tokens/sec
- **Memory Usage:** ~2-3GB RAM
- **Features:** Multi-threading, CPU-optimized parameters
- **Latency:** ~40-65ms per token

**General:**

- **Context Length:** 1024 tokens maximum
- **Model Size:** 124M parameters

## Acknowledgments

- Based on Andrej Karpathy's nanoGPT implementation
- Trained on HuggingFace's FineWeb-edu dataset
- Uses OpenAI's GPT-2 tokenizer
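For readers curious how the generation parameters described above combine, here is a stdlib-only illustration of temperature scaling followed by top-k filtering. This is a sketch for intuition, not the repository's actual sampling code:

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_k=50, rng=random):
    """Temperature-scaled top-k sampling over a {token: logit} dict."""
    # Keep only the top_k highest-logit candidates
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature < 1 sharpens the distribution; > 1 flattens it
    scaled = [(tok, logit / temperature) for tok, logit in top]
    # Softmax over the filtered, scaled logits (shift by max for stability)
    m = max(s for _, s in scaled)
    exps = [(tok, math.exp(s - m)) for tok, s in scaled]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]
    # Draw one token from the resulting distribution
    r = rng.random()
    cum = 0.0
    for tok, p in probs:
        cum += p
        if r <= cum:
            return tok
    return probs[-1][0]

logits = {"cat": 2.0, "dog": 1.5, "car": 0.2, "the": -1.0}
print(sample_next_token(logits, temperature=0.7, top_k=2))  # "cat" or "dog"
```

With `top_k=1` this reduces to greedy decoding, and raising `temperature` toward 1.0 spreads probability onto lower-ranked tokens, which is why the recommended creative-writing settings use higher values than the factual ones.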