# Ursa_Minor_Smashed
A GPT-2 (124M parameters) model trained from scratch on the FineWeb-edu dataset using a modern training recipe. This model reproduces the GPT-2 small architecture with updated training techniques, including Flash Attention, mixed precision training, and improved optimization strategies.
## Model Details
### Model Description
- **Developed by:** Kaileh57
- **Model type:** GPT-2 (Decoder-only Transformer)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** None (trained from scratch)
### Model Architecture
| Parameter | Value |
|-----------|-------|
| Parameters | 124M |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 1024 |
| Vocabulary Size | 50,304 |
| Activation Function | GELU (tanh approximation) |
| Position Embeddings | Learned |
| Layer Norm | Pre-normalization |
| Attention Type | Multi-head with Flash Attention |
| Weight Tying | Token embeddings tied with output projection |
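The 124M figure follows directly from the table above. As a back-of-envelope check (pure arithmetic, no framework needed; bias and LayerNorm bookkeeping is a reasonable reconstruction rather than taken from the repository's code):

```python
# Approximate parameter count for the GPT-2 small architecture above.
V, T, H, L = 50_304, 1024, 768, 12  # vocab, context, hidden, layers

token_emb = V * H                 # tied with the output projection, so counted once
pos_emb = T * H
attn = 4 * H * H + 4 * H          # QKV + output projection, with biases
mlp = 2 * 4 * H * H + 4 * H + H   # up-projection to 4H and back down, with biases
ln = 2 * 2 * H                    # two LayerNorms per block (weight + bias each)
block = attn + mlp + ln
final_ln = 2 * H

total = token_emb + pos_emb + L * block + final_ln
print(f"{total:,}")  # ≈ 124.5M
```

Note that weight tying means the output head adds no parameters beyond the token embedding matrix.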
### Training Details
- **Dataset:** FineWeb-edu (10B-token sample)
- **Training Regime:** Mixed precision (bfloat16)
- **Optimizer:** AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
- **Learning Rate:** 1.5e-3 with cosine decay
- **Batch Size:** 524,288 tokens
- **Training Steps:** 19,073 (1 epoch)
- **Warmup Steps:** 715
- **Weight Decay:** 0.1
- **Gradient Clipping:** 1.0
- **Hardware:** NVIDIA RTX A6000 (48GB)
- **Training Time:** ~2 days
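The step count and batch size above are consistent with one pass over the 10B-token sample:

```python
# Sanity check: steps × tokens-per-step should match the ~10B-token dataset.
tokens_per_step = 524_288   # 2**19
steps = 19_073
total_tokens = steps * tokens_per_step
print(f"{total_tokens / 1e9:.2f}B")  # ≈ 10.00B, i.e. one epoch over the sample
```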
### Performance
| Metric | Value |
|--------|-------|
| HellaSwag | 32.4% |
| Final Loss | 2.85 |
| Perplexity | ~17.3 |

*Note: Scores are from single-epoch training. Multi-epoch training reaches ~35% on HellaSwag.*
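The loss and perplexity rows are two views of the same number, since perplexity is the exponential of the cross-entropy loss:

```python
import math

# Perplexity = exp(cross-entropy loss), so the two reported values agree.
loss = 2.85
print(round(math.exp(loss), 1))  # 17.3
```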
## Uses
### Direct Use
This model can be used for:
- Text generation
- Research on efficient training methods
- Educational purposes to understand GPT architectures
- Fine-tuning for specific downstream tasks
### Out-of-Scope Use
- Production applications requiring high reliability
- Generation of factual information (the model may hallucinate)
- Any use case requiring a context longer than 1024 tokens
## Quick Start
### Installation
#### CPU Installation (Default)
```bash
# Clone the repository
git clone https://github.com/Kaileh57/Ursa_Minor_Smashed.git
cd Ursa_Minor_Smashed
# Set up the CPU environment
pip install -r requirements.txt
# or use the setup script: ./setup.sh
```
#### CUDA Installation (For GPU Acceleration)
If you have a CUDA-capable GPU and want to use GPU acceleration:
**Linux/macOS:**
```bash
# Use the automated CUDA setup script
chmod +x setup-cuda.sh
./setup-cuda.sh
```
**Windows:**
```batch
REM Use the Windows CUDA setup script
setup-cuda.bat
```
**Manual CUDA Installation:**
```bash
# Create a separate CUDA environment
python -m venv venv-cuda
source venv-cuda/bin/activate  # On Windows: venv-cuda\Scripts\activate
# Install the CUDA requirements
pip install -r requirements-cuda.txt
```
**CUDA Requirements:**
- NVIDIA GPU with CUDA Compute Capability 3.5 or higher
- NVIDIA drivers installed
- CUDA 11.8 or 12.1 toolkit (optional; PyTorch bundles its own CUDA runtime)
- At least 4GB of GPU memory recommended
### Basic Usage
#### Command Line Interface
**CUDA Version (GPU):**
```bash
# Basic text generation
python inference_cuda.py --prompt "Hello, I'm a language model" --max-tokens 50
# Creative writing
python inference_cuda.py --prompt "Once upon a time" --max-tokens 200 --temperature 0.9
# More focused output
python inference_cuda.py --prompt "The key to machine learning is" --max-tokens 100 --temperature 0.7 --top-k 50
```
**CPU Version:**
```bash
# Basic text generation
python inference_cpu.py --prompt "Hello, I'm a language model" --max-tokens 50
# Creative writing
python inference_cpu.py --prompt "Once upon a time" --max-tokens 120 --temperature 0.9
# More focused output
python inference_cpu.py --prompt "The key to machine learning is" --max-tokens 80 --temperature 0.7 --top-k 30
```
#### Python Interface
**CUDA Version:**
```python
from inference_cuda import generate_direct, load_model_direct

# Load the model once (requires CUDA)
model = load_model_direct("model_optimized.pt")
# Generate text with CUDA optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=100,  # higher token budget is affordable on GPU
    temperature=0.8,
    top_k=50             # higher top_k for better quality
)
print(result)
```
**CPU Version:**
```python
from inference_cpu import generate_direct, load_model_direct

# Load the model once (CPU only)
model = load_model_direct("model_optimized.pt")
# Generate text with CPU optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=80,  # lower token budget for CPU efficiency
    temperature=0.8,
    top_k=30            # lower top_k for CPU efficiency
)
print(result)
```
#### Chat Interface
**CUDA Version:**
```bash
# Start the CUDA-optimized chat
python chat_cuda.py
```
**CPU Version:**
```bash
# Start the CPU-optimized chat
python chat_cpu.py
```
## Training Procedure
The model was trained using a modern GPT training recipe, including:
- Flash Attention for efficient attention computation
- Mixed precision training with bfloat16
- Gradient accumulation to achieve large batch sizes
- TF32 for faster matrix multiplications
- Optimized vocabulary size (50,304) for better GPU utilization
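Gradient accumulation reaches the 524,288-token batch by summing gradients over several smaller forward/backward passes before each optimizer step. A minimal sketch of the arithmetic; the micro-batch size of 16 sequences here is illustrative, not taken from the training configuration:

```python
# How many micro-batches are accumulated per optimizer step.
total_batch_tokens = 524_288   # target batch size in tokens
micro_batch_seqs = 16          # assumed micro-batch size (sequences per forward pass)
seq_len = 1024                 # context length
grad_accum_steps = total_batch_tokens // (micro_batch_seqs * seq_len)
print(grad_accum_steps)  # 32
```

This is why the section below describes the accumulation steps as "dynamically calculated": the same token budget is divisible by whatever micro-batch fits in memory.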
### Training Hyperparameters
- **Learning rate schedule:** Cosine decay from 1.5e-3 to 1.5e-4
- **Gradient accumulation steps:** Dynamically calculated
- **Mixed precision:** bfloat16 with PyTorch autocast
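The schedule above (715 warmup steps, then cosine decay from 1.5e-3 to 1.5e-4 over 19,073 steps) can be sketched as a pure function of the step. Details such as whether the warmup is 0- or 1-indexed are assumptions, not taken from the repository:

```python
import math

def lr_at(step, max_lr=1.5e-3, min_lr=1.5e-4, warmup=715, max_steps=19_073):
    """Linear warmup followed by cosine decay to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / (max_steps - warmup)   # 0 → 1 over decay phase
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 → 0
    return min_lr + coeff * (max_lr - min_lr)

# Warmup end hits the peak rate; the final step lands near the floor.
print(f"{lr_at(714):.2e}, {lr_at(19_072):.2e}")
```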
## Evaluation
### Testing Data
Evaluated on:
- FineWeb-edu validation set
- HellaSwag benchmark (10,042 examples)
### Metrics
- Cross-entropy loss on the validation set
- Accuracy on HellaSwag commonsense reasoning
## Technical Specifications
### Compute Infrastructure
- **Hardware:** 1x NVIDIA RTX A6000 (48GB VRAM)
- **Software:** PyTorch 2.0+, CUDA 12.1, Flash Attention 2
### Model Initialization
- Weights initialized with σ=0.02 (scaled by 1/√(2×n_layers) for residual projections)
- Embeddings initialized with σ=0.02
- Biases initialized to zero
- LayerNorm weights initialized to 1.0, biases to 0.0
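The residual-projection scaling above keeps activation magnitudes from growing with depth: each of the 12 blocks adds two residual contributions (attention and MLP), so the projections feeding the residual stream are shrunk by 1/√(2×n_layers). Concretely:

```python
import math

# Effective std for residual projections under the scheme above.
n_layers = 12
base_std = 0.02
residual_std = base_std / math.sqrt(2 * n_layers)
print(f"{residual_std:.5f}")  # ≈ 0.00408
```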
## Citation
If you use this model, please cite:
```bibtex
@misc{ursa-minor-smashed,
  author = {Kaileh57},
  title = {Ursa Minor Smashed: Efficient GPT-2 Training},
  year = {2024},
  url = {https://github.com/Kaileh57/Ursa_Minor_Smashed}
}
```
## Available Tools
### Core Scripts
- **`inference_cuda.py`** - CUDA-optimized inference script
- **`inference_cpu.py`** - CPU-optimized inference script
- **`chat_cuda.py`** - CUDA-optimized chat interface
- **`chat_cpu.py`** - CPU-optimized chat interface
- **`benchmark_cuda.py`** - CUDA performance benchmarking tool
- **`benchmark_cpu.py`** - CPU performance benchmarking tool
- **`convert_to_gguf.py`** - Convert to GGUF format for llama.cpp
### Examples
- **`examples/basic_usage_cuda.py`** - CUDA usage examples
- **`examples/basic_usage_cpu.py`** - CPU usage examples
### Parameters
#### Generation Parameters
- **`temperature`** (0.1-1.0): Controls randomness (lower = more focused)
- **`top_k`** (1-100): Limit sampling to the top-k most likely tokens
- **`top_p`** (0.1-1.0): Nucleus sampling threshold
- **`repetition_penalty`** (1.0-2.0): Reduce repetitive output
- **`max_tokens`**: Maximum number of tokens to generate
#### Recommended Settings
- **Creative writing**: temp=0.8-0.9, top_p=0.9, top_k=50
- **Factual content**: temp=0.3-0.5, top_p=0.8, top_k=20
- **Code generation**: temp=0.2-0.4, top_p=0.7, top_k=10
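To make the `temperature` and `top_k` knobs concrete, here is an illustrative stdlib-only sketch of how they interact (not the repository's implementation): temperature rescales the logits before softmax, and top-k discards everything outside the k highest-scoring tokens before sampling.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=50):
    """Temperature scaling followed by top-k filtering over raw logits."""
    scaled = [l / temperature for l in logits]
    # Keep only tokens at or above the k-th highest scaled logit.
    cutoff = sorted(scaled, reverse=True)[:top_k][-1]
    kept = [(i, l) for i, l in enumerate(scaled) if l >= cutoff]
    # Softmax over the survivors (subtract the max for numerical stability).
    m = max(l for _, l in kept)
    weights = [math.exp(l - m) for _, l in kept]
    return random.choices([i for i, _ in kept], weights=weights, k=1)[0]

random.seed(0)
# With top_k=2, only the two highest-logit tokens (indices 0 and 1) can appear.
print(sample_top_k([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=2))
```

Lower temperatures sharpen the distribution toward the top token, which is why the focused presets above pair low `temperature` with low `top_k`.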
## Performance
**CUDA Performance (GPU)**:
- **Inference Speed**: ~50-150+ tokens/sec (depends on GPU)
- **Memory Usage**: ~2-4GB VRAM
- **Features**: CUDA autocast, torch.compile optimization
- **Latency**: ~7-20ms per token
**CPU Performance**:
- **Inference Speed**: ~15-25 tokens/sec
- **Memory Usage**: ~2-3GB RAM
- **Features**: Multi-threading, CPU-optimized parameters
- **Latency**: ~40-67ms per token
**General**:
- **Context Length**: 1024 tokens maximum
- **Model Size**: 124M parameters
## Acknowledgments
- Based on Andrej Karpathy's nanoGPT implementation
- Trained on Hugging Face's FineWeb-edu dataset
- Uses OpenAI's GPT-2 tokenizer