# Ursa_Minor_Smashed
A GPT-2 (124M parameters) model trained from scratch on the FineWeb-edu dataset using a modern training recipe. This model reproduces the GPT-2 small architecture with updated training techniques including Flash Attention, mixed precision training, and improved optimization strategies.
## Model Details
### Model Description
- Developed by: Kaileh57
- Model type: GPT-2 (Decoder-only Transformer)
- Language(s): English
- License: MIT
- Finetuned from model: Not applicable (trained from scratch)
### Model Architecture
| Parameter | Value |
|---|---|
| Parameters | 124M |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 1024 |
| Vocabulary Size | 50,304 |
| Activation Function | GELU (tanh approximation) |
| Position Embeddings | Learned |
| Layer Norm | Pre-normalization |
| Attention Type | Multi-head with Flash Attention |
| Weight Tying | Token embeddings tied with output projection |
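The table above maps onto a small config object. The field names below follow common GPT-2 reimplementations (such as nanoGPT) rather than this repository's actual classes, and the parameter count is an approximation that ignores some bias and LayerNorm terms:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Values from the architecture table above
    block_size: int = 1024    # context length
    vocab_size: int = 50304   # 50,257 GPT-2 tokens padded up to a multiple of 128
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768         # hidden size

def n_params(cfg: GPTConfig) -> int:
    """Approximate parameter count with weight tying (embedding shared with the output head)."""
    emb = cfg.vocab_size * cfg.n_embd + cfg.block_size * cfg.n_embd
    per_block = 12 * cfg.n_embd ** 2 + 13 * cfg.n_embd  # attention + MLP weights (biases approximated)
    return emb + cfg.n_layer * per_block

print(f"{n_params(GPTConfig()) / 1e6:.0f}M")  # → 124M
```

Because the token embedding is tied to the output projection, the unembedding matrix adds no extra parameters.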
## Training Details
- Dataset: FineWeb-edu (10B token sample)
- Training Regime: Mixed precision (bfloat16)
- Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
- Learning Rate: 1.5e-3 with cosine decay
- Batch Size: 524,288 tokens
- Training Steps: 19,073 (1 epoch)
- Warmup Steps: 715
- Weight Decay: 0.1
- Gradient Clipping: 1.0
- Hardware: NVIDIA RTX A6000 (48GB)
- Training Time: ~2 days
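The batch-size and step-count figures above are mutually consistent, which makes for a quick sanity check. The micro-batch split shown is hypothetical, since only the total token batch is documented:

```python
tokens_total = 10_000_000_000    # 10B-token FineWeb-edu sample
tokens_per_step = 524_288        # 2**19 tokens per optimizer step
print(tokens_total // tokens_per_step)  # → 19073, matching the listed step count for 1 epoch

# With a 1024-token context, one optimizer step could be built from, e.g.,
# 32 sequences per micro-batch accumulated over 16 micro-batches
# (an illustrative split, not the documented one):
micro_bsz, seq_len, accum = 32, 1024, 16
assert micro_bsz * seq_len * accum == tokens_per_step
```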
## Performance
| Metric | Value |
|---|---|
| HellaSwag | 32.4% |
| Final Loss | 2.85 |
| Perplexity | ~17.3 |
*Note:* Scores are from single-epoch training; multi-epoch training reaches ~35% on HellaSwag.
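The perplexity entry follows directly from the final loss, since perplexity is the exponential of the cross-entropy loss:

```python
import math

final_loss = 2.85             # validation cross-entropy from the table above
perplexity = math.exp(final_loss)
print(f"{perplexity:.1f}")    # → 17.3
```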
## Uses
### Direct Use
This model can be used for:
- Text generation
- Research on efficient training methods
- Educational purposes to understand GPT architectures
- Fine-tuning for specific downstream tasks
### Out-of-Scope Use
- Production applications requiring high reliability
- Generation of factual information (model may hallucinate)
- Any use case requiring larger context than 1024 tokens
## Quick Start
### Installation
#### CPU Installation (Default)

```bash
# Clone the repository
git clone https://github.com/Kaileh57/Ursa_Minor_Smashed.git
cd Ursa_Minor_Smashed

# Set up CPU environment
pip install -r requirements.txt
# or use the setup script: ./setup.sh
```
#### CUDA Installation (For GPU Acceleration)
If you have a CUDA-capable GPU and want to use GPU acceleration:

**Linux/macOS:**

```bash
# Use the automated CUDA setup script
chmod +x setup-cuda.sh
./setup-cuda.sh
```

**Windows:**

```bat
:: Use the Windows CUDA setup script
setup-cuda.bat
```

**Manual CUDA Installation:**

```bash
# Create a separate CUDA environment
python -m venv venv-cuda
source venv-cuda/bin/activate  # On Windows: venv-cuda\Scripts\activate

# Install CUDA requirements
pip install -r requirements-cuda.txt
```
**CUDA Requirements:**
- NVIDIA GPU with CUDA Compute Capability 3.5 or higher
- NVIDIA drivers installed
- CUDA 11.8 or 12.1 toolkit (optional, PyTorch includes CUDA runtime)
- At least 4GB GPU memory recommended
### Basic Usage
#### Command Line Interface

**CUDA Version (GPU):**

```bash
# Basic text generation
python inference_cuda.py --prompt "Hello, I'm a language model" --max-tokens 50

# Creative writing
python inference_cuda.py --prompt "Once upon a time" --max-tokens 200 --temperature 0.9

# More focused output
python inference_cuda.py --prompt "The key to machine learning is" --max-tokens 100 --temperature 0.7 --top-k 50
```

**CPU Version:**

```bash
# Basic text generation
python inference_cpu.py --prompt "Hello, I'm a language model" --max-tokens 50

# Creative writing
python inference_cpu.py --prompt "Once upon a time" --max-tokens 120 --temperature 0.9

# More focused output
python inference_cpu.py --prompt "The key to machine learning is" --max-tokens 80 --temperature 0.7 --top-k 30
```
#### Python Interface

**CUDA Version:**

```python
from inference_cuda import generate_direct, load_model_direct

# Load the model once (requires CUDA)
model = load_model_direct("model_optimized.pt")

# Generate text with CUDA optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=100,  # a higher token budget is affordable on GPU
    temperature=0.8,
    top_k=50,            # higher top_k for better quality
)
print(result)
```
**CPU Version:**

```python
from inference_cpu import generate_direct, load_model_direct

# Load the model once (CPU only)
model = load_model_direct("model_optimized.pt")

# Generate text with CPU optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=80,  # lower token budget for CPU efficiency
    temperature=0.8,
    top_k=30,           # lower top_k for CPU efficiency
)
print(result)
```
#### Chat Interface

**CUDA Version:**

```bash
# Start CUDA-optimized chat
python chat_cuda.py
```

**CPU Version:**

```bash
# Start CPU-optimized chat
python chat_cpu.py
```
## Training Procedure
The model was trained using a modern GPT training recipe including:
- Flash Attention for efficient attention computation
- Mixed precision training with bfloat16
- Gradient accumulation to achieve large batch sizes
- TF32 for faster matrix multiplications
- Optimized vocabulary size (50,304) for better GPU utilization
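In PyTorch 2.x, Flash Attention is typically reached through `torch.nn.functional.scaled_dot_product_attention`, which dispatches to the Flash kernel when the hardware and dtypes allow it. A minimal sketch (illustrative, not this repository's attention module):

```python
import torch
import torch.nn.functional as F

# Shapes for GPT-2 small: 12 heads, head_dim = 768 / 12 = 64
B, H, T, D = 2, 12, 256, 64
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# is_causal=True applies the autoregressive mask; on CUDA with fp16/bf16
# inputs this call dispatches to the Flash Attention kernel when available.
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(y.shape)  # torch.Size([2, 12, 256, 64])
```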
### Training Hyperparameters
- Learning rate schedule: Cosine decay from 1.5e-3 to 1.5e-4
- Gradient accumulation steps: computed at runtime so that accumulated micro-batches sum to the 524,288-token batch
- Mixed precision: bfloat16 with PyTorch autocast
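The warmup-then-cosine schedule can be written directly. The numbers below are taken from the lists in this card; the function shape mirrors common GPT training scripts and is a sketch, not the repository's exact code:

```python
import math

max_lr, min_lr = 1.5e-3, 1.5e-4      # peak and floor learning rates
warmup_steps, max_steps = 715, 19073  # values from the training details above

def get_lr(step: int) -> float:
    # Linear warmup to max_lr, then cosine decay down to min_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * min(ratio, 1.0)))
    return min_lr + coeff * (max_lr - min_lr)

# Start of warmup, peak at end of warmup, floor at the final step:
print(f"{get_lr(0):.2e}  {get_lr(714):.2e}  {get_lr(19072):.2e}")
```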
## Evaluation
### Testing Data
Evaluated on:
- FineWeb-edu validation set
- HellaSwag benchmark (10,042 examples)
### Metrics
- Cross-entropy loss on validation set
- Accuracy on HellaSwag commonsense reasoning
## Technical Specifications
### Compute Infrastructure
- Hardware: 1x NVIDIA RTX A6000 (48GB VRAM)
- Software: PyTorch 2.0+, CUDA 12.1, Flash Attention 2
### Model Initialization
- Weights initialized with σ=0.02 (scaled by 1/√(2×n_layers) for residual projections)
- Embeddings initialized with σ=0.02
- Biases initialized to zero
- LayerNorm weights initialized to 1.0, biases to 0.0
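A sketch of how that scheme is typically applied with PyTorch's `Module.apply`; the `is_residual_proj` flag is a hypothetical marker for the residual-path projections, not an attribute from this repository:

```python
import math
import torch.nn as nn

N_LAYER = 12  # from the architecture table

def init_weights(module: nn.Module) -> None:
    """GPT-2-style initialization matching the scheme described above (illustrative)."""
    if isinstance(module, nn.Linear):
        std = 0.02
        # Residual projections are scaled by 1/sqrt(2 * n_layers); here we assume
        # they are tagged with a custom attribute (hypothetical).
        if getattr(module, "is_residual_proj", False):
            std *= 1.0 / math.sqrt(2 * N_LAYER)
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)
```

A model would run `model.apply(init_weights)` once after construction.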
## Citation
If you use this model, please cite:

```bibtex
@misc{ursa-minor-smashed,
  author = {Kaileh57},
  title  = {Ursa Minor Smashed: Efficient GPT-2 Training},
  year   = {2024},
  url    = {https://github.com/Kaileh57/Ursa_Minor_Smashed}
}
```
## Available Tools
### Core Scripts
- `inference_cuda.py` - CUDA-optimized inference script
- `inference_cpu.py` - CPU-optimized inference script
- `chat_cuda.py` - CUDA-optimized chat interface
- `chat_cpu.py` - CPU-optimized chat interface
- `benchmark_cuda.py` - CUDA performance benchmarking tool
- `benchmark_cpu.py` - CPU performance benchmarking tool
- `convert_to_gguf.py` - Convert to GGUF format for llama.cpp
### Examples
- `examples/basic_usage_cuda.py` - CUDA usage examples
- `examples/basic_usage_cpu.py` - CPU usage examples
## Parameters
### Generation Parameters
- `temperature` (0.1-1.0): Controls randomness (lower = more focused)
- `top_k` (1-100): Limit sampling to the top-k most likely tokens
- `top_p` (0.1-1.0): Nucleus sampling threshold
- `repetition_penalty` (1.0-2.0): Reduce repetitive output
- `max_tokens`: Maximum number of tokens to generate
### Recommended Settings
- Creative writing: temp=0.8-0.9, top_p=0.9, top_k=50
- Factual content: temp=0.3-0.5, top_p=0.8, top_k=20
- Code generation: temp=0.2-0.4, top_p=0.7, top_k=10
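These parameters compose in a standard way. A minimal stdlib sketch of temperature plus top-k sampling (not this repository's sampler, which also supports `top_p` and `repetition_penalty`):

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=50):
    """Pick the next token id from raw logits with temperature + top-k (sketch)."""
    scaled = [l / temperature for l in logits]  # <1.0 sharpens, >1.0 flattens the distribution
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    top = order[:top_k]                          # keep only the k most likely tokens
    m = max(scaled[i] for i in top)              # subtract max for numerical stability
    weights = [math.exp(scaled[i] - m) for i in top]
    return random.choices(top, weights=weights, k=1)[0]

# With top_k=1 this degenerates to greedy decoding:
print(sample_next([0.1, 5.0, 0.2], top_k=1))  # → 1
```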
## Inference Performance

**CUDA (GPU):**
- Inference Speed: ~50-150+ tokens/sec (depends on GPU)
- Memory Usage: ~2-4GB VRAM
- Features: CUDA autocast, torch.compile optimization
- Latency: ~10-20ms per token

**CPU:**
- Inference Speed: ~15-25 tokens/sec
- Memory Usage: ~2-3GB RAM
- Features: Multi-threading, CPU-optimized parameters
- Latency: ~40-65ms per token

**General:**
- Context Length: 1024 tokens maximum
- Model Size: 124M parameters
## Acknowledgments
- Based on Andrej Karpathy's nanoGPT implementation
- Trained on HuggingFace's FineWeb-edu dataset
- Uses OpenAI's GPT-2 tokenizer