# Ursa_Minor_Smashed
A GPT-2 (124M parameters) model trained from scratch on the FineWeb-edu dataset using a modern training recipe. This model reproduces the GPT-2 small architecture with updated training techniques including Flash Attention, mixed precision training, and improved optimization strategies.
## Model Details
### Model Description
- **Developed by:** Kaileh57
- **Model type:** GPT-2 (Decoder-only Transformer)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** None (trained from scratch)
### Model Architecture
| Parameter | Value |
|-----------|-------|
| Parameters | 124M |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 1024 |
| Vocabulary Size | 50,304 |
| Activation Function | GELU (tanh approximation) |
| Position Embeddings | Learned |
| Layer Norm | Pre-normalization |
| Attention Type | Multi-head with Flash Attention |
| Weight Tying | Token embeddings tied with output projection |
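The table above can be summarized in a small configuration object. The sketch below is illustrative (the field names are assumptions, not the repo's actual class), but the parameter arithmetic checks out against the 124M figure:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    n_layer: int = 12        # transformer blocks
    n_head: int = 12         # attention heads
    n_embd: int = 768        # hidden size
    block_size: int = 1024   # context length
    vocab_size: int = 50304  # padded vocab for better GPU utilization

def approx_params(c: GPTConfig) -> int:
    # token + position embeddings (output projection is weight-tied, so free)
    emb = c.vocab_size * c.n_embd + c.block_size * c.n_embd
    # per block: attention qkv+proj (4*d^2) + MLP up/down (8*d^2), ignoring biases/norms
    per_block = 12 * c.n_embd ** 2
    return emb + c.n_layer * per_block

print(f"{approx_params(GPTConfig()) / 1e6:.0f}M")  # → 124M
```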
### Training Details
- **Dataset:** FineWeb-edu (10B token sample)
- **Training Regime:** Mixed precision (bfloat16)
- **Optimizer:** AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
- **Learning Rate:** 1.5e-3 with cosine decay
- **Batch Size:** 524,288 tokens
- **Training Steps:** 19,073 (1 epoch)
- **Warmup Steps:** 715
- **Weight Decay:** 0.1
- **Gradient Clipping:** 1.0
- **Hardware:** NVIDIA RTX A6000 (48GB)
- **Training Time:** ~2 days
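The learning-rate schedule implied by these hyperparameters is linear warmup followed by cosine decay. A minimal sketch using the values from the list above (the repo's actual schedule code may differ in details, e.g. the exact decay floor behavior):

```python
import math

MAX_LR, MIN_LR = 1.5e-3, 1.5e-4   # cosine decay from max to max/10
WARMUP_STEPS, MAX_STEPS = 715, 19073

def get_lr(step: int) -> float:
    # linear warmup over the first 715 steps
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    # after training ends, hold at the minimum
    if step >= MAX_STEPS:
        return MIN_LR
    # cosine decay: coefficient goes 1 → 0 over the remaining steps
    decay_ratio = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```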
### Performance
| Benchmark | Score |
|-----------|-------|
| HellaSwag | 32.4% |
| Final Loss | 2.85 |
| Perplexity | ~17.3 |
*Note: Scores are from single epoch training. Multi-epoch training reaches ~35% on HellaSwag.*
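The perplexity row follows directly from the loss: perplexity is the exponential of the per-token cross-entropy (in nats). A quick check against the table:

```python
import math

final_loss = 2.85                  # validation cross-entropy, nats per token
perplexity = math.exp(final_loss)  # perplexity = e^loss
print(f"{perplexity:.1f}")         # → 17.3, matching the table
```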
## Uses
### Direct Use
This model can be used for:
- Text generation
- Research on efficient training methods
- Educational purposes to understand GPT architectures
- Fine-tuning for specific downstream tasks
### Out-of-Scope Use
- Production applications requiring high reliability
- Generation of factual information (model may hallucinate)
- Any use case requiring larger context than 1024 tokens
## Quick Start
### Installation
#### CPU Installation (Default)
```bash
# Clone the repository
git clone https://github.com/Kaileh57/Ursa_Minor_Smashed.git
cd Ursa_Minor_Smashed
# Set up CPU environment
pip install -r requirements.txt
# or use the setup script: ./setup.sh
```
#### CUDA Installation (For GPU Acceleration)
If you have a CUDA-capable GPU and want to use GPU acceleration:
**Linux/macOS:**
```bash
# Use the automated CUDA setup script
chmod +x setup-cuda.sh
./setup-cuda.sh
```
**Windows:**
```batch
REM Use the Windows CUDA setup script
setup-cuda.bat
```
**Manual CUDA Installation:**
```bash
# Create separate CUDA environment
python -m venv venv-cuda
source venv-cuda/bin/activate # On Windows: venv-cuda\Scripts\activate
# Install CUDA requirements
pip install -r requirements-cuda.txt
```
**CUDA Requirements:**
- NVIDIA GPU with CUDA Compute Capability 3.5 or higher
- NVIDIA drivers installed
- CUDA 11.8 or 12.1 toolkit (optional, PyTorch includes CUDA runtime)
- At least 4GB GPU memory recommended
### Basic Usage
#### Command Line Interface
**CUDA Version (GPU):**
```bash
# Basic text generation
python inference_cuda.py --prompt "Hello, I'm a language model" --max-tokens 50
# Creative writing
python inference_cuda.py --prompt "Once upon a time" --max-tokens 200 --temperature 0.9
# More focused output
python inference_cuda.py --prompt "The key to machine learning is" --max-tokens 100 --temperature 0.7 --top-k 50
```
**CPU Version:**
```bash
# Basic text generation
python inference_cpu.py --prompt "Hello, I'm a language model" --max-tokens 50
# Creative writing
python inference_cpu.py --prompt "Once upon a time" --max-tokens 120 --temperature 0.9
# More focused output
python inference_cpu.py --prompt "The key to machine learning is" --max-tokens 80 --temperature 0.7 --top-k 30
```
#### Python Interface
**CUDA Version:**
```python
from inference_cuda import generate_direct, load_model_direct
# Load model once (requires CUDA)
model = load_model_direct("model_optimized.pt")
# Generate text with CUDA optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=100,  # higher token budget for GPU
    temperature=0.8,
    top_k=50             # higher top_k for better quality
)
print(result)
```
**CPU Version:**
```python
from inference_cpu import generate_direct, load_model_direct
# Load model once (CPU only)
model = load_model_direct("model_optimized.pt")
# Generate text with CPU optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=80,  # lower token budget for CPU efficiency
    temperature=0.8,
    top_k=30            # lower top_k for CPU efficiency
)
print(result)
```
#### Chat Interface
**CUDA Version:**
```bash
# Start CUDA-optimized chat
python chat_cuda.py
```
**CPU Version:**
```bash
# Start CPU-optimized chat
python chat_cpu.py
```
## Training Procedure
The model was trained using a modern GPT training recipe including:
- Flash Attention for efficient attention computation
- Mixed precision training with bfloat16
- Gradient accumulation to achieve large batch sizes
- TF32 for faster matrix multiplications
- Optimized vocabulary size (50,304) for better GPU utilization
### Training Hyperparameters
- **Learning rate schedule:** Cosine decay from 1.5e-3 to 1.5e-4
- **Gradient accumulation steps:** Dynamically calculated
- **Mixed precision:** bfloat16 with PyTorch autocast
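The "dynamically calculated" gradient accumulation steps follow from dividing the target batch of 524,288 tokens by the micro-batch that fits in memory. A sketch of that arithmetic (the micro-batch size of 16 is an assumption for a 48GB A6000, not a documented value):

```python
TOTAL_BATCH = 524_288   # tokens per optimizer step (from the table)
B, T = 16, 1024         # assumed micro-batch size × context length

# accumulate gradients over enough micro-batches to reach the target batch
assert TOTAL_BATCH % (B * T) == 0
grad_accum_steps = TOTAL_BATCH // (B * T)  # → 32

# sanity check: 19,073 steps × 524,288 tokens ≈ one epoch over 10B tokens
tokens_seen = 19_073 * TOTAL_BATCH
```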
## Evaluation
### Testing Data
Evaluated on:
- FineWeb-edu validation set
- HellaSwag benchmark (10,042 examples)
### Metrics
- Cross-entropy loss on validation set
- Accuracy on HellaSwag commonsense reasoning
## Technical Specifications
### Compute Infrastructure
- **Hardware:** 1x NVIDIA RTX A6000 (48GB VRAM)
- **Software:** PyTorch 2.0+, CUDA 12.1, Flash Attention 2
### Model Initialization
- Weights initialized with σ=0.02 (scaled by 1/√(2×n_layers) for residual projections)
- Embeddings initialized with σ=0.02
- Biases initialized to zero
- LayerNorm weights initialized to 1.0, biases to 0.0
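The residual-projection scaling above follows the GPT-2 paper: each of the 12 blocks adds two residual contributions (attention and MLP), so the projection std is shrunk by 1/√(2 × n_layers) to keep the residual stream's variance stable at depth. Numerically:

```python
import math

BASE_STD = 0.02
N_LAYER = 12

# 2 residual additions per block (attention + MLP) → 2 * n_layer total
residual_std = BASE_STD / math.sqrt(2 * N_LAYER)
print(f"{residual_std:.5f}")  # ≈ 0.00408
```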
## Citation
If you use this model, please cite:
```bibtex
@misc{ursa-minor-smashed,
  author = {Kaileh57},
  title  = {Ursa Minor Smashed: Efficient GPT-2 Training},
  year   = {2024},
  url    = {https://github.com/Kaileh57/Ursa_Minor_Smashed}
}
```
## Available Tools
### Core Scripts
- **`inference_cuda.py`** - CUDA-optimized inference script
- **`inference_cpu.py`** - CPU-optimized inference script
- **`chat_cuda.py`** - CUDA-optimized chat interface
- **`chat_cpu.py`** - CPU-optimized chat interface
- **`benchmark_cuda.py`** - CUDA performance benchmarking tool
- **`benchmark_cpu.py`** - CPU performance benchmarking tool
- **`convert_to_gguf.py`** - Convert to GGUF format for llama.cpp
### Examples
- **`examples/basic_usage_cuda.py`** - CUDA usage examples
- **`examples/basic_usage_cpu.py`** - CPU usage examples
### Parameters
#### Generation Parameters
- **`temperature`** (0.1-1.0): Controls randomness (lower = more focused)
- **`top_k`** (1-100): Limit to top-k most likely tokens
- **`top_p`** (0.1-1.0): Nucleus sampling threshold
- **`repetition_penalty`** (1.0-2.0): Reduce repetitive output
- **`max_tokens`**: Maximum tokens to generate
#### Recommended Settings
- **Creative writing**: temp=0.8-0.9, top_p=0.9, top_k=50
- **Factual content**: temp=0.3-0.5, top_p=0.8, top_k=20
- **Code generation**: temp=0.2-0.4, top_p=0.7, top_k=10
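To see how `temperature` and `top_k` interact, here is a dependency-free sketch of the sampling step (this is a generic illustration of the technique, not the repo's actual sampling code): lower temperature sharpens the distribution, and top-k masks everything outside the k most likely tokens before normalizing.

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=50, rng=random):
    # 1. temperature scaling: lower temperature => sharper distribution
    scaled = [l / temperature for l in logits]
    # 2. keep only the top-k token indices, masking the rest
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # 3. softmax over the survivors (subtract max for numerical stability)
    m = max(scaled[i] for i in ranked)
    weights = {i: math.exp(scaled[i] - m) for i in ranked}
    # 4. draw one index proportionally to its weight
    r = rng.random() * sum(weights.values())
    for i, w in weights.items():
        r -= w
        if r <= 0:
            return i
    return ranked[-1]
```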
## Performance
**CUDA Performance (GPU)**:
- **Inference Speed**: ~50-150+ tokens/sec (depends on GPU)
- **Memory Usage**: ~2-4GB VRAM
- **Features**: CUDA autocast, torch.compile optimization
- **Latency**: ~10-20ms per token
**CPU Performance**:
- **Inference Speed**: ~15-25 tokens/sec
- **Memory Usage**: ~2-3GB RAM
- **Features**: Multi-threading, CPU-optimized parameters
- **Latency**: ~40-65ms per token
**General**:
- **Context Length**: 1024 tokens maximum
- **Model Size**: 124M parameters
## Acknowledgments
- Based on Andrej Karpathy's nanoGPT implementation
- Trained on HuggingFace's FineWeb-edu dataset
- Uses OpenAI's GPT-2 tokenizer