# Ursa_Minor_Smashed

A GPT-2 (124M parameters) model trained from scratch on the FineWeb-edu dataset using a modern training recipe. This model reproduces the GPT-2 small architecture with updated training techniques including Flash Attention, mixed precision training, and improved optimization strategies.

## Model Details

### Model Description

- **Developed by:** Kaileh57
- **Model type:** GPT-2 (Decoder-only Transformer)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** Trained from scratch

### Model Architecture

| Parameter | Value |
|-----------|-------|
| Parameters | 124M |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 1024 |
| Vocabulary Size | 50,304 |
| Activation Function | GELU (tanh approximation) |
| Position Embeddings | Learned |
| Layer Norm | Pre-normalization |
| Attention Type | Multi-head with Flash Attention |
| Weight Tying | Token embeddings tied with output projection |

### Training Details

- **Dataset:** FineWeb-edu (10B token sample)
- **Training Regime:** Mixed precision (bfloat16)
- **Optimizer:** AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
- **Learning Rate:** 1.5e-3 with cosine decay
- **Batch Size:** 524,288 tokens
- **Training Steps:** 19,073 (1 epoch)
- **Warmup Steps:** 715
- **Weight Decay:** 0.1
- **Gradient Clipping:** 1.0
- **Hardware:** NVIDIA RTX A6000 (48GB)
- **Training Time:** ~2 days

### Performance

| Benchmark | Score |
|-----------|-------|
| HellaSwag | 32.4% |
| Final Loss | 2.85 |
| Perplexity | ~17.3 |

*Note: Scores are from single epoch training.
Multi-epoch training reaches ~35% on HellaSwag.*

## Uses

### Direct Use

This model can be used for:

- Text generation
- Research on efficient training methods
- Educational purposes to understand GPT architectures
- Fine-tuning for specific downstream tasks

### Out-of-Scope Use

- Production applications requiring high reliability
- Generation of factual information (model may hallucinate)
- Any use case requiring larger context than 1024 tokens

## Quick Start

### Installation

#### CPU Installation (Default)

```bash
# Clone the repository
git clone https://github.com/Kaileh57/Ursa_Minor_Smashed.git
cd Ursa_Minor_Smashed

# Set up CPU environment
pip install -r requirements.txt
# or use the setup script:
./setup.sh
```

#### CUDA Installation (For GPU Acceleration)

If you have a CUDA-capable GPU and want to use GPU acceleration:

**Linux/macOS:**

```bash
# Use the automated CUDA setup script
chmod +x setup-cuda.sh
./setup-cuda.sh
```

**Windows:**

```batch
:: Use the Windows CUDA setup script
setup-cuda.bat
```

**Manual CUDA Installation:**

```bash
# Create separate CUDA environment
python -m venv venv-cuda
source venv-cuda/bin/activate  # On Windows: venv-cuda\Scripts\activate

# Install CUDA requirements
pip install -r requirements-cuda.txt
```

**CUDA Requirements:**

- NVIDIA GPU with CUDA Compute Capability 3.5 or higher
- NVIDIA drivers installed
- CUDA 11.8 or 12.1 toolkit (optional; PyTorch bundles its own CUDA runtime)
- At least 4GB GPU memory recommended

### Basic Usage

#### Command Line Interface

**CUDA Version (GPU):**

```bash
# Basic text generation
python inference_cuda.py --prompt "Hello, I'm a language model" --max-tokens 50

# Creative writing
python inference_cuda.py --prompt "Once upon a time" --max-tokens 200 --temperature 0.9

# More focused output
python inference_cuda.py --prompt "The key to machine learning is" --max-tokens 100 --temperature 0.7 --top-k 50
```

**CPU Version:**

```bash
# Basic text generation
python inference_cpu.py --prompt "Hello, I'm a language model" --max-tokens 50

# Creative writing
python inference_cpu.py --prompt "Once upon a time" --max-tokens 120 --temperature 0.9

# More focused output
python inference_cpu.py --prompt "The key to machine learning is" --max-tokens 80 --temperature 0.7 --top-k 30
```

#### Python Interface

**CUDA Version:**

```python
from inference_cuda import generate_direct, load_model_direct

# Load model once (requires CUDA)
model = load_model_direct("model_optimized.pt")

# Generate text with CUDA optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=100,  # Higher token budget for GPU
    temperature=0.8,
    top_k=50,  # Higher top_k for better quality
)
print(result)
```

**CPU Version:**

```python
from inference_cpu import generate_direct, load_model_direct

# Load model once (CPU only)
model = load_model_direct("model_optimized.pt")

# Generate text with CPU optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=80,  # Lower token budget for CPU efficiency
    temperature=0.8,
    top_k=30,  # Lower top_k for CPU efficiency
)
print(result)
```

#### Chat Interface

**CUDA Version:**

```bash
# Start CUDA-optimized chat
python chat_cuda.py
```

**CPU Version:**

```bash
# Start CPU-optimized chat
python chat_cpu.py
```

## Training Procedure

The model was trained using a modern GPT training recipe including:

- Flash Attention for efficient attention computation
- Mixed precision training with bfloat16
- Gradient accumulation to achieve large batch sizes
- TF32 for faster matrix multiplications
- Optimized vocabulary size (50,304) for better GPU utilization

### Training Hyperparameters

- **Learning rate schedule:** Cosine decay from 1.5e-3 to 1.5e-4
- **Gradient accumulation steps:** Dynamically calculated
- **Mixed precision:** bfloat16 with PyTorch autocast

## Evaluation

### Testing Data

Evaluated on:

- FineWeb-edu validation set
- HellaSwag benchmark (10,042 examples)

### Metrics

- Cross-entropy loss on validation set
- Accuracy on
HellaSwag commonsense reasoning

## Technical Specifications

### Compute Infrastructure

- **Hardware:** 1x NVIDIA RTX A6000 (48GB VRAM)
- **Software:** PyTorch 2.0+, CUDA 12.1, Flash Attention 2

### Model Initialization

- Weights initialized with σ=0.02 (scaled by 1/√(2×n_layers) for residual projections)
- Embeddings initialized with σ=0.02
- Biases initialized to zero
- LayerNorm weights initialized to 1.0, biases to 0.0

## Citation

If you use this model, please cite:

```bibtex
@misc{ursa-minor-smashed,
  author = {Kaileh57},
  title = {Ursa Minor Smashed: Efficient GPT-2 Training},
  year = {2024},
  url = {https://github.com/Kaileh57/Ursa_Minor_Smashed}
}
```

## Available Tools

### Core Scripts

- **`inference_cuda.py`** - CUDA-optimized inference script
- **`inference_cpu.py`** - CPU-optimized inference script
- **`chat_cuda.py`** - CUDA-optimized chat interface
- **`chat_cpu.py`** - CPU-optimized chat interface
- **`benchmark_cuda.py`** - CUDA performance benchmarking tool
- **`benchmark_cpu.py`** - CPU performance benchmarking tool
- **`convert_to_gguf.py`** - Convert to GGUF format for llama.cpp

### Examples

- **`examples/basic_usage_cuda.py`** - CUDA usage examples
- **`examples/basic_usage_cpu.py`** - CPU usage examples

### Parameters

#### Generation Parameters

- **`temperature`** (0.1-1.0): Controls randomness (lower = more focused)
- **`top_k`** (1-100): Limits sampling to the k most likely tokens
- **`top_p`** (0.1-1.0): Nucleus sampling threshold
- **`repetition_penalty`** (1.0-2.0): Reduces repetitive output
- **`max_tokens`**: Maximum number of tokens to generate

#### Recommended Settings

- **Creative writing:** temp=0.8-0.9, top_p=0.9, top_k=50
- **Factual content:** temp=0.3-0.5, top_p=0.8, top_k=20
- **Code generation:** temp=0.2-0.4, top_p=0.7, top_k=10

## Performance

**CUDA Performance (GPU):**

- **Inference Speed:** ~50-150+ tokens/sec (depending on GPU)
- **Memory Usage:** ~2-4GB VRAM
- **Features:** CUDA autocast, torch.compile optimization
- **Latency:** ~10-20ms per token
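The throughput figures in this section can be checked with a simple timing loop. A minimal sketch is below; the `fake_generate` stub is a placeholder standing in for a real call such as `generate_direct(model, prompt, max_new_tokens=n)` from the inference scripts:

```python
import time

def measure_throughput(generate, n_tokens=50):
    """Time one generation call and report tokens/sec and ms/token."""
    start = time.perf_counter()
    generate(n_tokens)  # stand-in for generate_direct(model, prompt, max_new_tokens=n_tokens)
    elapsed = time.perf_counter() - start
    tokens_per_sec = n_tokens / elapsed
    ms_per_token = 1000 * elapsed / n_tokens
    return tokens_per_sec, ms_per_token

# Placeholder "model": sleeps ~1 ms per token to mimic generation cost
def fake_generate(n_tokens):
    for _ in range(n_tokens):
        time.sleep(0.001)

tps, ms = measure_throughput(fake_generate)
print(f"{tps:.0f} tokens/sec, {ms:.2f} ms/token")
```

Note that tokens/sec and ms/token are reciprocals (e.g. ~10-20 ms/token corresponds to ~50-100 tokens/sec), so a large gap between the two reported numbers usually signals a measurement bug.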
**CPU Performance:**

- **Inference Speed:** ~15-25 tokens/sec
- **Memory Usage:** ~2-3GB RAM
- **Features:** Multi-threading, CPU-optimized parameters
- **Latency:** ~40-65ms per token

**General:**

- **Context Length:** 1024 tokens maximum
- **Model Size:** 124M parameters

## Acknowledgments

- Based on Andrej Karpathy's nanoGPT implementation
- Trained on HuggingFace's FineWeb-edu dataset
- Uses OpenAI's GPT-2 tokenizer
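For readers curious how the generation parameters described above combine, here is a stdlib-only illustration of temperature scaling followed by top-k filtering. This is a sketch for intuition, not the repository's actual sampling code:

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_k=50, rng=random):
    """Temperature-scaled top-k sampling over a {token: logit} dict."""
    # Keep only the top_k highest-logit candidates
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature < 1 sharpens the distribution; > 1 flattens it
    scaled = [(tok, logit / temperature) for tok, logit in top]
    # Softmax over the filtered, scaled logits (shift by max for stability)
    m = max(s for _, s in scaled)
    exps = [(tok, math.exp(s - m)) for tok, s in scaled]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]
    # Draw one token from the resulting distribution
    r = rng.random()
    cum = 0.0
    for tok, p in probs:
        cum += p
        if r <= cum:
            return tok
    return probs[-1][0]

logits = {"cat": 2.0, "dog": 1.5, "car": 0.2, "the": -1.0}
print(sample_next_token(logits, temperature=0.7, top_k=2))  # "cat" or "dog"
```

With `top_k=1` this reduces to greedy decoding, and raising `temperature` toward 1.0 spreads probability onto lower-ranked tokens, which is why the recommended creative-writing settings use higher values than the factual ones.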