# Ursa_Minor_Smashed

A GPT-2 model (124M parameters) trained from scratch on the FineWeb-edu dataset. It reproduces the GPT-2 small architecture with a modern training recipe, including Flash Attention, mixed-precision training, and improved optimization strategies.

## Model Details

### Model Description
- **Developed by:** Kaileh57
- **Model type:** GPT-2 (Decoder-only Transformer)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** None (trained from scratch)

### Model Architecture

| Parameter | Value |
|-----------|-------|
| Parameters | 124M |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 1024 |
| Vocabulary Size | 50,304 |
| Activation Function | GELU (tanh approximation) |
| Position Embeddings | Learned |
| Layer Norm | Pre-normalization |
| Attention Type | Multi-head with Flash Attention |
| Weight Tying | Token embeddings tied with output projection |
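
For orientation, the table above maps onto a small configuration object. The sketch below uses nanoGPT-style field names (illustrative, not necessarily the repo's actual class) and shows where the 124M figure comes from:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024    # context length
    vocab_size: int = 50304   # GPT-2's 50,257 padded up to a multiple of 128
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768

cfg = GPTConfig()
# Rough parameter count; weight tying makes the output projection free:
embeddings = cfg.vocab_size * cfg.n_embd + cfg.block_size * cfg.n_embd
per_layer = 12 * cfg.n_embd ** 2   # attention (4*d^2) + MLP (8*d^2), ignoring biases/LayerNorm
total = embeddings + cfg.n_layer * per_layer
print(f"~{total / 1e6:.0f}M parameters")  # ~124M parameters
```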

### Training Details

- **Dataset:** FineWeb-edu (10B token sample)
- **Training Regime:** Mixed precision (bfloat16)
- **Optimizer:** AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
- **Learning Rate:** 1.5e-3 with cosine decay
- **Batch Size:** 524,288 tokens
- **Training Steps:** 19,073 (1 epoch)
- **Warmup Steps:** 715
- **Weight Decay:** 0.1
- **Gradient Clipping:** 1.0
- **Hardware:** NVIDIA RTX A6000 (48GB)
- **Training Time:** ~2 days
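
The 524,288-token batch (2^19 tokens) does not fit in a single forward pass, so it is assembled with gradient accumulation. A sketch of the arithmetic, where the micro-batch size `B = 32` is an assumption:

```python
total_batch_tokens = 524_288          # 2**19 tokens per optimizer step
B, T = 32, 1024                       # assumed micro-batch size; T is the context length
grad_accum_steps = total_batch_tokens // (B * T)
print(grad_accum_steps)               # 16 forward/backward passes per optimizer step

# 19,073 steps * 524,288 tokens/step is ~10B tokens, i.e. one pass over the sample
tokens_seen = 19_073 * total_batch_tokens
```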

### Performance

| Benchmark | Score |
|-----------|-------|
| HellaSwag | 32.4% |
| Final Loss | 2.85 |
| Perplexity | ~17.3 |

*Note: Scores are from single epoch training. Multi-epoch training reaches ~35% on HellaSwag.*
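
The perplexity figure is just the exponential of the validation cross-entropy loss:

```python
import math

final_loss = 2.85                 # validation cross-entropy (nats/token)
perplexity = math.exp(final_loss)
print(f"{perplexity:.1f}")        # 17.3
```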

## Uses

### Direct Use
This model can be used for:
- Text generation
- Research on efficient training methods
- Educational purposes to understand GPT architectures
- Fine-tuning for specific downstream tasks

### Out-of-Scope Use
- Production applications requiring high reliability
- Generation of factual information (model may hallucinate)
- Any use case requiring larger context than 1024 tokens

## Quick Start

### Installation

#### CPU Installation (Default)
```bash
# Clone the repository
git clone https://github.com/Kaileh57/Ursa_Minor_Smashed.git
cd Ursa_Minor_Smashed

# Set up the CPU environment
pip install -r requirements.txt
# or use the setup script: ./setup.sh
```

#### CUDA Installation (For GPU Acceleration)
If you have a CUDA-capable GPU and want to use GPU acceleration:

**Linux/macOS:**
```bash
# Use the automated CUDA setup script
chmod +x setup-cuda.sh
./setup-cuda.sh
```

**Windows:**
```batch
REM Use the Windows CUDA setup script
setup-cuda.bat
```

**Manual CUDA Installation:**
```bash
# Create a separate CUDA environment
python -m venv venv-cuda
source venv-cuda/bin/activate  # On Windows: venv-cuda\Scripts\activate

# Install CUDA requirements
pip install -r requirements-cuda.txt
```

**CUDA Requirements:**
- NVIDIA GPU with CUDA Compute Capability 3.5 or higher
- NVIDIA drivers installed
- CUDA 11.8 or 12.1 toolkit (optional, PyTorch includes CUDA runtime)
- At least 4GB GPU memory recommended

### Basic Usage

#### Command Line Interface

**CUDA Version (GPU):**
```bash
# Basic text generation
python inference_cuda.py --prompt "Hello, I'm a language model" --max-tokens 50

# Creative writing
python inference_cuda.py --prompt "Once upon a time" --max-tokens 200 --temperature 0.9

# More focused output
python inference_cuda.py --prompt "The key to machine learning is" --max-tokens 100 --temperature 0.7 --top-k 50
```

**CPU Version:**
```bash
# Basic text generation
python inference_cpu.py --prompt "Hello, I'm a language model" --max-tokens 50

# Creative writing
python inference_cpu.py --prompt "Once upon a time" --max-tokens 120 --temperature 0.9

# More focused output
python inference_cpu.py --prompt "The key to machine learning is" --max-tokens 80 --temperature 0.7 --top-k 30
```

#### Python Interface

**CUDA Version:**
```python
from inference_cuda import generate_direct, load_model_direct

# Load the model once (requires CUDA)
model = load_model_direct("model_optimized.pt")

# Generate text with CUDA optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=100,  # a larger token budget is affordable on GPU
    temperature=0.8,
    top_k=50,            # higher top_k for better quality
)
print(result)
```

**CPU Version:**
```python
from inference_cpu import generate_direct, load_model_direct

# Load the model once (CPU only)
model = load_model_direct("model_optimized.pt")

# Generate text with CPU optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=80,   # lower token budget for CPU efficiency
    temperature=0.8,
    top_k=30,            # lower top_k for CPU efficiency
)
print(result)
```

#### Chat Interface

**CUDA Version:**
```bash
# Start CUDA-optimized chat
python chat_cuda.py
```

**CPU Version:**
```bash
# Start CPU-optimized chat
python chat_cpu.py
```

## Training Procedure

The model was trained using a modern GPT training recipe including:
- Flash Attention for efficient attention computation
- Mixed precision training with bfloat16
- Gradient accumulation to achieve large batch sizes
- TF32 for faster matrix multiplications
- Optimized vocabulary size (50,304) for better GPU utilization
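
The vocabulary padding in the last point is simple rounding: GPT-2's true vocabulary of 50,257 becomes 50,304, a multiple of 128, which gives friendlier GPU kernel shapes. A sketch:

```python
def pad_vocab(vocab_size: int, multiple: int = 128) -> int:
    """Round the tokenizer's vocab size up to the nearest multiple of `multiple`."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(pad_vocab(50257))  # 50304; the extra logits are simply never used as targets
```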

### Training Hyperparameters
- **Learning rate schedule:** 715-step linear warmup, then cosine decay from 1.5e-3 to 1.5e-4
- **Gradient accumulation steps:** Computed at runtime from the 524,288-token batch and the per-device micro-batch size
- **Mixed precision:** bfloat16 with PyTorch autocast
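
The warmup-plus-cosine schedule can be sketched as follows (a minimal reimplementation in the nanoGPT style, using the hyperparameters listed above; the function name is illustrative):

```python
import math

max_lr, min_lr = 1.5e-3, 1.5e-4
warmup_steps, max_steps = 715, 19_073

def get_lr(step: int) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr over the remaining steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```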

## Evaluation

### Testing Data
Evaluated on:
- FineWeb-edu validation set
- HellaSwag benchmark (10,042 examples)

### Metrics
- Cross-entropy loss on validation set
- Accuracy on HellaSwag commonsense reasoning

## Technical Specifications

### Compute Infrastructure
- **Hardware:** 1x NVIDIA RTX A6000 (48GB VRAM)
- **Software:** PyTorch 2.0+, CUDA 12.1, Flash Attention 2

### Model Initialization
- Weights initialized with σ=0.02 (scaled by 1/√(2×n_layers) for residual projections)
- Embeddings initialized with σ=0.02
- Biases initialized to zero
- LayerNorm weights initialized to 1.0, biases to 0.0
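
A quick check of the residual-projection scaling: each of the 12 blocks writes to the residual stream twice (attention and MLP), so the projection std is shrunk by 1/√(2×12) to keep the stream's variance roughly constant at depth:

```python
import math

base_std = 0.02
n_layer = 12
# 2 * n_layer additions to the residual stream -> scale std by 1/sqrt(2 * n_layer)
resid_std = base_std / math.sqrt(2 * n_layer)
print(f"{resid_std:.5f}")  # 0.00408
```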



## Citation

If you use this model, please cite:

```bibtex
@misc{ursa-minor-smashed,
  author = {Kaileh57},
  title = {Ursa Minor Smashed: Efficient GPT-2 Training},
  year = {2024},
  url = {https://github.com/Kaileh57/Ursa_Minor_Smashed}
}
```



## Available Tools

### Core Scripts
- **`inference_cuda.py`** - CUDA-optimized inference script
- **`inference_cpu.py`** - CPU-optimized inference script
- **`chat_cuda.py`** - CUDA-optimized chat interface
- **`chat_cpu.py`** - CPU-optimized chat interface
- **`benchmark_cuda.py`** - CUDA performance benchmarking tool
- **`benchmark_cpu.py`** - CPU performance benchmarking tool
- **`convert_to_gguf.py`** - Convert to GGUF format for llama.cpp

### Examples
- **`examples/basic_usage_cuda.py`** - CUDA usage examples
- **`examples/basic_usage_cpu.py`** - CPU usage examples



### Parameters

#### Generation Parameters
- **`temperature`** (0.1-1.0): Controls randomness (lower = more focused)
- **`top_k`** (1-100): Limit to top-k most likely tokens
- **`top_p`** (0.1-1.0): Nucleus sampling threshold
- **`repetition_penalty`** (1.0-2.0): Reduce repetitive output
- **`max_tokens`**: Maximum tokens to generate

#### Recommended Settings
- **Creative writing**: temp=0.8-0.9, top_p=0.9, top_k=50
- **Factual content**: temp=0.3-0.5, top_p=0.8, top_k=20
- **Code generation**: temp=0.2-0.4, top_p=0.7, top_k=10
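
To see how these parameters interact at sampling time, here is a torch-free sketch of temperature plus top-k sampling (the repo's inference scripts apply equivalent logic to the model's logit tensor; `sample_token` is illustrative):

```python
import math
import random

def sample_token(logits, temperature=0.8, top_k=50):
    """Temperature-scale the logits, keep the top-k, sample from the renormalized distribution."""
    scaled = [v / temperature for v in logits]            # lower temperature sharpens the distribution
    kept = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    peak = max(scaled[i] for i in kept)
    weights = [math.exp(scaled[i] - peak) for i in kept]  # numerically stable softmax over kept tokens
    return random.choices(kept, weights=weights, k=1)[0]
```

With `top_k=1` this reduces to greedy decoding; `top_p` (nucleus) sampling instead keeps the smallest set of highest-probability tokens whose probabilities sum to `top_p`.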



## Inference Performance

**CUDA (GPU)**:
- **Inference Speed**: ~50-150+ tokens/sec (depends on GPU)
- **Memory Usage**: ~2-4GB VRAM
- **Features**: CUDA autocast, torch.compile optimization
- **Latency**: ~10-20ms per token

**CPU**:
- **Inference Speed**: ~15-25 tokens/sec
- **Memory Usage**: ~2-3GB RAM
- **Features**: Multi-threading, CPU-optimized parameters
- **Latency**: ~40-65ms per token

**General**:
- **Context Length**: 1024 tokens maximum
- **Model Size**: 124M parameters



## Acknowledgments
- Based on Andrej Karpathy's nanoGPT implementation
- Trained on HuggingFace's FineWeb-edu dataset
- Uses OpenAI's GPT-2 tokenizer