# Ursa_Minor_Smashed
A GPT-2 (124M-parameter) model trained from scratch on the FineWeb-edu dataset. It reproduces the GPT-2 small architecture with a modern training recipe, including Flash Attention, mixed-precision training, and improved optimization strategies.
## Model Details
### Model Description
- **Developed by:** Kaileh57
- **Model type:** GPT-2 (Decoder-only Transformer)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** None (trained from scratch)
### Model Architecture
| Parameter | Value |
|-----------|-------|
| Parameters | 124M |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 1024 |
| Vocabulary Size | 50,304 |
| Activation Function | GELU (tanh approximation) |
| Position Embeddings | Learned |
| Layer Norm | Pre-normalization |
| Attention Type | Multi-head with Flash Attention |
| Weight Tying | Token embeddings tied with output projection |
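As a sanity check, the parameter count follows directly from the shapes in the table above. A minimal sketch in plain Python (variable names like `n_layer` and `d_model` are illustrative, not the repo's actual config class):

```python
# Rough parameter count for the architecture in the table above.
n_layer, d_model, vocab, ctx = 12, 768, 50_304, 1024
d_ff = 4 * d_model  # standard GPT-2 MLP expansion

wte = vocab * d_model  # token embeddings (tied with the output head)
wpe = ctx * d_model    # learned position embeddings
per_block = (
    d_model * 3 * d_model + 3 * d_model  # fused QKV projection + bias
    + d_model * d_model + d_model        # attention output projection
    + d_model * d_ff + d_ff              # MLP up-projection
    + d_ff * d_model + d_model           # MLP down-projection
    + 4 * d_model                        # two LayerNorms (weight + bias each)
)
final_ln = 2 * d_model
total = wte + wpe + n_layer * per_block + final_ln
print(f"{total / 1e6:.1f}M parameters")  # ≈ 124.5M
```

Because of weight tying, the output projection adds no parameters beyond `wte`.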
### Training Details
- **Dataset:** FineWeb-edu (10B token sample)
- **Training Regime:** Mixed precision (bfloat16)
- **Optimizer:** AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
- **Learning Rate:** 1.5e-3 with cosine decay
- **Batch Size:** 524,288 tokens
- **Training Steps:** 19,073 (1 epoch)
- **Warmup Steps:** 715
- **Weight Decay:** 0.1
- **Gradient Clipping:** 1.0
- **Hardware:** NVIDIA RTX A6000 (48GB)
- **Training Time:** ~2 days
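The 524,288-token batch (2¹⁹ tokens) is reached via gradient accumulation, and 19,073 steps × 524,288 tokens ≈ 10B tokens, i.e. one pass over the sample. A sketch of the accumulation arithmetic (the micro-batch size here is illustrative):

```python
# Effective batch of 524,288 tokens (2**19) built up via gradient accumulation.
total_batch_tokens = 524_288
micro_batch, seq_len = 16, 1024  # micro_batch is illustrative, not the repo's value

assert total_batch_tokens % (micro_batch * seq_len) == 0
grad_accum_steps = total_batch_tokens // (micro_batch * seq_len)
print(grad_accum_steps)  # 32 forward/backward passes per optimizer step
```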
### Performance
| Metric | Value |
|--------|-------|
| HellaSwag | 32.4% |
| Final Loss | 2.85 |
| Perplexity | ~17.3 |
*Note: Scores are from single epoch training. Multi-epoch training reaches ~35% on HellaSwag.*
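The perplexity figure follows directly from the final loss, since perplexity is the exponential of the mean cross-entropy:

```python
import math

# Perplexity is the exponential of the mean cross-entropy loss.
final_loss = 2.85
perplexity = math.exp(final_loss)
print(f"{perplexity:.1f}")  # 17.3
```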
## Uses
### Direct Use
This model can be used for:
- Text generation
- Research on efficient training methods
- Educational purposes to understand GPT architectures
- Fine-tuning for specific downstream tasks
### Out-of-Scope Use
- Production applications requiring high reliability
- Generation of factual information (model may hallucinate)
- Any use case requiring larger context than 1024 tokens
## Quick Start
### Installation
#### CPU Installation (Default)
```bash
# Clone the repository
git clone https://github.com/Kaileh57/Ursa_Minor_Smashed.git
cd Ursa_Minor_Smashed
# Set up CPU environment
pip install -r requirements.txt
# or use the setup script: ./setup.sh
```
#### CUDA Installation (For GPU Acceleration)
If you have a CUDA-capable GPU and want to use GPU acceleration:
**Linux/macOS:**
```bash
# Use the automated CUDA setup script
chmod +x setup-cuda.sh
./setup-cuda.sh
```
**Windows:**
```batch
# Use the Windows CUDA setup script
setup-cuda.bat
```
**Manual CUDA Installation:**
```bash
# Create separate CUDA environment
python -m venv venv-cuda
source venv-cuda/bin/activate # On Windows: venv-cuda\Scripts\activate
# Install CUDA requirements
pip install -r requirements-cuda.txt
```
**CUDA Requirements:**
- NVIDIA GPU with CUDA Compute Capability 3.5 or higher
- NVIDIA drivers installed
- CUDA 11.8 or 12.1 toolkit (optional, PyTorch includes CUDA runtime)
- At least 4GB GPU memory recommended
### Basic Usage
#### Command Line Interface
**CUDA Version (GPU):**
```bash
# Basic text generation
python inference_cuda.py --prompt "Hello, I'm a language model" --max-tokens 50
# Creative writing
python inference_cuda.py --prompt "Once upon a time" --max-tokens 200 --temperature 0.9
# More focused output
python inference_cuda.py --prompt "The key to machine learning is" --max-tokens 100 --temperature 0.7 --top-k 50
```
**CPU Version:**
```bash
# Basic text generation
python inference_cpu.py --prompt "Hello, I'm a language model" --max-tokens 50
# Creative writing
python inference_cpu.py --prompt "Once upon a time" --max-tokens 120 --temperature 0.9
# More focused output
python inference_cpu.py --prompt "The key to machine learning is" --max-tokens 80 --temperature 0.7 --top-k 30
```
#### Python Interface
**CUDA Version:**
```python
from inference_cuda import generate_direct, load_model_direct

# Load model once (requires CUDA)
model = load_model_direct("model_optimized.pt")

# Generate text with CUDA optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=100,  # higher token budget is fine on GPU
    temperature=0.8,
    top_k=50,  # higher top_k for better quality
)
print(result)
```
**CPU Version:**
```python
from inference_cpu import generate_direct, load_model_direct

# Load model once (CPU only)
model = load_model_direct("model_optimized.pt")

# Generate text with CPU optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=80,  # lower token budget for CPU efficiency
    temperature=0.8,
    top_k=30,  # lower top_k for CPU efficiency
)
print(result)
```
#### Chat Interface
**CUDA Version:**
```bash
# Start CUDA-optimized chat
python chat_cuda.py
```
**CPU Version:**
```bash
# Start CPU-optimized chat
python chat_cpu.py
```
## Training Procedure
The model was trained using a modern GPT training recipe including:
- Flash Attention for efficient attention computation
- Mixed precision training with bfloat16
- Gradient accumulation to achieve large batch sizes
- TF32 for faster matrix multiplications
- Vocabulary size padded to 50,304 (a multiple of 128) for better GPU utilization
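The 50,304 vocabulary is GPT-2's 50,257 tokens rounded up to the nearest multiple of 128, which keeps the embedding and output matrices friendly to GPU kernels. The rounding is just:

```python
# Pad a vocabulary size up to the nearest multiple of 128.
def pad_vocab(vocab_size: int, multiple: int = 128) -> int:
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(pad_vocab(50_257))  # 50304
```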
### Training Hyperparameters
- **Learning rate schedule:** Cosine decay from 1.5e-3 to 1.5e-4
- **Gradient accumulation steps:** Computed at runtime so the effective batch is 524,288 tokens regardless of micro-batch size
- **Mixed precision:** bfloat16 with PyTorch autocast
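The warmup-plus-cosine schedule can be sketched from the numbers in this card (715 warmup steps, 19,073 total steps, peak 1.5e-3 decaying to 1.5e-4); this is a sketch of the standard recipe, not the repo's exact code:

```python
import math

MAX_LR, MIN_LR = 1.5e-3, 1.5e-4
WARMUP_STEPS, MAX_STEPS = 715, 19_073

def get_lr(step: int) -> float:
    # Linear warmup up to the peak learning rate.
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    # After training ends, hold at the floor.
    if step >= MAX_STEPS:
        return MIN_LR
    # Cosine decay from MAX_LR down to MIN_LR.
    decay_ratio = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```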
## Evaluation
### Testing Data
Evaluated on:
- FineWeb-edu validation set
- HellaSwag benchmark (10,042 examples)
### Metrics
- Cross-entropy loss on validation set
- Accuracy on HellaSwag commonsense reasoning
## Technical Specifications
### Compute Infrastructure
- **Hardware:** 1x NVIDIA RTX A6000 (48GB VRAM)
- **Software:** PyTorch 2.0+, CUDA 12.1, Flash Attention 2
### Model Initialization
- Weights initialized with σ=0.02 (scaled by 1/√(2×n_layers) for residual projections)
- Embeddings initialized with σ=0.02
- Biases initialized to zero
- LayerNorm weights initialized to 1.0, biases to 0.0
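The residual-projection scaling above works out to a concrete number: each residual stream is fed by 2 × n_layers projections (one attention, one MLP per block), so their init std is shrunk accordingly:

```python
import math

# Base init std, scaled down for residual projections so activation variance
# stays roughly constant as 2 * n_layers residual branches accumulate.
base_std = 0.02
n_layers = 12
residual_std = base_std / math.sqrt(2 * n_layers)
print(f"{residual_std:.5f}")  # ≈ 0.00408
```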
## Citation
If you use this model, please cite:
```bibtex
@misc{ursa-minor-smashed,
  author = {Kaileh57},
  title  = {Ursa Minor Smashed: Efficient GPT-2 Training},
  year   = {2024},
  url    = {https://github.com/Kaileh57/Ursa_Minor_Smashed}
}
```
## Available Tools
### Core Scripts
- **`inference_cuda.py`** - CUDA-optimized inference script
- **`inference_cpu.py`** - CPU-optimized inference script
- **`chat_cuda.py`** - CUDA-optimized chat interface
- **`chat_cpu.py`** - CPU-optimized chat interface
- **`benchmark_cuda.py`** - CUDA performance benchmarking tool
- **`benchmark_cpu.py`** - CPU performance benchmarking tool
- **`convert_to_gguf.py`** - Convert to GGUF format for llama.cpp
### Examples
- **`examples/basic_usage_cuda.py`** - CUDA usage examples
- **`examples/basic_usage_cpu.py`** - CPU usage examples
### Parameters
#### Generation Parameters
- **`temperature`** (0.1-1.0): Controls randomness (lower = more focused)
- **`top_k`** (1-100): Limit to top-k most likely tokens
- **`top_p`** (0.1-1.0): Nucleus sampling threshold
- **`repetition_penalty`** (1.0-2.0): Reduce repetitive output
- **`max_tokens`**: Maximum tokens to generate
#### Recommended Settings
- **Creative writing**: temp=0.8-0.9, top_p=0.9, top_k=50
- **Factual content**: temp=0.3-0.5, top_p=0.8, top_k=20
- **Code generation**: temp=0.2-0.4, top_p=0.7, top_k=10
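How `temperature` and `top_k` interact can be sketched in plain Python (the logits here are made up, and the repo's scripts do this on GPU tensors; this is only an illustration of the sampling logic):

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Temperature-scaled top-k sampling over a list of raw logits."""
    scaled = [l / temperature for l in logits]
    # Keep only the top_k largest logits; mask the rest out entirely.
    cutoff = sorted(scaled, reverse=True)[:top_k][-1]
    masked = [s if s >= cutoff else float("-inf") for s in scaled]
    # Softmax over the surviving logits (max-subtracted for stability).
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token index from the resulting distribution.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```

With `top_k=1` this is greedy decoding; lowering `temperature` sharpens the distribution toward the most likely tokens.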
## Performance
**CUDA Performance (GPU)**:
- **Inference Speed**: ~50-150+ tokens/sec (depends on GPU)
- **Memory Usage**: ~2-4GB VRAM
- **Features**: CUDA autocast, torch.compile optimization
- **Latency**: ~10-20ms per token
**CPU Performance**:
- **Inference Speed**: ~15-25 tokens/sec
- **Memory Usage**: ~2-3GB RAM
- **Features**: Multi-threading, CPU-optimized parameters
- **Latency**: ~40-65ms per token
**General**:
- **Context Length**: 1024 tokens maximum
- **Model Size**: 124M parameters
## Acknowledgments
- Based on Andrej Karpathy's nanoGPT implementation
- Trained on HuggingFace's FineWeb-edu dataset
- Uses OpenAI's GPT-2 tokenizer