
Ursa_Minor_Smashed

A GPT-2 (124M parameters) model trained from scratch on the FineWeb-edu dataset using a modern training recipe. This model reproduces the GPT-2 small architecture with updated training techniques including Flash Attention, mixed precision training, and improved optimization strategies.

Model Details

Model Description

  • Developed by: Kaileh57
  • Model type: GPT-2 (Decoder-only Transformer)
  • Language(s): English
  • License: MIT
  • Finetuned from model: Not applicable (trained from scratch)

Model Architecture

| Parameter | Value |
|---|---|
| Parameters | 124M |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 1024 |
| Vocabulary Size | 50,304 |
| Activation Function | GELU (tanh approximation) |
| Position Embeddings | Learned |
| Layer Norm | Pre-normalization |
| Attention Type | Multi-head with Flash Attention |
| Weight Tying | Token embeddings tied with output projection |
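
As a sanity check on the 124M figure, the parameter count implied by the table (with tied embeddings, so the output head adds no extra weights) can be tallied directly. The per-block breakdown below is the standard GPT-2 layout, not taken from this repo's code:

```python
# Rough GPT-2 parameter count from the architecture table above.
n_layer, d, vocab, ctx = 12, 768, 50_304, 1024

wte = vocab * d            # token embeddings (tied with output projection)
wpe = ctx * d              # learned position embeddings
per_block = (
    d * 3 * d + 3 * d      # fused QKV projection (weights + biases)
    + d * d + d            # attention output projection
    + d * 4 * d + 4 * d    # MLP up-projection
    + 4 * d * d + d        # MLP down-projection
    + 4 * d                # two LayerNorms (scale + bias each)
)
final_ln = 2 * d
total = wte + wpe + n_layer * per_block + final_ln
print(f"{total / 1e6:.1f}M parameters")  # ≈ 124.5M
```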

Training Details

  • Dataset: FineWeb-edu (10B token sample)
  • Training Regime: Mixed precision (bfloat16)
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
  • Learning Rate: 1.5e-3 with cosine decay
  • Batch Size: 524,288 tokens
  • Training Steps: 19,073 (1 epoch)
  • Warmup Steps: 715
  • Weight Decay: 0.1
  • Gradient Clipping: 1.0
  • Hardware: NVIDIA RTX A6000 (48GB)
  • Training Time: ~2 days
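
These numbers are internally consistent: 19,073 steps at 524,288 tokens per step is one pass over the 10B-token sample, and since a 524,288-token batch cannot fit in memory at once, it is assembled via gradient accumulation. The micro-batch size below is illustrative, not the repo's actual setting:

```python
batch_tokens = 524_288        # tokens per optimizer step (2**19)
steps = 19_073
seq_len = 1024

total_tokens = batch_tokens * steps
print(f"{total_tokens / 1e9:.2f}B tokens")    # ≈ 10.00B, i.e. one epoch

# With a hypothetical micro-batch of 16 sequences, each optimizer step
# accumulates gradients over this many forward/backward passes:
micro_batch = 16              # sequences per forward pass (assumed)
grad_accum_steps = batch_tokens // (micro_batch * seq_len)
print(grad_accum_steps)       # 32
```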

Performance

| Metric | Value |
|---|---|
| HellaSwag accuracy | 32.4% |
| Final validation loss | 2.85 |
| Validation perplexity | ~17.3 |

Note: Scores are from single-epoch training. Multi-epoch training reaches ~35% on HellaSwag.
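
The perplexity figure follows directly from the loss: perplexity is the exponential of the per-token cross-entropy. A quick check:

```python
import math

final_loss = 2.85                  # cross-entropy (nats/token) from the table
perplexity = math.exp(final_loss)
print(f"{perplexity:.1f}")         # ≈ 17.3
```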

Uses

Direct Use

This model can be used for:

  • Text generation
  • Research on efficient training methods
  • Educational purposes to understand GPT architectures
  • Fine-tuning for specific downstream tasks

Out-of-Scope Use

  • Production applications requiring high reliability
  • Generation of factual information (model may hallucinate)
  • Any use case requiring larger context than 1024 tokens

Quick Start

Installation

CPU Installation (Default)

# Clone the repository
git clone https://github.com/Kaileh57/Ursa_Minor_Smashed.git
cd Ursa_Minor_Smashed

# Set up CPU environment
pip install -r requirements.txt
# or use the setup script: ./setup.sh

CUDA Installation (For GPU Acceleration)

If you have a CUDA-capable GPU and want to use GPU acceleration:

Linux/macOS:

# Use the automated CUDA setup script
chmod +x setup-cuda.sh
./setup-cuda.sh

Windows:

# Use the Windows CUDA setup script
setup-cuda.bat

Manual CUDA Installation:

# Create separate CUDA environment
python -m venv venv-cuda
source venv-cuda/bin/activate  # On Windows: venv-cuda\Scripts\activate

# Install CUDA requirements
pip install -r requirements-cuda.txt

CUDA Requirements:

  • NVIDIA GPU with CUDA Compute Capability 3.5 or higher
  • NVIDIA drivers installed
  • CUDA 11.8 or 12.1 toolkit (optional, PyTorch includes CUDA runtime)
  • At least 4GB GPU memory recommended

Basic Usage

Command Line Interface

CUDA Version (GPU):

# Basic text generation
python inference_cuda.py --prompt "Hello, I'm a language model" --max-tokens 50

# Creative writing
python inference_cuda.py --prompt "Once upon a time" --max-tokens 200 --temperature 0.9

# More focused output
python inference_cuda.py --prompt "The key to machine learning is" --max-tokens 100 --temperature 0.7 --top-k 50

CPU Version:

# Basic text generation
python inference_cpu.py --prompt "Hello, I'm a language model" --max-tokens 50

# Creative writing
python inference_cpu.py --prompt "Once upon a time" --max-tokens 120 --temperature 0.9

# More focused output
python inference_cpu.py --prompt "The key to machine learning is" --max-tokens 80 --temperature 0.7 --top-k 30

Python Interface

CUDA Version:

from inference_cuda import generate_direct, load_model_direct

# Load model once (requires CUDA)
model = load_model_direct("model_optimized.pt")

# Generate text with CUDA optimizations
result = generate_direct(
    model, 
    "Hello, I'm a language model", 
    max_new_tokens=100,  # Higher tokens for GPU
    temperature=0.8,
    top_k=50            # Higher top_k for better quality
)
print(result)

CPU Version:

from inference_cpu import generate_direct, load_model_direct

# Load model once (CPU only)
model = load_model_direct("model_optimized.pt")

# Generate text with CPU optimizations
result = generate_direct(
    model, 
    "Hello, I'm a language model", 
    max_new_tokens=80,   # Lower tokens for CPU efficiency
    temperature=0.8,
    top_k=30            # Lower top_k for CPU efficiency
)
print(result)

Chat Interface

CUDA Version:

# Start CUDA-optimized chat
python chat_cuda.py

CPU Version:

# Start CPU-optimized chat
python chat_cpu.py

Training Procedure

The model was trained using a modern GPT training recipe including:

  • Flash Attention for efficient attention computation
  • Mixed precision training with bfloat16
  • Gradient accumulation to achieve large batch sizes
  • TF32 for faster matrix multiplications
  • Optimized vocabulary size (50,304) for better GPU utilization

Training Hyperparameters

  • Learning rate schedule: Cosine decay from 1.5e-3 to 1.5e-4
  • Gradient accumulation steps: Computed at runtime from the 524,288-token batch target and the per-device micro-batch size
  • Mixed precision: bfloat16 with PyTorch autocast
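
Putting the listed numbers together (linear warmup over 715 steps to 1.5e-3, then cosine decay to 1.5e-4 over the remaining steps), the schedule can be sketched as follows; the exact warmup shape is an assumption and the repo's code may differ in off-by-one details:

```python
import math

MAX_LR, MIN_LR = 1.5e-3, 1.5e-4
WARMUP, MAX_STEPS = 715, 19_073

def get_lr(step):
    # Linear warmup to the peak learning rate.
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP
    # Cosine decay from MAX_LR down to MIN_LR.
    ratio = (step - WARMUP) / (MAX_STEPS - WARMUP)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))   # goes 1 -> 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

print(get_lr(714))     # peak: 1.5e-3
print(get_lr(19_072))  # near the floor: ~1.5e-4
```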

Evaluation

Testing Data

Evaluated on:

  • FineWeb-edu validation set
  • HellaSwag benchmark (10,042 examples)

Metrics

  • Cross-entropy loss on validation set
  • Accuracy on HellaSwag commonsense reasoning

Technical Specifications

Compute Infrastructure

  • Hardware: 1x NVIDIA RTX A6000 (48GB VRAM)
  • Software: PyTorch 2.0+, CUDA 12.1, Flash Attention 2

Model Initialization

  • Weights initialized with σ=0.02 (scaled by 1/√(2×n_layers) for residual projections)
  • Embeddings initialized with σ=0.02
  • Biases initialized to zero
  • LayerNorm weights initialized to 1.0, biases to 0.0
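
The residual-projection scaling exists because each of the 12 blocks adds two projections (attention output and MLP output) into the residual stream; shrinking their init std by 1/√(2×n_layers) keeps the stream's variance roughly constant at depth. Concretely:

```python
import math

SIGMA = 0.02
N_LAYER = 12

# 2 residual-stream projections per block => 2 * n_layers contributions total.
residual_sigma = SIGMA / math.sqrt(2 * N_LAYER)
print(f"{residual_sigma:.5f}")   # ≈ 0.00408
```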

Citation

If you use this model, please cite:

@misc{ursa-minor-smashed,
  author = {Kaileh57},
  title = {Ursa Minor Smashed: Efficient GPT-2 Training},
  year = {2024},
  url = {https://github.com/Kaileh57/Ursa_Minor_Smashed}
}

Available Tools

Core Scripts

  • inference_cuda.py - CUDA-optimized inference script
  • inference_cpu.py - CPU-optimized inference script
  • chat_cuda.py - CUDA-optimized chat interface
  • chat_cpu.py - CPU-optimized chat interface
  • benchmark_cuda.py - CUDA performance benchmarking tool
  • benchmark_cpu.py - CPU performance benchmarking tool
  • convert_to_gguf.py - Convert to GGUF format for llama.cpp

Examples

  • examples/basic_usage_cuda.py - CUDA usage examples
  • examples/basic_usage_cpu.py - CPU usage examples

Parameters

Generation Parameters

  • temperature (0.1-1.0): Controls randomness (lower = more focused)
  • top_k (1-100): Limit to top-k most likely tokens
  • top_p (0.1-1.0): Nucleus sampling threshold
  • repetition_penalty (1.0-2.0): Reduce repetitive output
  • max_tokens: Maximum tokens to generate
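
Mechanically, temperature rescales the logits before the softmax (lower values sharpen the distribution) and top-k zeroes out everything outside the k most likely tokens. A dependency-free sketch of that sampling step (not the repo's implementation):

```python
import math, random

def sample(logits, temperature=0.8, top_k=50, rng=random):
    # Temperature: divide logits before softmax (lower => more focused).
    scaled = [l / temperature for l in logits]
    # Top-k: keep only the k highest logits, drop the rest.
    cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
    kept = [(i, l) for i, l in enumerate(scaled) if l >= cutoff]
    # Softmax over the surviving tokens (subtract max for stability).
    m = max(l for _, l in kept)
    weights = [math.exp(l - m) for _, l in kept]
    # Draw one token index proportionally to its weight.
    r = rng.random() * sum(weights)
    for (i, _), w in zip(kept, weights):
        r -= w
        if r <= 0:
            return i
    return kept[-1][0]

token = sample([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=2)
print(token)  # 0 or 1: only the two highest-logit tokens survive top-k
```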

Recommended Settings

  • Creative writing: temp=0.8-0.9, top_p=0.9, top_k=50
  • Factual content: temp=0.3-0.5, top_p=0.8, top_k=20
  • Code generation: temp=0.2-0.4, top_p=0.7, top_k=10

Performance

CUDA Performance (GPU):

  • Inference Speed: ~50-150+ tokens/sec (depends on GPU)
  • Memory Usage: ~2-4GB VRAM
  • Features: CUDA autocast, torch.compile optimization
  • Latency: ~10-20ms per token

CPU Performance:

  • Inference Speed: ~15-25 tokens/sec
  • Memory Usage: ~2-3GB RAM
  • Features: Multi-threading, CPU-optimized parameters
  • Latency: ~40-65ms per token

General:

  • Context Length: 1024 tokens maximum
  • Model Size: 124M parameters

Acknowledgments

  • Based on Andrej Karpathy's nanoGPT implementation
  • Trained on HuggingFace's FineWeb-edu dataset
  • Uses OpenAI's GPT-2 tokenizer