# Ursa_Minor_Smashed
A GPT-2 (124M parameters) model trained from scratch on the FineWeb-edu dataset using a modern training recipe. This model reproduces the GPT-2 small architecture with updated training techniques, including Flash Attention, mixed precision training, and improved optimization strategies.
## Model Details
### Model Description
- **Developed by:** Kaileh57
- **Model type:** GPT-2 (Decoder-only Transformer)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** None (trained from scratch)
### Model Architecture
| Parameter | Value |
|-----------|-------|
| Parameters | 124M |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 1024 |
| Vocabulary Size | 50,304 |
| Activation Function | GELU (tanh approximation) |
| Position Embeddings | Learned |
| Layer Norm | Pre-normalization |
| Attention Type | Multi-head with Flash Attention |
| Weight Tying | Token embeddings tied with output projection |
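The 124M figure follows directly from the table above. As a back-of-envelope check (pure arithmetic, no framework needed; bias and LayerNorm bookkeeping is a reasonable reconstruction rather than taken from the repository's code):

```python
# Approximate parameter count for the GPT-2 small architecture above.
V, T, H, L = 50_304, 1024, 768, 12  # vocab, context, hidden, layers

token_emb = V * H                 # tied with the output projection, so counted once
pos_emb = T * H
attn = 4 * H * H + 4 * H          # QKV + output projection, with biases
mlp = 2 * 4 * H * H + 4 * H + H   # up-projection to 4H and back down, with biases
ln = 2 * 2 * H                    # two LayerNorms per block (weight + bias each)
block = attn + mlp + ln
final_ln = 2 * H

total = token_emb + pos_emb + L * block + final_ln
print(f"{total:,}")  # ≈ 124.5M
```

Note that weight tying means the output head adds no parameters beyond the token embedding matrix.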
### Training Details
- **Dataset:** FineWeb-edu (10B-token sample)
- **Training Regime:** Mixed precision (bfloat16)
- **Optimizer:** AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
- **Learning Rate:** 1.5e-3 with cosine decay
- **Batch Size:** 524,288 tokens
- **Training Steps:** 19,073 (1 epoch)
- **Warmup Steps:** 715
- **Weight Decay:** 0.1
- **Gradient Clipping:** 1.0
- **Hardware:** NVIDIA RTX A6000 (48GB)
- **Training Time:** ~2 days
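The step count and batch size above are consistent with one pass over the 10B-token sample:

```python
# Sanity check: steps × tokens-per-step should match the ~10B-token dataset.
tokens_per_step = 524_288   # 2**19
steps = 19_073
total_tokens = steps * tokens_per_step
print(f"{total_tokens / 1e9:.2f}B")  # ≈ 10.00B, i.e. one epoch over the sample
```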
### Performance
| Metric | Value |
|--------|-------|
| HellaSwag | 32.4% |
| Final Loss | 2.85 |
| Perplexity | ~17.3 |

*Note: Scores are from single-epoch training. Multi-epoch training reaches ~35% on HellaSwag.*
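The loss and perplexity rows are two views of the same number, since perplexity is the exponential of the cross-entropy loss:

```python
import math

# Perplexity = exp(cross-entropy loss), so the two reported values agree.
loss = 2.85
print(round(math.exp(loss), 1))  # 17.3
```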
## Uses
### Direct Use
This model can be used for:
- Text generation
- Research on efficient training methods
- Educational purposes to understand GPT architectures
- Fine-tuning for specific downstream tasks
### Out-of-Scope Use
- Production applications requiring high reliability
- Generation of factual information (the model may hallucinate)
- Any use case requiring a context longer than 1024 tokens
## Quick Start
### Installation
#### CPU Installation (Default)
```bash
# Clone the repository
git clone https://github.com/Kaileh57/Ursa_Minor_Smashed.git
cd Ursa_Minor_Smashed
# Set up the CPU environment
pip install -r requirements.txt
# or use the setup script: ./setup.sh
```
#### CUDA Installation (For GPU Acceleration)
If you have a CUDA-capable GPU and want to use GPU acceleration:
**Linux/macOS:**
```bash
# Use the automated CUDA setup script
chmod +x setup-cuda.sh
./setup-cuda.sh
```
**Windows:**
```batch
REM Use the Windows CUDA setup script
setup-cuda.bat
```
**Manual CUDA Installation:**
```bash
# Create a separate CUDA environment
python -m venv venv-cuda
source venv-cuda/bin/activate  # On Windows: venv-cuda\Scripts\activate
# Install the CUDA requirements
pip install -r requirements-cuda.txt
```
**CUDA Requirements:**
- NVIDIA GPU with CUDA Compute Capability 3.5 or higher
- NVIDIA drivers installed
- CUDA 11.8 or 12.1 toolkit (optional; PyTorch bundles its own CUDA runtime)
- At least 4GB of GPU memory recommended
### Basic Usage
#### Command Line Interface
**CUDA Version (GPU):**
```bash
# Basic text generation
python inference_cuda.py --prompt "Hello, I'm a language model" --max-tokens 50
# Creative writing
python inference_cuda.py --prompt "Once upon a time" --max-tokens 200 --temperature 0.9
# More focused output
python inference_cuda.py --prompt "The key to machine learning is" --max-tokens 100 --temperature 0.7 --top-k 50
```
**CPU Version:**
```bash
# Basic text generation
python inference_cpu.py --prompt "Hello, I'm a language model" --max-tokens 50
# Creative writing
python inference_cpu.py --prompt "Once upon a time" --max-tokens 120 --temperature 0.9
# More focused output
python inference_cpu.py --prompt "The key to machine learning is" --max-tokens 80 --temperature 0.7 --top-k 30
```
#### Python Interface
**CUDA Version:**
```python
from inference_cuda import generate_direct, load_model_direct

# Load the model once (requires CUDA)
model = load_model_direct("model_optimized.pt")
# Generate text with CUDA optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=100,  # higher token budget is affordable on GPU
    temperature=0.8,
    top_k=50             # higher top_k for better quality
)
print(result)
```
**CPU Version:**
```python
from inference_cpu import generate_direct, load_model_direct

# Load the model once (CPU only)
model = load_model_direct("model_optimized.pt")
# Generate text with CPU optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=80,  # lower token budget for CPU efficiency
    temperature=0.8,
    top_k=30            # lower top_k for CPU efficiency
)
print(result)
```
#### Chat Interface
**CUDA Version:**
```bash
# Start the CUDA-optimized chat
python chat_cuda.py
```
**CPU Version:**
```bash
# Start the CPU-optimized chat
python chat_cpu.py
```
## Training Procedure
The model was trained using a modern GPT training recipe, including:
- Flash Attention for efficient attention computation
- Mixed precision training with bfloat16
- Gradient accumulation to achieve large batch sizes
- TF32 for faster matrix multiplications
- Optimized vocabulary size (50,304) for better GPU utilization
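Gradient accumulation reaches the 524,288-token batch by summing gradients over several smaller forward/backward passes before each optimizer step. A minimal sketch of the arithmetic; the micro-batch size of 16 sequences here is illustrative, not taken from the training configuration:

```python
# How many micro-batches are accumulated per optimizer step.
total_batch_tokens = 524_288   # target batch size in tokens
micro_batch_seqs = 16          # assumed micro-batch size (sequences per forward pass)
seq_len = 1024                 # context length
grad_accum_steps = total_batch_tokens // (micro_batch_seqs * seq_len)
print(grad_accum_steps)  # 32
```

This is why the section below describes the accumulation steps as "dynamically calculated": the same token budget is divisible by whatever micro-batch fits in memory.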
### Training Hyperparameters
- **Learning rate schedule:** Cosine decay from 1.5e-3 to 1.5e-4
- **Gradient accumulation steps:** Dynamically calculated
- **Mixed precision:** bfloat16 with PyTorch autocast
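The schedule above (715 warmup steps, then cosine decay from 1.5e-3 to 1.5e-4 over 19,073 steps) can be sketched as a pure function of the step. Details such as whether the warmup is 0- or 1-indexed are assumptions, not taken from the repository:

```python
import math

def lr_at(step, max_lr=1.5e-3, min_lr=1.5e-4, warmup=715, max_steps=19_073):
    """Linear warmup followed by cosine decay to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / (max_steps - warmup)   # 0 → 1 over decay phase
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 → 0
    return min_lr + coeff * (max_lr - min_lr)

# Warmup end hits the peak rate; the final step lands near the floor.
print(f"{lr_at(714):.2e}, {lr_at(19_072):.2e}")
```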
## Evaluation
### Testing Data
Evaluated on:
- FineWeb-edu validation set
- HellaSwag benchmark (10,042 examples)
### Metrics
- Cross-entropy loss on the validation set
- Accuracy on HellaSwag commonsense reasoning
## Technical Specifications
### Compute Infrastructure
- **Hardware:** 1x NVIDIA RTX A6000 (48GB VRAM)
- **Software:** PyTorch 2.0+, CUDA 12.1, Flash Attention 2
### Model Initialization
- Weights initialized with σ=0.02 (scaled by 1/√(2×n_layers) for residual projections)
- Embeddings initialized with σ=0.02
- Biases initialized to zero
- LayerNorm weights initialized to 1.0, biases to 0.0
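The residual-projection scaling above keeps activation magnitudes from growing with depth: each of the 12 blocks adds two residual contributions (attention and MLP), so the projections feeding the residual stream are shrunk by 1/√(2×n_layers). Concretely:

```python
import math

# Effective std for residual projections under the scheme above.
n_layers = 12
base_std = 0.02
residual_std = base_std / math.sqrt(2 * n_layers)
print(f"{residual_std:.5f}")  # ≈ 0.00408
```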
## Citation
If you use this model, please cite:
```bibtex
@misc{ursa-minor-smashed,
  author = {Kaileh57},
  title = {Ursa Minor Smashed: Efficient GPT-2 Training},
  year = {2024},
  url = {https://github.com/Kaileh57/Ursa_Minor_Smashed}
}
```
## Available Tools
### Core Scripts
- **`inference_cuda.py`** - CUDA-optimized inference script
- **`inference_cpu.py`** - CPU-optimized inference script
- **`chat_cuda.py`** - CUDA-optimized chat interface
- **`chat_cpu.py`** - CPU-optimized chat interface
- **`benchmark_cuda.py`** - CUDA performance benchmarking tool
- **`benchmark_cpu.py`** - CPU performance benchmarking tool
- **`convert_to_gguf.py`** - Convert to GGUF format for llama.cpp
### Examples
- **`examples/basic_usage_cuda.py`** - CUDA usage examples
- **`examples/basic_usage_cpu.py`** - CPU usage examples
### Parameters
#### Generation Parameters
- **`temperature`** (0.1-1.0): Controls randomness (lower = more focused)
- **`top_k`** (1-100): Limit sampling to the top-k most likely tokens
- **`top_p`** (0.1-1.0): Nucleus sampling threshold
- **`repetition_penalty`** (1.0-2.0): Reduce repetitive output
- **`max_tokens`**: Maximum number of tokens to generate
#### Recommended Settings
- **Creative writing**: temp=0.8-0.9, top_p=0.9, top_k=50
- **Factual content**: temp=0.3-0.5, top_p=0.8, top_k=20
- **Code generation**: temp=0.2-0.4, top_p=0.7, top_k=10
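To make the `temperature` and `top_k` knobs concrete, here is an illustrative stdlib-only sketch of how they interact (not the repository's implementation): temperature rescales the logits before softmax, and top-k discards everything outside the k highest-scoring tokens before sampling.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=50):
    """Temperature scaling followed by top-k filtering over raw logits."""
    scaled = [l / temperature for l in logits]
    # Keep only tokens at or above the k-th highest scaled logit.
    cutoff = sorted(scaled, reverse=True)[:top_k][-1]
    kept = [(i, l) for i, l in enumerate(scaled) if l >= cutoff]
    # Softmax over the survivors (subtract the max for numerical stability).
    m = max(l for _, l in kept)
    weights = [math.exp(l - m) for _, l in kept]
    return random.choices([i for i, _ in kept], weights=weights, k=1)[0]

random.seed(0)
# With top_k=2, only the two highest-logit tokens (indices 0 and 1) can appear.
print(sample_top_k([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=2))
```

Lower temperatures sharpen the distribution toward the top token, which is why the focused presets above pair low `temperature` with low `top_k`.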
## Performance
**CUDA Performance (GPU)**:
- **Inference Speed**: ~50-150+ tokens/sec (depends on GPU)
- **Memory Usage**: ~2-4GB VRAM
- **Features**: CUDA autocast, torch.compile optimization
- **Latency**: ~7-20ms per token
**CPU Performance**:
- **Inference Speed**: ~15-25 tokens/sec
- **Memory Usage**: ~2-3GB RAM
- **Features**: Multi-threading, CPU-optimized parameters
- **Latency**: ~40-67ms per token
**General**:
- **Context Length**: 1024 tokens maximum
- **Model Size**: 124M parameters
## Acknowledgments
- Based on Andrej Karpathy's nanoGPT implementation
- Trained on Hugging Face's FineWeb-edu dataset
- Uses OpenAI's GPT-2 tokenizer