Instructions to use Sculptor-AI/Ursa_Minor_Smashed with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use Sculptor-AI/Ursa_Minor_Smashed with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Sculptor-AI/Ursa_Minor_Smashed",
    filename="model_optimized_f32.gguf",
)
output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use Sculptor-AI/Ursa_Minor_Smashed with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Sculptor-AI/Ursa_Minor_Smashed:F32
# Run inference directly in the terminal:
llama-cli -hf Sculptor-AI/Ursa_Minor_Smashed:F32
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Sculptor-AI/Ursa_Minor_Smashed:F32
# Run inference directly in the terminal:
llama-cli -hf Sculptor-AI/Ursa_Minor_Smashed:F32
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Sculptor-AI/Ursa_Minor_Smashed:F32
# Run inference directly in the terminal:
./llama-cli -hf Sculptor-AI/Ursa_Minor_Smashed:F32
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Sculptor-AI/Ursa_Minor_Smashed:F32
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Sculptor-AI/Ursa_Minor_Smashed:F32
Use Docker
docker model run hf.co/Sculptor-AI/Ursa_Minor_Smashed:F32
- LM Studio
- Jan
- Ollama
How to use Sculptor-AI/Ursa_Minor_Smashed with Ollama:
ollama run hf.co/Sculptor-AI/Ursa_Minor_Smashed:F32
- Unsloth Studio
How to use Sculptor-AI/Ursa_Minor_Smashed with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Sculptor-AI/Ursa_Minor_Smashed to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Sculptor-AI/Ursa_Minor_Smashed to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Sculptor-AI/Ursa_Minor_Smashed to start chatting
- Docker Model Runner
How to use Sculptor-AI/Ursa_Minor_Smashed with Docker Model Runner:
docker model run hf.co/Sculptor-AI/Ursa_Minor_Smashed:F32
- Lemonade
How to use Sculptor-AI/Ursa_Minor_Smashed with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull Sculptor-AI/Ursa_Minor_Smashed:F32
Run and chat with the model
lemonade run user.Ursa_Minor_Smashed-F32
List all available models
lemonade list
Ursa_Minor_Smashed
A GPT-2 (124M parameters) model trained from scratch on the FineWeb-edu dataset using a modern training recipe. This model reproduces the GPT-2 small architecture with updated training techniques including Flash Attention, mixed precision training, and improved optimization strategies.
Model Details
Model Description
- Developed by: Kaileh57
- Model type: GPT-2 (Decoder-only Transformer)
- Language(s): English
- License: MIT
- Finetuned from model: None (trained from scratch)
Model Architecture
| Parameter | Value |
|---|---|
| Parameters | 124M |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 1024 |
| Vocabulary Size | 50,304 |
| Activation Function | GELU (tanh approximation) |
| Position Embeddings | Learned |
| Layer Norm | Pre-normalization |
| Attention Type | Multi-head with Flash Attention |
| Weight Tying | Token embeddings tied with output projection |
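As a rough cross-check of the 124M figure, the values in the table can be plugged into the standard GPT-2 layout. The sketch below assumes details the table does not state: a 4x MLP expansion, bias terms on every linear layer, and LayerNorm with both weight and bias, as in the original GPT-2. The function name is illustrative and is not taken from the repository.

```python
def gpt2_param_count(n_layer=12, n_embd=768, block_size=1024, vocab_size=50304):
    """Count parameters of a GPT-2 style model with tied input/output embeddings."""
    tok_emb = vocab_size * n_embd      # token embeddings, shared with the output projection
    pos_emb = block_size * n_embd      # learned position embeddings
    # Attention: fused QKV projection plus output projection (weights + biases).
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: assumed 4x expansion, up-projection plus down-projection (weights + biases).
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    norms = 2 * (2 * n_embd)           # two LayerNorms per block, weight + bias each
    final_norm = 2 * n_embd            # LayerNorm before the output projection
    return tok_emb + pos_emb + n_layer * (attn + mlp + norms) + final_norm


total = gpt2_param_count()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the count comes to roughly 124.5M, which is consistent with the stated 124M. Because of weight tying, the output projection adds no parameters of its own.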
Training Details
- Dataset: FineWeb-edu (10B token sample)
- Training Regime: Mixed precision (bfloat16)
- Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
- Learning Rate: 1.5e-3 with cosine decay
- Batch Size: 524,288 tokens
- Training Steps: 19,073 (1 epoch)
- Warmup Steps: 715
- Weight Decay: 0.1
- Gradient Clipping: 1.0
- Hardware: NVIDIA RTX A6000 (48GB)
- Training Time: ~2 days
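The batch size and step count above are consistent with a single pass over the 10B-token sample. A short check of the arithmetic, using only the numbers listed (the variable names are illustrative):

```python
TOKENS_PER_STEP = 524_288   # batch size in tokens (2**19)
TRAINING_STEPS = 19_073
CONTEXT_LENGTH = 1024

total_tokens = TOKENS_PER_STEP * TRAINING_STEPS
sequences_per_step = TOKENS_PER_STEP // CONTEXT_LENGTH

print(f"{total_tokens:,} tokens seen (~{total_tokens / 1e9:.2f}B)")
print(f"{sequences_per_step} full-length sequences per optimizer step")
```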
Performance
| Benchmark | Score |
|---|---|
| HellaSwag | 32.4% |
| Final Loss | 2.85 |
| Perplexity | ~17.3 |
Note: Scores are from single epoch training. Multi-epoch training reaches ~35% on HellaSwag.
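The perplexity in the table follows from the final loss, assuming the loss is the mean cross-entropy per token measured in nats (the usual PyTorch convention):

```python
import math

final_loss = 2.85                    # mean cross-entropy per token, in nats (assumed)
perplexity = math.exp(final_loss)    # perplexity = exp(loss)

print(f"perplexity ~ {perplexity:.1f}")
```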
Uses
Direct Use
This model can be used for:
- Text generation
- Research on efficient training methods
- Educational purposes to understand GPT architectures
- Fine-tuning for specific downstream tasks
Out-of-Scope Use
- Production applications requiring high reliability
- Generation of factual information (model may hallucinate)
- Any use case requiring larger context than 1024 tokens
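When a prompt risks exceeding the 1024-token window, one common workaround is to keep only the most recent tokens so that the prompt plus the requested completion still fits. This is a minimal sketch of that idea; `fit_to_context` is a hypothetical helper, not part of the repository's scripts, and it operates on already-tokenized ids.

```python
CONTEXT_LENGTH = 1024  # maximum sequence length supported by the model


def fit_to_context(prompt_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Drop the oldest prompt tokens so prompt + generation fits in the context window."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Negative slicing keeps the last `budget` ids; shorter prompts are returned unchanged.
    return prompt_ids[-budget:]


trimmed = fit_to_context(list(range(2000)), max_new_tokens=100)
print(len(trimmed))
```

Truncation discards earlier context entirely, so it only helps when the most recent text matters most.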
Quick Start
Installation
CPU Installation (Default)
# Clone the repository
git clone https://github.com/Kaileh57/Ursa_Minor_Smashed.git
cd Ursa_Minor_Smashed
# Set up CPU environment
pip install -r requirements.txt
# or use the setup script: ./setup.sh
CUDA Installation (For GPU Acceleration)
If you have a CUDA-capable GPU and want to use GPU acceleration:
Linux/macOS:
# Use the automated CUDA setup script
chmod +x setup-cuda.sh
./setup-cuda.sh
Windows:
# Use the Windows CUDA setup script
setup-cuda.bat
Manual CUDA Installation:
# Create separate CUDA environment
python -m venv venv-cuda
source venv-cuda/bin/activate # On Windows: venv-cuda\Scripts\activate
# Install CUDA requirements
pip install -r requirements-cuda.txt
CUDA Requirements:
- NVIDIA GPU with CUDA Compute Capability 3.5 or higher
- NVIDIA drivers installed
- CUDA 11.8 or 12.1 toolkit (optional, PyTorch includes CUDA runtime)
- At least 4GB GPU memory recommended
Basic Usage
Command Line Interface
CUDA Version (GPU):
# Basic text generation
python inference_cuda.py --prompt "Hello, I'm a language model" --max-tokens 50
# Creative writing
python inference_cuda.py --prompt "Once upon a time" --max-tokens 200 --temperature 0.9
# More focused output
python inference_cuda.py --prompt "The key to machine learning is" --max-tokens 100 --temperature 0.7 --top-k 50
CPU Version:
# Basic text generation
python inference_cpu.py --prompt "Hello, I'm a language model" --max-tokens 50
# Creative writing
python inference_cpu.py --prompt "Once upon a time" --max-tokens 120 --temperature 0.9
# More focused output
python inference_cpu.py --prompt "The key to machine learning is" --max-tokens 80 --temperature 0.7 --top-k 30
Python Interface
CUDA Version:
from inference_cuda import generate_direct, load_model_direct
# Load model once (requires CUDA)
model = load_model_direct("model_optimized.pt")
# Generate text with CUDA optimizations
result = generate_direct(
model,
"Hello, I'm a language model",
max_new_tokens=100, # Higher tokens for GPU
temperature=0.8,
top_k=50 # Higher top_k for better quality
)
print(result)
CPU Version:
from inference_cpu import generate_direct, load_model_direct
# Load model once (CPU only)
model = load_model_direct("model_optimized.pt")
# Generate text with CPU optimizations
result = generate_direct(
model,
"Hello, I'm a language model",
max_new_tokens=80, # Lower tokens for CPU efficiency
temperature=0.8,
top_k=30 # Lower top_k for CPU efficiency
)
print(result)
Chat Interface
CUDA Version:
# Start CUDA-optimized chat
python chat_cuda.py
CPU Version:
# Start CPU-optimized chat
python chat_cpu.py
Training Procedure
The model was trained using a modern GPT training recipe including:
- Flash Attention for efficient attention computation
- Mixed precision training with bfloat16
- Gradient accumulation to achieve large batch sizes
- TF32 for faster matrix multiplications
- Optimized vocabulary size (50,304) for better GPU utilization
Training Hyperparameters
- Learning rate schedule: Cosine decay from 1.5e-3 to 1.5e-4
- Gradient accumulation steps: Dynamically calculated
- Mixed precision: bfloat16 with PyTorch autocast
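The schedule and the accumulation count can be sketched from the numbers given in this card. Two points are assumptions rather than facts from the source: the warmup is taken to be linear, and the micro-batch size of 16 sequences is a placeholder, since the card only says the accumulation count is computed dynamically.

```python
import math

MAX_LR = 1.5e-3
MIN_LR = 1.5e-4
WARMUP_STEPS = 715
MAX_STEPS = 19_073


def get_lr(step):
    """Learning rate at a given optimizer step: warmup (assumed linear), then cosine decay."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= MAX_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


def grad_accum_steps(total_batch_tokens=524_288, micro_batch=16, seq_len=1024, world_size=1):
    """Micro-steps needed per optimizer step; micro_batch=16 is a hypothetical value."""
    tokens_per_micro_step = micro_batch * seq_len * world_size
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("total batch must be divisible by micro_batch * seq_len * world_size")
    return total_batch_tokens // tokens_per_micro_step


for step in (0, 714, 9_894, 19_072):
    print(step, f"{get_lr(step):.2e}")
print("accumulation steps:", grad_accum_steps())
```

With a single GPU, a smaller micro-batch simply raises the accumulation count while the effective batch stays at 524,288 tokens.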
Evaluation
Testing Data
Evaluated on:
- FineWeb-edu validation set
- HellaSwag benchmark (10,042 examples)
Metrics
- Cross-entropy loss on validation set
- Accuracy on HellaSwag commonsense reasoning
Technical Specifications
Compute Infrastructure
- Hardware: 1x NVIDIA RTX A6000 (48GB VRAM)
- Software: PyTorch 2.0+, CUDA 12.1, Flash Attention 2
Model Initialization
- Weights initialized with σ=0.02 (scaled by 1/√(2×n_layers) for residual projections)
- Embeddings initialized with σ=0.02
- Biases initialized to zero
- LayerNorm weights initialized to 1.0, biases to 0.0
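For the 12-layer configuration, the scaled standard deviation for residual projections works out as follows (a small arithmetic sketch; variable names are illustrative):

```python
import math

N_LAYERS = 12
BASE_STD = 0.02

# Each block adds two residual contributions (attention and MLP), hence the factor of 2.
residual_std = BASE_STD / math.sqrt(2 * N_LAYERS)

print(f"base std: {BASE_STD}, residual projection std: {residual_std:.5f}")
```

Shrinking the residual projections this way keeps the variance of the residual stream from growing with depth at initialization.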
Citation
If you use this model, please cite:
@misc{ursa-minor-smashed,
author = {Kaileh57},
title = {Ursa Minor Smashed: Efficient GPT-2 Training},
year = {2024},
url = {https://github.com/Kaileh57/Ursa_Minor_Smashed}
}
Available Tools
Core Scripts
- inference_cuda.py: CUDA-optimized inference script
- inference_cpu.py: CPU-optimized inference script
- chat_cuda.py: CUDA-optimized chat interface
- chat_cpu.py: CPU-optimized chat interface
- benchmark_cuda.py: CUDA performance benchmarking tool
- benchmark_cpu.py: CPU performance benchmarking tool
- convert_to_gguf.py: Convert to GGUF format for llama.cpp
Examples
- examples/basic_usage_cuda.py: CUDA usage examples
- examples/basic_usage_cpu.py: CPU usage examples
Parameters
Generation Parameters
- temperature (0.1-1.0): Controls randomness (lower = more focused)
- top_k (1-100): Limit to top-k most likely tokens
- top_p (0.1-1.0): Nucleus sampling threshold
- repetition_penalty (1.0-2.0): Reduce repetitive output
- max_tokens: Maximum tokens to generate
Recommended Settings
- Creative writing: temp=0.8-0.9, top_p=0.9, top_k=50
- Factual content: temp=0.3-0.5, top_p=0.8, top_k=20
- Code generation: temp=0.2-0.4, top_p=0.7, top_k=10
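To make the interaction of temperature, top_k, and top_p concrete, here is a minimal pure-Python sketch of one sampling step. It illustrates the general technique and is not the code in inference_cuda.py or inference_cpu.py; the function name, the order in which the filters are applied, and the omission of repetition_penalty are all assumptions.

```python
import math
import random


def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9, rng=random):
    """Pick one token id from raw logits using temperature, then top-k, then top-p."""
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token ids, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for token_id, p in zip(ranked, probs):
        kept_ids.append(token_id)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break
    # random.choices renormalizes the remaining weights for us.
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 1.5]
print(sample_next_token(toy_logits, temperature=0.3, top_k=20, top_p=0.8))
```

The defaults mirror the creative-writing settings above. Lowering any of the three values narrows the set of candidate tokens, which is why the factual and code presets use smaller numbers.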
Performance
CUDA Performance (GPU):
- Inference Speed: ~50-150+ tokens/sec (depends on GPU)
- Memory Usage: ~2-4GB VRAM
- Features: CUDA autocast, torch.compile optimization
- Latency: ~10-20ms per token
CPU Performance:
- Inference Speed: ~15-25 tokens/sec
- Memory Usage: ~2-3GB RAM
- Features: Multi-threading, CPU-optimized parameters
- Latency: ~40-65ms per token
General:
- Context Length: 1024 tokens maximum
- Model Size: 124M parameters
Acknowledgments
- Based on Andrej Karpathy's nanoGPT implementation
- Trained on HuggingFace's FineWeb-edu dataset
- Uses OpenAI's GPT-2 tokenizer