
Ursa_Minor_Smashed

A GPT-2 (124M parameters) model trained from scratch on the FineWeb-edu dataset using a modern training recipe. This model reproduces the GPT-2 small architecture with updated training techniques including Flash Attention, mixed precision training, and improved optimization strategies.

Model Details

Model Description

  • Developed by: Kaileh57
  • Model type: GPT-2 (Decoder-only Transformer)
  • Language(s): English
  • License: MIT
  • Finetuned from model: Not applicable (trained from scratch)

Model Architecture

| Parameter | Value |
|---|---|
| Parameters | 124M |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 1024 |
| Vocabulary Size | 50,304 |
| Activation Function | GELU (tanh approximation) |
| Position Embeddings | Learned |
| Layer Norm | Pre-normalization |
| Attention Type | Multi-head with Flash Attention |
| Weight Tying | Token embeddings tied with output projection |
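
As a sanity check on the 124M figure, the parameter count implied by the table (with tied embeddings, so the output head adds no extra weights) can be tallied directly. The per-block breakdown below is the standard GPT-2 layout, not taken from this repo's code:

```python
# Rough GPT-2 parameter count from the architecture table above.
n_layer, d, vocab, ctx = 12, 768, 50_304, 1024

wte = vocab * d            # token embeddings (tied with output projection)
wpe = ctx * d              # learned position embeddings
per_block = (
    d * 3 * d + 3 * d      # fused QKV projection (weights + biases)
    + d * d + d            # attention output projection
    + d * 4 * d + 4 * d    # MLP up-projection
    + 4 * d * d + d        # MLP down-projection
    + 4 * d                # two LayerNorms (scale + bias each)
)
final_ln = 2 * d
total = wte + wpe + n_layer * per_block + final_ln
print(f"{total / 1e6:.1f}M parameters")  # ≈ 124.5M
```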

Training Details

  • Dataset: FineWeb-edu (10B token sample)
  • Training Regime: Mixed precision (bfloat16)
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
  • Learning Rate: 1.5e-3 with cosine decay
  • Batch Size: 524,288 tokens
  • Training Steps: 19,073 (1 epoch)
  • Warmup Steps: 715
  • Weight Decay: 0.1
  • Gradient Clipping: 1.0
  • Hardware: NVIDIA RTX A6000 (48GB)
  • Training Time: ~2 days
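
These numbers are internally consistent: 19,073 steps at 524,288 tokens per step is one pass over the 10B-token sample, and since a 524,288-token batch cannot fit in memory at once, it is assembled via gradient accumulation. The micro-batch size below is illustrative, not the repo's actual setting:

```python
batch_tokens = 524_288        # tokens per optimizer step (2**19)
steps = 19_073
seq_len = 1024

total_tokens = batch_tokens * steps
print(f"{total_tokens / 1e9:.2f}B tokens")    # ≈ 10.00B, i.e. one epoch

# With a hypothetical micro-batch of 16 sequences, each optimizer step
# accumulates gradients over this many forward/backward passes:
micro_batch = 16              # sequences per forward pass (assumed)
grad_accum_steps = batch_tokens // (micro_batch * seq_len)
print(grad_accum_steps)       # 32
```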

Performance

| Metric | Value |
|---|---|
| HellaSwag accuracy | 32.4% |
| Final validation loss | 2.85 |
| Validation perplexity | ~17.3 |

Note: Scores are from single-epoch training. Multi-epoch training reaches ~35% on HellaSwag.
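
The perplexity figure follows directly from the loss: perplexity is the exponential of the per-token cross-entropy. A quick check:

```python
import math

final_loss = 2.85                  # cross-entropy (nats/token) from the table
perplexity = math.exp(final_loss)
print(f"{perplexity:.1f}")         # ≈ 17.3
```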

Uses

Direct Use

This model can be used for:

  • Text generation
  • Research on efficient training methods
  • Educational purposes to understand GPT architectures
  • Fine-tuning for specific downstream tasks

Out-of-Scope Use

  • Production applications requiring high reliability
  • Generation of factual information (model may hallucinate)
  • Any use case requiring larger context than 1024 tokens

Quick Start

Installation

CPU Installation (Default)

# Clone the repository
git clone https://github.com/Kaileh57/Ursa_Minor_Smashed.git
cd Ursa_Minor_Smashed

# Set up CPU environment
pip install -r requirements.txt
# or use the setup script: ./setup.sh

CUDA Installation (For GPU Acceleration)

If you have a CUDA-capable GPU and want to use GPU acceleration:

Linux/macOS:

# Use the automated CUDA setup script
chmod +x setup-cuda.sh
./setup-cuda.sh

Windows:

# Use the Windows CUDA setup script
setup-cuda.bat

Manual CUDA Installation:

# Create separate CUDA environment
python -m venv venv-cuda
source venv-cuda/bin/activate  # On Windows: venv-cuda\Scripts\activate

# Install CUDA requirements
pip install -r requirements-cuda.txt

CUDA Requirements:

  • NVIDIA GPU with CUDA Compute Capability 3.5 or higher
  • NVIDIA drivers installed
  • CUDA 11.8 or 12.1 toolkit (optional, PyTorch includes CUDA runtime)
  • At least 4GB GPU memory recommended

Basic Usage

Command Line Interface

CUDA Version (GPU):

# Basic text generation
python inference_cuda.py --prompt "Hello, I'm a language model" --max-tokens 50

# Creative writing
python inference_cuda.py --prompt "Once upon a time" --max-tokens 200 --temperature 0.9

# More focused output
python inference_cuda.py --prompt "The key to machine learning is" --max-tokens 100 --temperature 0.7 --top-k 50

CPU Version:

# Basic text generation
python inference_cpu.py --prompt "Hello, I'm a language model" --max-tokens 50

# Creative writing
python inference_cpu.py --prompt "Once upon a time" --max-tokens 120 --temperature 0.9

# More focused output
python inference_cpu.py --prompt "The key to machine learning is" --max-tokens 80 --temperature 0.7 --top-k 30

Python Interface

CUDA Version:

from inference_cuda import generate_direct, load_model_direct

# Load model once (requires CUDA)
model = load_model_direct("model_optimized.pt")

# Generate text with CUDA optimizations
result = generate_direct(
    model, 
    "Hello, I'm a language model", 
    max_new_tokens=100,  # Higher tokens for GPU
    temperature=0.8,
    top_k=50            # Higher top_k for better quality
)
print(result)

CPU Version:

from inference_cpu import generate_direct, load_model_direct

# Load model once (CPU only)
model = load_model_direct("model_optimized.pt")

# Generate text with CPU optimizations
result = generate_direct(
    model, 
    "Hello, I'm a language model", 
    max_new_tokens=80,   # Lower tokens for CPU efficiency
    temperature=0.8,
    top_k=30            # Lower top_k for CPU efficiency
)
print(result)

Chat Interface

CUDA Version:

# Start CUDA-optimized chat
python chat_cuda.py

CPU Version:

# Start CPU-optimized chat
python chat_cpu.py

Training Procedure

The model was trained using a modern GPT training recipe including:

  • Flash Attention for efficient attention computation
  • Mixed precision training with bfloat16
  • Gradient accumulation to achieve large batch sizes
  • TF32 for faster matrix multiplications
  • Optimized vocabulary size (50,304) for better GPU utilization

Training Hyperparameters

  • Learning rate schedule: Cosine decay from 1.5e-3 to 1.5e-4
  • Gradient accumulation steps: Computed at runtime from the 524,288-token batch target and the per-device micro-batch size
  • Mixed precision: bfloat16 with PyTorch autocast
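
Putting the listed numbers together (linear warmup over 715 steps to 1.5e-3, then cosine decay to 1.5e-4 over the remaining steps), the schedule can be sketched as follows; the exact warmup shape is an assumption and the repo's code may differ in off-by-one details:

```python
import math

MAX_LR, MIN_LR = 1.5e-3, 1.5e-4
WARMUP, MAX_STEPS = 715, 19_073

def get_lr(step):
    # Linear warmup to the peak learning rate.
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP
    # Cosine decay from MAX_LR down to MIN_LR.
    ratio = (step - WARMUP) / (MAX_STEPS - WARMUP)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))   # goes 1 -> 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

print(get_lr(714))     # peak: 1.5e-3
print(get_lr(19_072))  # near the floor: ~1.5e-4
```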

Evaluation

Testing Data

Evaluated on:

  • FineWeb-edu validation set
  • HellaSwag benchmark (10,042 examples)

Metrics

  • Cross-entropy loss on validation set
  • Accuracy on HellaSwag commonsense reasoning

Technical Specifications

Compute Infrastructure

  • Hardware: 1x NVIDIA RTX A6000 (48GB VRAM)
  • Software: PyTorch 2.0+, CUDA 12.1, Flash Attention 2

Model Initialization

  • Weights initialized with σ=0.02 (scaled by 1/√(2×n_layers) for residual projections)
  • Embeddings initialized with σ=0.02
  • Biases initialized to zero
  • LayerNorm weights initialized to 1.0, biases to 0.0
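
The residual-projection scaling exists because each of the 12 blocks adds two projections (attention output and MLP output) into the residual stream; shrinking their init std by 1/√(2×n_layers) keeps the stream's variance roughly constant at depth. Concretely:

```python
import math

SIGMA = 0.02
N_LAYER = 12

# 2 residual-stream projections per block => 2 * n_layers contributions total.
residual_sigma = SIGMA / math.sqrt(2 * N_LAYER)
print(f"{residual_sigma:.5f}")   # ≈ 0.00408
```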

Citation

If you use this model, please cite:

@misc{ursa-minor-smashed,
  author = {Kaileh57},
  title = {Ursa Minor Smashed: Efficient GPT-2 Training},
  year = {2024},
  url = {https://github.com/Kaileh57/Ursa_Minor_Smashed}
}

Available Tools

Core Scripts

  • inference_cuda.py - CUDA-optimized inference script
  • inference_cpu.py - CPU-optimized inference script
  • chat_cuda.py - CUDA-optimized chat interface
  • chat_cpu.py - CPU-optimized chat interface
  • benchmark_cuda.py - CUDA performance benchmarking tool
  • benchmark_cpu.py - CPU performance benchmarking tool
  • convert_to_gguf.py - Convert to GGUF format for llama.cpp

Examples

  • examples/basic_usage_cuda.py - CUDA usage examples
  • examples/basic_usage_cpu.py - CPU usage examples

Parameters

Generation Parameters

  • temperature (0.1-1.0): Controls randomness (lower = more focused)
  • top_k (1-100): Limit to top-k most likely tokens
  • top_p (0.1-1.0): Nucleus sampling threshold
  • repetition_penalty (1.0-2.0): Reduce repetitive output
  • max_tokens: Maximum tokens to generate
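
Mechanically, temperature rescales the logits before the softmax (lower values sharpen the distribution) and top-k zeroes out everything outside the k most likely tokens. A dependency-free sketch of that sampling step (not the repo's implementation):

```python
import math, random

def sample(logits, temperature=0.8, top_k=50, rng=random):
    # Temperature: divide logits before softmax (lower => more focused).
    scaled = [l / temperature for l in logits]
    # Top-k: keep only the k highest logits, drop the rest.
    cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
    kept = [(i, l) for i, l in enumerate(scaled) if l >= cutoff]
    # Softmax over the surviving tokens (subtract max for stability).
    m = max(l for _, l in kept)
    weights = [math.exp(l - m) for _, l in kept]
    # Draw one token index proportionally to its weight.
    r = rng.random() * sum(weights)
    for (i, _), w in zip(kept, weights):
        r -= w
        if r <= 0:
            return i
    return kept[-1][0]

token = sample([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=2)
print(token)  # 0 or 1: only the two highest-logit tokens survive top-k
```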

Recommended Settings

  • Creative writing: temp=0.8-0.9, top_p=0.9, top_k=50
  • Factual content: temp=0.3-0.5, top_p=0.8, top_k=20
  • Code generation: temp=0.2-0.4, top_p=0.7, top_k=10

Performance

CUDA Performance (GPU):

  • Inference Speed: ~50-150+ tokens/sec (depends on GPU)
  • Memory Usage: ~2-4GB VRAM
  • Features: CUDA autocast, torch.compile optimization
  • Latency: ~10-20ms per token

CPU Performance:

  • Inference Speed: ~15-25 tokens/sec
  • Memory Usage: ~2-3GB RAM
  • Features: Multi-threading, CPU-optimized parameters
  • Latency: ~40-65ms per token

General:

  • Context Length: 1024 tokens maximum
  • Model Size: 124M parameters

Acknowledgments

  • Based on Andrej Karpathy's nanoGPT implementation
  • Trained on HuggingFace's FineWeb-edu dataset
  • Uses OpenAI's GPT-2 tokenizer