PruneHeal-13M
A 13.2M-parameter language model trained with the Prune-Heal methodology on a single RTX 3090.
Among the smallest models published with standard benchmark scores.
Benchmark Results (lm-evaluation-harness, 0-shot)
| Benchmark | Metric | Score | Random Baseline |
|---|---|---|---|
| PIQA | acc | 55.98% | 50% |
| WinoGrande | acc | 50.28% | 50% |
| BoolQ | acc | 46.02% | 50% |
| ARC-Easy | acc | 32.79% | 25% |
| HellaSwag | acc_norm | 25.22% | 25% |
| ARC-Challenge | acc_norm | 20.73% | 25% |
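To make the table easier to read at a glance, the sketch below computes each score's margin over its random baseline in percentage points. The numbers are copied from the table; the names `RESULTS` and `margins` are illustrative only.

```python
# Scores and random baselines copied from the table above (percent).
RESULTS = {
    "PIQA": (55.98, 50.0),
    "WinoGrande": (50.28, 50.0),
    "BoolQ": (46.02, 50.0),
    "ARC-Easy": (32.79, 25.0),
    "HellaSwag": (25.22, 25.0),
    "ARC-Challenge": (20.73, 25.0),
}


def margins(results):
    """Return score minus random baseline, in percentage points."""
    return {
        name: round(score - baseline, 2)
        for name, (score, baseline) in results.items()
    }


if __name__ == "__main__":
    # Print the largest margins first.
    for name, delta in sorted(margins(RESULTS).items(), key=lambda kv: -kv[1]):
        print(f"{name:<14} {delta:+.2f} pp")
```

On these figures ARC-Easy and PIQA sit clearly above the random baseline, WinoGrande and HellaSwag are within a fraction of a point of it, and BoolQ and ARC-Challenge fall below it.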
What is Prune-Heal?
A training method that decouples loss from perplexity. Low loss (accurate predictions) + high perplexity (broad token distributions) = a model that reasons instead of memorizes.
Training Pipeline
- Pretrain on 72M tokens (Wikipedia + TinyStories + Plato)
- Prune: iterative magnitude pruning removes 37% of weights across 4 cycles
- Heal: retrain without masks, pruned weights regenerate from gradient signal
- Q&A: three-phase training (Q&A together, questions, answers) for 3 rounds
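The prune and heal steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the author's training code: the per-cycle schedule (an equal share of the surviving weights removed each cycle), the global magnitude threshold, and the `retrain` hook are all assumptions, since the card only states the 37% total and the 4 cycles.

```python
import random


def magnitude_mask(weights, sparsity):
    """Return a keep-mask that drops the smallest-magnitude weights."""
    k = round(sparsity * len(weights))  # number of weights to zero out
    if k <= 0:
        return [True] * len(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]  # k-th smallest magnitude
    return [abs(w) > threshold for w in weights]


def iterative_prune(weights, total_sparsity=0.37, cycles=4, retrain=None):
    """Prune in `cycles` steps up to `total_sparsity`; return (weights, mask)."""
    w = list(weights)
    mask = [True] * len(w)
    for c in range(1, cycles + 1):
        # Assumed schedule: remove an equal share of surviving weights per cycle.
        target = 1.0 - (1.0 - total_sparsity) ** (c / cycles)
        mask = magnitude_mask(w, target)
        w = [x if keep else 0.0 for x, keep in zip(w, mask)]
        if retrain is not None:
            # Hypothetical hook: masked training between pruning cycles.
            w = retrain(w, mask)
    return w, mask


def heal(weights, retrain):
    """Heal step: retrain with no mask so pruned entries can move off zero."""
    return retrain(weights, None)


if __name__ == "__main__":
    rng = random.Random(0)
    w0 = [rng.gauss(0.0, 0.02) for _ in range(10_000)]
    pruned, mask = iterative_prune(w0)
    print(f"sparsity after pruning: {mask.count(False) / len(mask):.2%}")
```

In a real run `retrain` would be a gradient-based training loop; passing `None` as the mask in `heal` mirrors the "retrain without masks" step, where previously zeroed weights are free to receive updates again.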
Key Numbers
- 13,190,784 parameters (13.2M)
- Loss: 2.8 with Perplexity: 21+ (decoupled)
- Training time: ~45 minutes on a single RTX 3090
- VRAM: <2GB
- Training data: 72M tokens (Wikipedia, TinyStories, Plato)
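The card does not say how, or on which data, the loss and perplexity figures were measured. As background for comparing them, the textbook definition links the two quantities: perplexity is the exponential of the mean per-token cross-entropy measured in nats. The sketch below applies only that standard conversion; the function names are illustrative.

```python
import math


def perplexity_from_loss(mean_nll_nats):
    """Perplexity implied by a mean per-token cross-entropy (in nats)."""
    return math.exp(mean_nll_nats)


def loss_from_perplexity(perplexity):
    """Mean per-token cross-entropy (in nats) implied by a perplexity."""
    return math.log(perplexity)


if __name__ == "__main__":
    print(f"loss 2.8 -> perplexity {perplexity_from_loss(2.8):.2f}")
    print(f"perplexity 21 -> loss {loss_from_perplexity(21.0):.2f}")
```

Under this definition a loss of 2.8 corresponds to a perplexity of about 16.4, and a perplexity of 21 to a loss of about 3.04, so the two reported figures presumably come from different data or from a different way of measuring perplexity.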
Architecture
Standard LLaMA architecture:
- 6 layers, d_model=192, 6 attention heads
- SwiGLU activation, RMSNorm
- GPT-2 BPE tokenizer (50,257 tokens)
- 256 token context length
- Weight-tied embeddings
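As a sanity check, the architecture above can be tallied against the stated parameter count. The sketch assumes a bias-free LLaMA-style block with full multi-head attention (four `d_model` x `d_model` projections), two RMSNorm vectors per layer plus one final norm, and a feed-forward width of 768. The feed-forward width is not given in the card; 768 (4 x `d_model`) is an assumption, chosen because it reproduces the stated total exactly.

```python
def llama_param_count(vocab=50_257, d_model=192, n_layers=6, d_ff=768, tied=True):
    """Count parameters of a bias-free LLaMA-style decoder.

    d_ff=768 is an assumed value; the model card does not state it.
    """
    embed = vocab * d_model                # token embedding matrix
    attn = 4 * d_model * d_model           # q, k, v and output projections
    mlp = 3 * d_model * d_ff               # SwiGLU: gate, up and down projections
    norms = 2 * d_model                    # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    final_norm = d_model                   # RMSNorm before the output head
    lm_head = 0 if tied else vocab * d_model
    return embed + n_layers * per_layer + final_norm + lm_head


if __name__ == "__main__":
    total = llama_param_count()
    print(f"{total:,} parameters")
    print(f"embedding share: {50_257 * 192 / total:.1%}")
```

With these assumptions the total is 13,190,784, matching the Key Numbers entry. The tied embedding matrix alone accounts for 9,649,344 of those parameters, so the six transformer layers together hold roughly 3.5M.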
The Prune-Heal Insight
Current LLMs chase low perplexity through massive scale. PruneHeal shows that high perplexity maintained alongside low loss is the signature of reasoning rather than memorization.
A model with perplexity 20+ considers 20+ plausible continuations and selects based on context. That is choice. That is the start of reasoning.
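The reading of perplexity as a number of plausible continuations can be illustrated with a single next-token distribution, where perplexity is the exponential of the Shannon entropy. This is a small sketch with an illustrative function name: for a uniform distribution over k tokens the value is exactly k, while a peaked distribution over the same k tokens gives a smaller value.

```python
import math


def distribution_perplexity(probs):
    """exp(Shannon entropy in nats) of one next-token distribution."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)


if __name__ == "__main__":
    uniform = [1 / 20] * 20                 # 20 equally likely tokens
    peaked = [0.9] + [0.1 / 19] * 19        # one dominant token, 19 unlikely ones
    print(f"uniform over 20 tokens: {distribution_perplexity(uniform):.2f}")
    print(f"peaked over 20 tokens:  {distribution_perplexity(peaked):.2f}")
```

So "20+ plausible continuations" is best read as an effective count: it is exact only when the candidates are equally likely, and it describes how spread out the predicted distribution is rather than how many tokens the model explicitly compares.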
The prune-heal cycle is intended to achieve this as follows:
- Pruning disrupts memorized pathways
- Healing allows weights to regenerate into new, more general patterns
- The result: same parameter count, but weights that encode structure instead of sequences
Usage
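A minimal sketch of text generation with the Transformers `pipeline` helper, assuming the checkpoint loads as a standard text-generation model. The helper names `clamp_new_tokens` and `generate` are illustrative; the 256-token limit comes from the Architecture section.

```python
MODEL_ID = "Jackedupbruh/PruneHeal-13M"
CONTEXT_LENGTH = 256  # context length listed in the Architecture section


def clamp_new_tokens(prompt_tokens, requested, context=CONTEXT_LENGTH):
    """Limit generation so prompt plus new tokens fit in the context window."""
    return max(0, min(requested, context - prompt_tokens))


def generate(prompt, max_new_tokens=64):
    """Generate a continuation; downloads the model on first use."""
    # Imported lazily so clamp_new_tokens works without transformers installed.
    from transformers import pipeline

    pipe = pipeline("text-generation", model=MODEL_ID)
    prompt_tokens = len(pipe.tokenizer(prompt)["input_ids"])
    budget = clamp_new_tokens(prompt_tokens, max_new_tokens)
    if budget == 0:
        raise ValueError("prompt already fills the 256-token context window")
    return pipe(prompt, max_new_tokens=budget)[0]["generated_text"]
```

Calling `generate("Once upon a time,")` returns the prompt followed by the model's continuation. Because the context window is short, clamping the generation budget keeps long prompts from running past it.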
Hardware
- Single NVIDIA RTX 3090 (24GB VRAM, <2GB used)
- 32GB RAM
- Trained by one person in spare time
Author
James, Bee Bytez