PruneHeal-13M

A 13.2M-parameter language model trained with the Prune-Heal methodology on a single NVIDIA RTX 3090.

One of the smallest language models you will find published with standard lm-evaluation-harness benchmark scores.

Benchmark Results (lm-evaluation-harness, 0-shot)

Benchmark	Metric	Score	Random Baseline
PIQA	acc	55.98%	50%
WinoGrande	acc	50.28%	50%
BoolQ	acc	46.02%	50%
ARC-Easy	acc	32.79%	25%
HellaSwag	acc_norm	25.22%	25%
ARC-Challenge	acc_norm	20.73%	25%
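
To make the table easier to read against chance, the following minimal sketch subtracts each random baseline from its score. The numbers are copied from the table above; the helper name `margin_over_chance` is illustrative and not part of any evaluation tooling.

```python
# Scores and random baselines copied from the benchmark table (percent).
results = {
    "PIQA": (55.98, 50.0),
    "WinoGrande": (50.28, 50.0),
    "BoolQ": (46.02, 50.0),
    "ARC-Easy": (32.79, 25.0),
    "HellaSwag": (25.22, 25.0),
    "ARC-Challenge": (20.73, 25.0),
}


def margin_over_chance(score, baseline):
    """Return score minus random baseline, in percentage points."""
    return round(score - baseline, 2)


margins = {task: margin_over_chance(s, b) for task, (s, b) in results.items()}

# Print tasks from largest to smallest margin over chance.
for task, delta in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:<14}{delta:+.2f} pp")
```

On these figures, ARC-Easy and PIQA sit several points above chance, WinoGrande and HellaSwag are within a fraction of a point of it, and BoolQ and ARC-Challenge fall below it.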

What is Prune-Heal?

A training method that aims to decouple loss from perplexity. The idea: low loss (accurate predictions) combined with high perplexity (broad token distributions) yields a model that reasons instead of memorizing.

Training Pipeline

Pretrain on 72M tokens (Wikipedia + TinyStories + Plato)
Prune: iterative magnitude pruning removes 37% of weights across 4 cycles
Heal: retrain without masks, so that pruned weights regenerate from the gradient signal
Q&A: three-phase training (Q&A together, questions, answers), repeated for 3 rounds
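
The pruning step can be sketched in plain Python as below. This is a toy illustration, not the author's training code: it works on a flat list of weights, and the per-cycle schedule (the same fraction of surviving weights removed in each of the 4 cycles, chosen so the total reaches 37%) is an assumption, since the card does not state how the 37% is split across cycles. The training between cycles and the mask-free heal phase are omitted.

```python
import random


def magnitude_prune(weights, fraction):
    """Zero the smallest-magnitude `fraction` of the still-nonzero weights."""
    alive = [i for i, w in enumerate(weights) if w != 0.0]
    k = round(len(alive) * fraction)
    # Indices of the k surviving weights with the smallest absolute value.
    victims = sorted(alive, key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in victims:
        pruned[i] = 0.0
    return pruned


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


TARGET = 0.37  # total fraction removed, from the pipeline description
CYCLES = 4     # number of pruning cycles, from the pipeline description
# Assumed schedule: solve 1 - (1 - p)**CYCLES = TARGET for the per-cycle p.
per_cycle = 1.0 - (1.0 - TARGET) ** (1.0 / CYCLES)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]  # toy weights
for cycle in range(1, CYCLES + 1):
    weights = magnitude_prune(weights, per_cycle)
    # Real training steps would run here; they are left out of this sketch.
    print(f"cycle {cycle}: sparsity = {sparsity(weights):.3f}")
```

In the heal phase no mask is re-applied, so the zeroed entries receive ordinary gradient updates and can move away from zero again.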

Key Numbers

13,190,784 parameters (13.2M)
Loss: 2.8 alongside perplexity: 21+ (reported as decoupled)
Training time: ~45 minutes on a single RTX 3090
VRAM: <2GB
Training data: 72M tokens (Wikipedia, TinyStories, Plato)
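
For reference, under the standard definition perplexity is the exponential of the mean per-token cross-entropy measured on the same tokens. The sketch below works through that arithmetic for the loss and perplexity figures listed above. It assumes the reported loss is a natural-log cross-entropy; the card does not state which data or definition each figure uses.

```python
import math


def perplexity_from_loss(loss_nats):
    """Perplexity implied by a mean cross-entropy loss given in nats."""
    return math.exp(loss_nats)


def loss_from_perplexity(perplexity):
    """Mean cross-entropy (nats) implied by a perplexity."""
    return math.log(perplexity)


print(round(perplexity_from_loss(2.8), 2))   # 16.44
print(round(loss_from_perplexity(21.0), 2))  # 3.04
```

Under that definition a loss of 2.8 corresponds to a perplexity of about 16.4, so a perplexity of 21+ alongside it implies the two figures were measured on different data or with different definitions.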

Architecture

Standard LLaMA architecture:

6 layers, d_model=192, 6 attention heads
SwiGLU activation, RMSNorm
GPT-2 BPE tokenizer (50,257 tokens)
256-token context length
Weight-tied embeddings
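
The stated parameter count can be reproduced from these settings. The sketch below counts parameters for a LLaMA-style decoder under a few assumptions that the card does not state: a feed-forward width of 768 (4 x d_model), full multi-head attention with no biases, and parameter-free rotary position embeddings. The function name `llama_param_count` is illustrative.

```python
def llama_param_count(vocab, d_model, n_layers, d_ff, tied=True):
    """Count parameters of a LLaMA-style decoder (no biases, RMSNorm)."""
    embed = vocab * d_model            # token embedding matrix
    attn = 4 * d_model * d_model       # q, k, v and output projections
    mlp = 3 * d_model * d_ff           # SwiGLU: gate, up and down projections
    norms = 2 * d_model                # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    final_norm = d_model               # RMSNorm before the output head
    lm_head = 0 if tied else vocab * d_model  # tied head reuses the embedding
    return embed + n_layers * per_layer + final_norm + lm_head


# d_ff=768 is an assumption; it is the value that matches the stated total.
total = llama_param_count(vocab=50257, d_model=192, n_layers=6, d_ff=768)
print(f"{total:,} parameters")                   # 13,190,784 parameters
print(f"{total * 4 / 1e6:.1f} MB at 4 bytes each")  # 52.8 MB
```

With these assumptions the embedding table accounts for 9,649,344 of the 13,190,784 parameters, roughly 73% of the model.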

The Prune-Heal Insight

Current LLMs chase low perplexity through massive scale. The premise of Prune-Heal is that high perplexity maintained alongside low loss is the signature of reasoning rather than memorization.

A model with perplexity 20+ considers 20+ plausible continuations and selects based on context. That is choice. That is the start of reasoning.
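
One way to make "20+ plausible continuations" concrete is the perplexity of a single next-token distribution, defined as the exponential of its entropy. A minimal sketch with made-up distributions follows; whether the card's perplexity figure is computed this way is an assumption, as the card does not say.

```python
import math


def distribution_perplexity(probs):
    """exp(entropy) of one probability distribution: its effective number of choices."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)


# Hypothetical distributions over 20 candidate tokens.
uniform_20 = [1 / 20] * 20         # every candidate equally likely
peaked = [0.81] + [0.01] * 19      # one dominant candidate

print(f"uniform: {distribution_perplexity(uniform_20):.2f}")
print(f"peaked:  {distribution_perplexity(peaked):.2f}")
```

A distribution spread evenly over 20 tokens has perplexity 20, while one dominated by a single token has a perplexity close to 1 no matter how many tokens are in the vocabulary.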

The prune-heal cycle is intended to achieve this as follows:

Pruning disrupts memorized pathways
Healing allows weights to regenerate into new, more general patterns
The result: same parameter count, but weights that encode structure instead of sequences

Usage

Hardware

Single NVIDIA RTX 3090 (24GB VRAM, <2GB used)
32GB RAM
Trained by one person in spare time

Author

James — Bee Bytez

Safetensors

Model size: 13.2M params
Tensor type: F32
