---
title: continuum-ai
emoji: 🧬
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---

# continuum-ai

SOTA models on your iPhone, MacBook, tiny robots, and virtually ANY GPU. No cloud required.

Experiential Plasticity (经ιͺŒε―ε‘‘ζ€§) β€” the model sculpts its own architecture through experience.

We don't quantize. We don't distill. We structurally reshape the model's architecture through Experiential Plasticity β€” iterative pruning and retraining that makes models smaller AND better. Like biological synaptic pruning during brain development: the connections that fire together wire together, the rest are removed.

The result: models that were designed for datacenters, running on your phone.

Built on the incredible open source work of the Qwen team and the broader open model community. Open weights make this possible β€” we compress and specialize what you generously share.

## The Proof

- **2.6GB code model for iPhone** β€” qwen3.5-4b-code-forged-GGUF: HumanEval 63/85 passing (74.1%), 70% on hard problems, benchmark still running
- **Sonnet 4.6-level on MacBook** β€” qwen3.5-27b-code-forged-mlx-4bit: 15GB, 9 tok/s on M1 32GB
- **35B MoE in 1.8GB** β€” qwen3.5-35b-a3b-compacted-GGUF: 256 experts pruned to 16
- **+24% better at code** β€” qwen3.5-4b-code-forged: perplexity 3.04 β†’ 2.31 after forging
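The +24% headline number is the relative perplexity reduction, which can be checked directly from the two figures above:

```python
# Relative improvement from the reported perplexities (3.04 -> 2.31)
ppl_before, ppl_after = 3.04, 2.31
improvement = (ppl_before - ppl_after) / ppl_before
print(f"{improvement:.1%}")  # 24.0%
```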

We target every device tier. Same technique, different compaction levels. Be competitive at ANY size.

## Device Targets

| Device | RAM | Our Model | Size |
|---|---|---|---|
| RTX 5090 | 32GB | qwen3.5-27b-code-forged (fp16) | 17GB |
| MacBook Pro 32GB | 32GB | qwen3.5-27b-code-forged-mlx-4bit | 15GB |
| RTX 3090 | 24GB | qwen3.5-27b-code-forged (4-bit) | 17GB |
| MacBook Air 16GB | 16GB | qwen3.5-4b-code-forged Q8_0 | 4.2GB |
| iPhone 17 / Android | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| MacBook Air 8GB | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| Raspberry Pi 5 | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| Roomba j7+ | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |

Yes, really. The iRobot Roomba j7+ has a Qualcomm QCS6490 with 8GB RAM β€” the same memory budget as an iPhone 17. Our 2.6GB Q4_K_M model fits with room to spare. Any ARM SoC with 4GB+ RAM can run these models via llama.cpp.
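As a sanity check on the file sizes above, a rough back-of-envelope estimate works from average bits per weight. Q4_K_M averages roughly 4.5 bits/weight and Q8_0 roughly 8.5 bits/weight in llama.cpp (approximations; exact rates vary per tensor, and embeddings plus metadata add overhead beyond the raw-weight figure):

```python
# Rough GGUF size estimate from parameter count and average bits per weight.
# Bit-rates here are approximations, not exact llama.cpp figures.
def gguf_size_gb(n_params: float, bits_per_weight: float = 4.5) -> float:
    """Estimated on-disk size in GB for a quantized model."""
    return n_params * bits_per_weight / 8 / 1e9

print(gguf_size_gb(4e9))        # ~2.25 GB raw weights at Q4_K_M for a 4B model
print(gguf_size_gb(4e9, 8.5))   # ~4.25 GB at Q8_0, near the 4.2GB listed
```

Both estimates land close to the 2.6GB and 4.2GB files in the table once per-tensor variation and metadata are accounted for.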

## Published Models

### Qwen3.5 β€” Forged (Code Domain)

| Model | Base | Domain | Improvement | Size | Runs On |
|---|---|---|---|---|---|
| qwen3.5-27b-code-forged-mlx-4bit | Qwen3.5-27B | Code | +3.5% | 15GB | MacBook Pro 32GB (9 tok/s) |
| qwen3.5-27b-code-forged | Qwen3.5-27B | Code | +3.5% | 17GB (4-bit) | RTX 3090/4090/5090 |
| qwen3.5-27b-code-forged-defragged | Qwen3.5-27B | Code | +3.9% | Smaller | RTX 3090/4090/5090 |
| qwen3.5-4b-code-forged | Qwen3.5-4B | Code | +26.6% | 8GB | Any GPU / MacBook |
| qwen3.5-4b-code-forged-GGUF | Qwen3.5-4B | Code | +26.6% | 2.6GB (Q4) | iPhone 17, MacBook Air 8GB |
| qwen3.5-4b-code-forged-defragged | Qwen3.5-4B | Code | +33% | Smaller | Any GPU / MacBook |

### Qwen3.5 β€” Compacted (Expert Pruning)

| Model | Original | Method | Reduction | Runs On |
|---|---|---|---|---|
| qwen3.5-35b-a3b-compacted | Qwen3.5-35B-A3B (256 experts) | Expert pruning to 16 experts | 49GB β†’ 11GB | RTX 3090/4090/5090 |
| qwen3.5-35b-a3b-compacted-GGUF | Same | GGUF Q2_K / Q4_K_M | 1.8GB / 2.7GB | iPhone / MacBook Air |

### Qwen2.5 β€” Compacted (Head + Expert Pruning)

| Model | Original | Method | Reduction |
|---|---|---|---|
| qwen2.5-coder-32b-compacted | Qwen2.5-Coder-32B | Head pruning + mixed quant | 67GB β†’ 14GB |
| qwen2.5-coder-14b-compacted | Qwen2.5-Coder-14B | Head pruning + mixed quant | 27GB β†’ 9GB |

## Scaling Law Experiments

| Model | Params | Improvement | Notes |
|---|---|---|---|
| qwen2.5-0.5b-general-forged | 0.5B | -3.2% | Too small β€” already maximally compressed |
| qwen2.5-1.5b-general-forged | 1.5B | +2.4% | Improvement begins |
| qwen2.5-3b-general-forged | 3.1B | +0.4% | Marginal on generic text |

Larger models harbor more redundancy, which leaves more room for plasticity to improve them. Domain-specific training (code) amplifies the effect dramatically compared with generic text.

## Run on MacBook (2 Commands)

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Sonnet 4.6-level model, 15GB, runs on any 32GB Mac
model, tokenizer = load("continuum-ai/qwen3.5-27b-code-forged-mlx-4bit")
print(generate(model, tokenizer, prompt="def merge_sort(arr):", max_tokens=200))
```

Works on MacBook Pro, MacBook Air (16GB+ for smaller models), Mac Mini, iMac. ~9 tok/s on M1 32GB. Faster on M2/M3/M4.

## Forge Your Own

Three commands. Any NVIDIA GPU with 8GB+ VRAM.

```bash
git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
source .venv/bin/activate
python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code
```

Auto-detects GPU, picks memory tier (fp16 / 4-bit), trains with LoRA + AMP, prunes, defrags, saves. Observable progress via status.json. Works on RTX 3090, 4090, 5090.

## The Science

### Experiential Plasticity

Not compression. Architectural optimization. The model's structure co-evolves with its training:

  1. Train on domain data (LoRA + AMP mixed precision)
  2. Measure each attention head's information contribution
  3. Prune heads that don't contribute to the domain
  4. Retrain β€” surviving heads specialize and compensate
  5. Defrag β€” structurally remove dead heads, free VRAM
  6. Repeat β€” each cycle, the model improves
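Steps 2 and 3 above can be sketched in a few lines. This is a toy illustration, not the sentinel-ai implementation: the importance score here is simply mean absolute activation per head, and the `keep_ratio` threshold is an arbitrary stand-in for the controller's decision.

```python
import numpy as np

def head_importance(activations: np.ndarray) -> np.ndarray:
    """Score each head by mean |activation| over tokens.
    activations shape: (n_heads, n_tokens, head_dim)."""
    return np.abs(activations).mean(axis=(1, 2))

def prune_mask(scores: np.ndarray, keep_ratio: float = 0.75) -> np.ndarray:
    """Boolean mask keeping the top keep_ratio fraction of heads."""
    k = max(1, int(round(keep_ratio * len(scores))))
    keep = np.argsort(scores)[::-1][:k]
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep] = True
    return mask

rng = np.random.default_rng(0)
# Synthetic activations: head h scaled by (h+1), so later heads "contribute" more
acts = rng.normal(size=(8, 16, 64)) * np.arange(1, 9)[:, None, None]
scores = head_importance(acts)
mask = prune_mask(scores, keep_ratio=0.5)
print(mask.sum())  # 4 heads survive out of 8
```

In the real cycle the surviving heads are then retrained so they specialize and compensate before the next pruning pass.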

### Transfer Function

Recovery from pruning follows a measurable exponential: `1.45 * exp(-0.18 * cycle) - 0.03`. This connects transformer optimization to classical control theory β€” the same math used in electrical engineering and robotics for decades.
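Evaluating that fitted curve over a few cycles makes the decay concrete (the function name and loop are illustrative, not from the sentinel-ai codebase):

```python
import math

def recovery(cycle: int) -> float:
    """Fitted recovery curve from the text: 1.45*exp(-0.18*cycle) - 0.03."""
    return 1.45 * math.exp(-0.18 * cycle) - 0.03

for c in range(1, 6):
    print(c, round(recovery(c), 3))
```

The curve is monotonically decaying toward its -0.03 asymptote, so each successive pruning cycle needs less recovery than the last.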

### Continuous Defrag

Traditional pruning masks heads but keeps tensor sizes unchanged. Continuous defrag slices the actual Q/K/V/O weight matrices β€” the model gets physically smaller between cycles:

```text
Cycle 1: 27B params, 17.9GB          -> prune -> defrag -> freed 1.7GB
Cycle 2: 24.5B, 16.2GB, batch=2      -> prune -> defrag -> freed 1.7GB (2x faster)
Cycle 3: 22B,   14.5GB, batch=3                                        (2.8x faster)
```

40% faster total training. 33% smaller final model.
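The slicing idea can be shown with numpy: instead of zero-masking pruned heads, physically drop their rows and columns from the projection matrices, so the tensors themselves shrink. Shapes, names, and the `keep_heads` list are illustrative assumptions, not the actual defrag code.

```python
import numpy as np

def defrag_qkvo(W_q, W_k, W_v, W_o, keep_heads, head_dim):
    """Physically remove pruned heads from the Q/K/V/O projections.
    W_q/W_k/W_v shape: (n_heads*head_dim, d_model); W_o: (d_model, n_heads*head_dim)."""
    rows = np.concatenate([np.arange(h * head_dim, (h + 1) * head_dim)
                           for h in keep_heads])
    return W_q[rows], W_k[rows], W_v[rows], W_o[:, rows]

d_model, n_heads, head_dim = 64, 8, 8
rng = np.random.default_rng(1)
Wq = rng.normal(size=(n_heads * head_dim, d_model))
Wk = rng.normal(size=(n_heads * head_dim, d_model))
Wv = rng.normal(size=(n_heads * head_dim, d_model))
Wo = rng.normal(size=(d_model, n_heads * head_dim))

# Keep 3 of 8 heads: the weight matrices get physically smaller
Wq2, Wk2, Wv2, Wo2 = defrag_qkvo(Wq, Wk, Wv, Wo, keep_heads=[0, 2, 5], head_dim=head_dim)
print(Wq.shape, "->", Wq2.shape)  # (64, 64) -> (24, 64)
```

Because the tensors are smaller after each cycle, the freed VRAM can go to a larger batch size, which is where the training speedup comes from.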

### Head Mitosis

Pruning frees slots. Mitosis fills them. When a head is overutilized (high information contribution), it gets cloned into a pruned slot β€” each copy initialized at 50% gate value to maintain output continuity. After continued training, the clones diverge and specialize, just like cell differentiation after biological mitosis.

Experimentally: a cloned head diverged within 500 steps, with the clone achieving higher utilization than the parent in its new role. The model grows new specialized capacity exactly where it's needed.
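The 50% gate trick preserves output continuity at the moment of cloning, which a small numpy check makes explicit (a minimal sketch with a made-up gated-head function, not the actual mitosis code):

```python
import numpy as np

def head_out(W, x, gate):
    """A single gated head's contribution: gate * (W @ x)."""
    return gate * (W @ x)

rng = np.random.default_rng(2)
W_parent = rng.normal(size=(8, 16))
x = rng.normal(size=16)

# Before mitosis: one head at full gate
before = head_out(W_parent, x, gate=1.0)

# After mitosis: parent + identical clone, each at 50% gate
W_clone = W_parent.copy()
after = head_out(W_parent, x, 0.5) + head_out(W_clone, x, 0.5)

print(np.allclose(before, after))  # True: the model's output is unchanged
```

Continued training then breaks the symmetry: gradients differ once the gates and weights drift, so the clones diverge and specialize.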

### Self-Directed Controller

The AdaptivePlasticityController observes the model and makes all decisions β€” pruning ratio, strategy, training budget, stopping criteria. No human hyperparameters needed.

## Papers

## Links