---
title: continuum-ai
emoji: 🧬
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---

# continuum-ai

SOTA models on your iPhone, MacBook, tiny robots, and virtually ANY GPU. No cloud required.

Experiential Plasticity (经ιͺŒε―ε‘‘ζ€§) β€” the model sculpts its own architecture through experience.

We don't quantize. We don't distill. We structurally reshape the model's architecture through Experiential Plasticity β€” iterative pruning and retraining that makes models smaller AND better. Like biological synaptic pruning during brain development: the connections that fire together wire together, the rest are removed.

The result: models that were designed for datacenters, running on your phone.

Built on the incredible open source work of the Qwen team and the broader open model community. Open weights make this possible β€” we compress and specialize what you generously share.

## The Proof

- **2.6GB code model for iPhone** β€” qwen3.5-4b-code-forged-GGUF: HumanEval 63/85 passing (74.1%), 70% on hard problems, benchmark still running
- **Sonnet 4.6-level on MacBook** β€” qwen3.5-27b-code-forged-mlx-4bit: 15GB, 9 tok/s on M1 32GB
- **35B MoE in 1.8GB** β€” qwen3.5-35b-a3b-compacted-GGUF: 256 experts pruned to 16
- **+24% better at code** β€” qwen3.5-4b-code-forged: perplexity 3.04 β†’ 2.31 after forging
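The +24% headline number is the relative perplexity reduction, which can be checked directly from the two figures above:

```python
# Relative improvement from the reported perplexities (3.04 -> 2.31)
ppl_before, ppl_after = 3.04, 2.31
improvement = (ppl_before - ppl_after) / ppl_before
print(f"{improvement:.1%}")  # 24.0%
```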

We target every device tier. Same technique, different compaction levels. Be competitive at ANY size.

## Device Targets

| Device | RAM | Our Model | Size |
|---|---|---|---|
| RTX 5090 | 32GB | qwen3.5-27b-code-forged (fp16) | 17GB |
| MacBook Pro 32GB | 32GB | qwen3.5-27b-code-forged-mlx-4bit | 15GB |
| RTX 3090 | 24GB | qwen3.5-27b-code-forged (4-bit) | 17GB |
| MacBook Air 16GB | 16GB | qwen3.5-4b-code-forged Q8_0 | 4.2GB |
| iPhone 17 / Android | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| MacBook Air 8GB | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| Raspberry Pi 5 | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| Roomba j7+ | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |

Yes, really. The iRobot Roomba j7+ has a Qualcomm QCS6490 with 8GB RAM β€” the same memory budget as an iPhone 17. Our 2.6GB Q4_K_M model fits with room to spare. Any ARM SoC with 4GB+ RAM can run these models via llama.cpp.
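As a sanity check on the file sizes above, a rough back-of-envelope estimate works from average bits per weight. Q4_K_M averages roughly 4.5 bits/weight and Q8_0 roughly 8.5 bits/weight in llama.cpp (approximations; exact rates vary per tensor, and embeddings plus metadata add overhead beyond the raw-weight figure):

```python
# Rough GGUF size estimate from parameter count and average bits per weight.
# Bit-rates here are approximations, not exact llama.cpp figures.
def gguf_size_gb(n_params: float, bits_per_weight: float = 4.5) -> float:
    """Estimated on-disk size in GB for a quantized model."""
    return n_params * bits_per_weight / 8 / 1e9

print(gguf_size_gb(4e9))        # ~2.25 GB raw weights at Q4_K_M for a 4B model
print(gguf_size_gb(4e9, 8.5))   # ~4.25 GB at Q8_0, near the 4.2GB listed
```

Both estimates land close to the 2.6GB and 4.2GB files in the table once per-tensor variation and metadata are accounted for.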

## Published Models

### Qwen3.5 β€” Forged (Code Domain)

| Model | Base | Domain | Improvement | Size | Runs On |
|---|---|---|---|---|---|
| qwen3.5-27b-code-forged-mlx-4bit | Qwen3.5-27B | Code | +3.5% | 15GB | MacBook Pro 32GB (9 tok/s) |
| qwen3.5-27b-code-forged | Qwen3.5-27B | Code | +3.5% | 17GB (4-bit) | RTX 3090/4090/5090 |
| qwen3.5-27b-code-forged-defragged | Qwen3.5-27B | Code | +3.9% | Smaller | RTX 3090/4090/5090 |
| qwen3.5-4b-code-forged | Qwen3.5-4B | Code | +26.6% | 8GB | Any GPU / MacBook |
| qwen3.5-4b-code-forged-GGUF | Qwen3.5-4B | Code | +26.6% | 2.6GB (Q4) | iPhone 17, MacBook Air 8GB |
| qwen3.5-4b-code-forged-defragged | Qwen3.5-4B | Code | +33% | Smaller | Any GPU / MacBook |

### Qwen3.5 β€” Compacted (Expert Pruning)

| Model | Original | Method | Reduction | Runs On |
|---|---|---|---|---|
| qwen3.5-35b-a3b-compacted | Qwen3.5-35B-A3B (256 experts) | Expert pruning to 16 experts | 49GB β†’ 11GB | RTX 3090/4090/5090 |
| qwen3.5-35b-a3b-compacted-GGUF | Same | GGUF Q2_K / Q4_K_M | 1.8GB / 2.7GB | iPhone / MacBook Air |

### Qwen2.5 β€” Compacted (Head + Expert Pruning)

| Model | Original | Method | Reduction |
|---|---|---|---|
| qwen2.5-coder-32b-compacted | Qwen2.5-Coder-32B | Head pruning + mixed quant | 67GB β†’ 14GB |
| qwen2.5-coder-14b-compacted | Qwen2.5-Coder-14B | Head pruning + mixed quant | 27GB β†’ 9GB |

## Scaling Law Experiments

| Model | Params | Improvement | Notes |
|---|---|---|---|
| qwen2.5-0.5b-general-forged | 0.5B | -3.2% | Too small β€” already maximally compressed |
| qwen2.5-1.5b-general-forged | 1.5B | +2.4% | Improvement begins |
| qwen2.5-3b-general-forged | 3.1B | +0.4% | Marginal on generic text |

Larger models harbor more redundancy, which leaves more room for plasticity to improve them. Domain-specific training (code) amplifies the effect dramatically compared with generic text.

## Run on MacBook (2 Commands)

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Sonnet 4.6-level model, 15GB, runs on any 32GB Mac
model, tokenizer = load("continuum-ai/qwen3.5-27b-code-forged-mlx-4bit")
print(generate(model, tokenizer, prompt="def merge_sort(arr):", max_tokens=200))
```

Works on MacBook Pro, MacBook Air (16GB+ for smaller models), Mac Mini, iMac. ~9 tok/s on M1 32GB. Faster on M2/M3/M4.

## Forge Your Own

Three commands. Any NVIDIA GPU with 8GB+ VRAM.

```bash
git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
source .venv/bin/activate
python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code
```

Auto-detects GPU, picks memory tier (fp16 / 4-bit), trains with LoRA + AMP, prunes, defrags, saves. Observable progress via status.json. Works on RTX 3090, 4090, 5090.

## The Science

### Experiential Plasticity

Not compression. Architectural optimization. The model's structure co-evolves with its training:

  1. Train on domain data (LoRA + AMP mixed precision)
  2. Measure each attention head's information contribution
  3. Prune heads that don't contribute to the domain
  4. Retrain β€” surviving heads specialize and compensate
  5. Defrag β€” structurally remove dead heads, free VRAM
  6. Repeat β€” each cycle, the model improves
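Steps 2 and 3 above can be sketched in a few lines. This is a toy illustration, not the sentinel-ai implementation: the importance score here is simply mean absolute activation per head, and the `keep_ratio` threshold is an arbitrary stand-in for the controller's decision.

```python
import numpy as np

def head_importance(activations: np.ndarray) -> np.ndarray:
    """Score each head by mean |activation| over tokens.
    activations shape: (n_heads, n_tokens, head_dim)."""
    return np.abs(activations).mean(axis=(1, 2))

def prune_mask(scores: np.ndarray, keep_ratio: float = 0.75) -> np.ndarray:
    """Boolean mask keeping the top keep_ratio fraction of heads."""
    k = max(1, int(round(keep_ratio * len(scores))))
    keep = np.argsort(scores)[::-1][:k]
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep] = True
    return mask

rng = np.random.default_rng(0)
# Synthetic activations: head h scaled by (h+1), so later heads "contribute" more
acts = rng.normal(size=(8, 16, 64)) * np.arange(1, 9)[:, None, None]
scores = head_importance(acts)
mask = prune_mask(scores, keep_ratio=0.5)
print(mask.sum())  # 4 heads survive out of 8
```

In the real cycle the surviving heads are then retrained so they specialize and compensate before the next pruning pass.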

### Transfer Function

Recovery from pruning follows a measurable exponential: `1.45 * exp(-0.18 * cycle) - 0.03`. This connects transformer optimization to classical control theory β€” the same math used in electrical engineering and robotics for decades.
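Evaluating that fitted curve over a few cycles makes the decay concrete (the function name and loop are illustrative, not from the sentinel-ai codebase):

```python
import math

def recovery(cycle: int) -> float:
    """Fitted recovery curve from the text: 1.45*exp(-0.18*cycle) - 0.03."""
    return 1.45 * math.exp(-0.18 * cycle) - 0.03

for c in range(1, 6):
    print(c, round(recovery(c), 3))
```

The curve is monotonically decaying toward its -0.03 asymptote, so each successive pruning cycle needs less recovery than the last.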

### Continuous Defrag

Traditional pruning masks heads but keeps tensor sizes unchanged. Continuous defrag slices the actual Q/K/V/O weight matrices β€” the model gets physically smaller between cycles:

```text
Cycle 1: 27B params, 17.9GB          -> prune -> defrag -> freed 1.7GB
Cycle 2: 24.5B, 16.2GB, batch=2      -> prune -> defrag -> freed 1.7GB (2x faster)
Cycle 3: 22B,   14.5GB, batch=3                                        (2.8x faster)
```

40% faster total training. 33% smaller final model.
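The slicing idea can be shown with numpy: instead of zero-masking pruned heads, physically drop their rows and columns from the projection matrices, so the tensors themselves shrink. Shapes, names, and the `keep_heads` list are illustrative assumptions, not the actual defrag code.

```python
import numpy as np

def defrag_qkvo(W_q, W_k, W_v, W_o, keep_heads, head_dim):
    """Physically remove pruned heads from the Q/K/V/O projections.
    W_q/W_k/W_v shape: (n_heads*head_dim, d_model); W_o: (d_model, n_heads*head_dim)."""
    rows = np.concatenate([np.arange(h * head_dim, (h + 1) * head_dim)
                           for h in keep_heads])
    return W_q[rows], W_k[rows], W_v[rows], W_o[:, rows]

d_model, n_heads, head_dim = 64, 8, 8
rng = np.random.default_rng(1)
Wq = rng.normal(size=(n_heads * head_dim, d_model))
Wk = rng.normal(size=(n_heads * head_dim, d_model))
Wv = rng.normal(size=(n_heads * head_dim, d_model))
Wo = rng.normal(size=(d_model, n_heads * head_dim))

# Keep 3 of 8 heads: the weight matrices get physically smaller
Wq2, Wk2, Wv2, Wo2 = defrag_qkvo(Wq, Wk, Wv, Wo, keep_heads=[0, 2, 5], head_dim=head_dim)
print(Wq.shape, "->", Wq2.shape)  # (64, 64) -> (24, 64)
```

Because the tensors are smaller after each cycle, the freed VRAM can go to a larger batch size, which is where the training speedup comes from.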

### Head Mitosis

Pruning frees slots. Mitosis fills them. When a head is overutilized (high information contribution), it gets cloned into a pruned slot β€” each copy initialized at 50% gate value to maintain output continuity. After continued training, the clones diverge and specialize, just like cell differentiation after biological mitosis.

Experimentally: a cloned head diverged within 500 steps, with the clone achieving higher utilization than the parent in its new role. The model grows new specialized capacity exactly where it's needed.
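The 50% gate trick preserves output continuity at the moment of cloning, which a small numpy check makes explicit (a minimal sketch with a made-up gated-head function, not the actual mitosis code):

```python
import numpy as np

def head_out(W, x, gate):
    """A single gated head's contribution: gate * (W @ x)."""
    return gate * (W @ x)

rng = np.random.default_rng(2)
W_parent = rng.normal(size=(8, 16))
x = rng.normal(size=16)

# Before mitosis: one head at full gate
before = head_out(W_parent, x, gate=1.0)

# After mitosis: parent + identical clone, each at 50% gate
W_clone = W_parent.copy()
after = head_out(W_parent, x, 0.5) + head_out(W_clone, x, 0.5)

print(np.allclose(before, after))  # True: the model's output is unchanged
```

Continued training then breaks the symmetry: gradients differ once the gates and weights drift, so the clones diverge and specialize.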

### Self-Directed Controller

The AdaptivePlasticityController observes the model and makes all decisions β€” pruning ratio, strategy, training budget, stopping criteria. No human hyperparameters needed.

## Papers

## Links