---
title: continuum-ai
emoji: 🧬
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---
# continuum-ai
## SOTA models on your iPhone, MacBook, tiny robots, and virtually ANY GPU. No cloud required.
**经ιͺŒε―ε‘‘ζ€§** (Experiential Plasticity) β€” ζ¨‘εž‹ι€šθΏ‡η»ιͺŒε‘‘ι€ θ‡ͺθΊ«ζžΆζž„
We don't quantize. We don't distill. We **structurally reshape** the model's architecture through [Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md) β€” iterative pruning and retraining that makes models smaller AND better. Like biological synaptic pruning during brain development: the connections that fire together wire together, the rest are removed.
**The result: models that were designed for datacenters, running on your phone.**
Built on the incredible open source work of the [Qwen team](https://huggingface.co/Qwen) and the broader open model community. Open weights make this possible β€” we compress and specialize what you generously share.
| What | Proof |
|------|-------|
| **2.6GB code model for iPhone** | [qwen3.5-4b-code-forged-GGUF](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged-GGUF) β€” HumanEval: **63/85 passing (74.1%)**, 70% on hard problems, benchmark still running |
| **Sonnet 4.6-level on MacBook** | [qwen3.5-27b-code-forged-mlx-4bit](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged-mlx-4bit) β€” 15GB, 9 tok/s on M1 32GB |
| **35B MoE in 1.8GB** | [qwen3.5-35b-a3b-compacted-GGUF](https://huggingface.co/continuum-ai/qwen3.5-35b-a3b-compacted-GGUF) β€” 256 experts pruned to 16 |
| **+24% better at code** | [qwen3.5-4b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged) β€” perplexity 3.04 to 2.31 after forging |
We target every device tier. Same technique, different compaction levels. **Be competitive at ANY size.**
### Device Targets
| Device | RAM | Our Model | Size |
|--------|-----|-----------|------|
| RTX 5090 | 32GB | qwen3.5-27b-code-forged (4-bit) | 17GB |
| MacBook Pro 32GB | 32GB | qwen3.5-27b-code-forged-mlx-4bit | 15GB |
| RTX 3090 | 24GB | qwen3.5-27b-code-forged (4-bit) | 17GB |
| MacBook Air 16GB | 16GB | qwen3.5-4b-code-forged Q8_0 | 4.2GB |
| iPhone 17 / Android | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| MacBook Air 8GB | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| Raspberry Pi 5 | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| **Roomba j7+** | **8GB** | **qwen3.5-4b-code-forged Q4_K_M** | **2.6GB** |
Yes, really. The iRobot Roomba j7+ has a Qualcomm QCS6490 with 8GB RAM β€” the same memory budget as an iPhone 17. Our 2.6GB Q4_K_M model fits with room to spare. Any ARM SoC with 4GB+ RAM can run these models via [llama.cpp](https://github.com/ggml-org/llama.cpp).
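If you'd rather script one of the GGUF builds than shell out to llama.cpp directly, the `llama-cpp-python` bindings run anywhere llama.cpp does. A minimal sketch; the Q4_K_M filename glob is an assumption, so check the repo's file list and adjust:

```python
# pip install llama-cpp-python huggingface_hub
from llama_cpp import Llama

# Pull the quantized file straight from the Hub. The glob below assumes the
# repo ships a single Q4_K_M .gguf -- verify against the repo's file list.
llm = Llama.from_pretrained(
    repo_id="continuum-ai/qwen3.5-4b-code-forged-GGUF",
    filename="*Q4_K_M.gguf",
    n_ctx=4096,  # context window; raise it if RAM allows
)

out = llm("def merge_sort(arr):", max_tokens=200)
print(out["choices"][0]["text"])
```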
## Published Models
### Qwen3.5 β€” Forged (Code Domain)
| Model | Base | Domain | Improvement | Size | Runs On |
|-------|------|--------|------------|------|---------|
| **[qwen3.5-27b-code-forged-mlx-4bit](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged-mlx-4bit)** | Qwen3.5-27B | Code | +3.5% | **15GB** | **MacBook Pro 32GB (9 tok/s)** |
| [qwen3.5-27b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged) | Qwen3.5-27B | Code | +3.5% | 17GB (4-bit) | RTX 3090/4090/5090 |
| [qwen3.5-27b-code-forged-defragged](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged-defragged) | Qwen3.5-27B | Code | +3.9% | Smaller | RTX 3090/4090/5090 |
| **[qwen3.5-4b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged)** | Qwen3.5-4B | Code | **+26.6%** | 8GB | **Any GPU / MacBook** |
| **[qwen3.5-4b-code-forged-GGUF](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged-GGUF)** | Qwen3.5-4B | Code | **+26.6%** | **2.6GB Q4** | **iPhone 17, MacBook Air 8GB** |
| [qwen3.5-4b-code-forged-defragged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged-defragged) | Qwen3.5-4B | Code | +33% | Smaller | Any GPU / MacBook |
### Qwen3.5 β€” Compacted (Expert Pruning)
| Model | Original | Method | Reduction | Runs On |
|-------|----------|--------|-----------|---------|
| **[qwen3.5-35b-a3b-compacted](https://huggingface.co/continuum-ai/qwen3.5-35b-a3b-compacted)** | Qwen3.5-35B-A3B (256 experts) | Expert pruning to 16 experts | **49GB to 11GB** | RTX 3090/4090/5090 |
| [qwen3.5-35b-a3b-compacted-GGUF](https://huggingface.co/continuum-ai/qwen3.5-35b-a3b-compacted-GGUF) | Same | GGUF Q2_K/Q4_K_M | **1.8GB / 2.7GB** | iPhone / MacBook Air |
### Qwen2.5 β€” Compacted (Head + Expert Pruning)
| Model | Original | Method | Reduction |
|-------|----------|--------|-----------|
| [qwen2.5-coder-32b-compacted](https://huggingface.co/continuum-ai/qwen2.5-coder-32b-compacted) | Qwen2.5-Coder-32B | Head pruning + mixed quant | 67GB to 14GB |
| [qwen2.5-coder-14b-compacted](https://huggingface.co/continuum-ai/qwen2.5-coder-14b-compacted) | Qwen2.5-Coder-14B | Head pruning + mixed quant | 27GB to 9GB |
### Scaling Law Experiments
| Model | Params | Improvement | Notes |
|-------|--------|------------|-------|
| [qwen2.5-0.5b-general-forged](https://huggingface.co/continuum-ai/qwen2.5-0.5b-general-forged) | 0.5B | -3.2% | Too small β€” already maximally compressed |
| [qwen2.5-1.5b-general-forged](https://huggingface.co/continuum-ai/qwen2.5-1.5b-general-forged) | 1.5B | +2.4% | Improvement begins |
| [qwen2.5-3b-general-forged](https://huggingface.co/continuum-ai/qwen2.5-3b-general-forged) | 3.1B | +0.4% | Marginal on generic text |
Larger models harbor more redundancy, which gives plasticity more room to improve them. Domain-specific training (code) amplifies the effect dramatically compared with generic text.
## Run on MacBook (2 Commands)
```bash
pip install mlx-lm
```
```python
from mlx_lm import load, generate
# Sonnet 4.6-level model, 15GB, runs on any 32GB Mac
model, tokenizer = load("continuum-ai/qwen3.5-27b-code-forged-mlx-4bit")
print(generate(model, tokenizer, prompt="def merge_sort(arr):", max_tokens=200))
```
Works on MacBook Pro, MacBook Air (16GB+ for smaller models), Mac Mini, iMac. ~9 tok/s on M1 32GB. Faster on M2/M3/M4.
## Forge Your Own
Three commands. Any NVIDIA GPU with 8GB+ VRAM.
```bash
git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
source .venv/bin/activate
python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code
```
Auto-detects GPU, picks memory tier (fp16 / 4-bit), trains with LoRA + AMP, prunes, defrags, saves. Observable progress via `status.json`. Works on RTX 3090, 4090, 5090.
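If you want to watch a run, a minimal watcher for `status.json` might look like the sketch below. The file's location and fields aren't documented here, so treat both as assumptions and inspect one run's output first:

```python
import json
import time
from pathlib import Path

# Poll the forge's status.json for progress. Path and schema are assumptions;
# adjust to wherever your forge run writes its status.
status_file = Path("status.json")

while True:
    if status_file.exists():
        status = json.loads(status_file.read_text())
        print(status)  # e.g. current cycle, step, loss -- whatever the forge records
    time.sleep(30)
```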
## The Science
### Experiential Plasticity
Not compression. **Architectural optimization.** The model's structure co-evolves with its training:
1. Train on domain data (LoRA + AMP mixed precision)
2. Measure each attention head's information contribution
3. Prune heads that don't contribute to the domain
4. Retrain β€” surviving heads specialize and compensate
5. Defrag β€” structurally remove dead heads, free VRAM
6. Repeat β€” each cycle, the model improves
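The shape of that loop, as a minimal sketch. The helper callables are assumptions standing in for sentinel-ai's internals, not its actual API:

```python
from typing import Callable

import torch
import torch.nn as nn

def forge(
    model: nn.Module,
    train_epoch: Callable[[nn.Module], None],                # LoRA + AMP domain training
    head_contribution: Callable[[nn.Module], torch.Tensor],  # score per attention head
    prune_heads: Callable[[nn.Module, torch.Tensor], None],  # gate off the low scorers
    defrag: Callable[[nn.Module], nn.Module],                # slice Q/K/V/O, drop dead heads
    cycles: int = 3,
    prune_ratio: float = 0.1,
) -> nn.Module:
    for cycle in range(cycles):
        train_epoch(model)                    # 1. train on domain data
        scores = head_contribution(model)     # 2. measure information contribution
        k = int(prune_ratio * scores.numel())
        weak = scores.topk(k, largest=False).indices
        prune_heads(model, weak)              # 3. prune heads that don't contribute
        train_epoch(model)                    # 4. retrain; survivors specialize
        model = defrag(model)                 # 5. physically shrink the tensors
    return model                              # 6. each cycle, smaller and better
```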
### Transfer Function
Recovery from pruning follows a measurable exponential: `1.45 * exp(-0.18 * cycle) - 0.03`. This connects transformer optimization to classical control theory β€” the same math used in electrical engineering and robotics for decades.
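Taken at face value, the transfer function is easy to evaluate per cycle. A quick sketch of what it predicts:

```python
import math

# Recovery per pruning cycle from the stated transfer function:
# r(c) = 1.45 * exp(-0.18 * c) - 0.03
def recovery(cycle: int) -> float:
    return 1.45 * math.exp(-0.18 * cycle) - 0.03

for c in range(1, 9):
    print(f"cycle {c}: {recovery(c):+.3f}")

# The exponential decays toward the -0.03 floor: early cycles recover the
# most, and past roughly cycle 21 the predicted recovery turns negative,
# which suggests a natural stopping criterion.
```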
### Continuous Defrag
Traditional pruning masks heads but keeps tensor sizes unchanged. Continuous defrag slices the actual Q/K/V/O weight matrices β€” the model gets physically smaller between cycles:
```
Cycle 1: 27B params, 17.9GB -> prune -> defrag -> freed 1.7GB
Cycle 2: 24.5B, 16.2GB, batch=2 -> prune -> defrag -> freed 1.7GB (2x faster)
Cycle 3: 22B, 14.5GB, batch=3 (2.8x faster)
```
40% faster total training. 33% smaller final model.
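To make the slicing concrete, here is an illustrative defrag step for a plain multi-head attention layer, not sentinel-ai's actual code. Real checkpoints lay heads out differently, and grouped-query attention needs extra bookkeeping:

```python
import torch
import torch.nn as nn

# Physically slice pruned heads out of Q/K/V/O so the tensors shrink.
# Assumes heads occupy contiguous row blocks of size head_dim.
def defrag_attention(q: nn.Linear, k: nn.Linear, v: nn.Linear, o: nn.Linear,
                     keep_heads: torch.Tensor, head_dim: int) -> tuple[nn.Linear, ...]:
    # Row indices covering every kept head's slice of the projection output.
    rows = torch.cat([torch.arange(h * head_dim, (h + 1) * head_dim)
                      for h in keep_heads.tolist()])

    def slice_out(lin: nn.Linear) -> nn.Linear:
        new = nn.Linear(lin.in_features, rows.numel(), bias=lin.bias is not None)
        new.weight.data = lin.weight.data[rows].clone()
        if lin.bias is not None:
            new.bias.data = lin.bias.data[rows].clone()
        return new

    # Q/K/V lose output rows; the output projection loses the matching input columns.
    new_o = nn.Linear(rows.numel(), o.out_features, bias=o.bias is not None)
    new_o.weight.data = o.weight.data[:, rows].clone()
    if o.bias is not None:
        new_o.bias.data = o.bias.data.clone()
    return slice_out(q), slice_out(k), slice_out(v), new_o
```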
### Head Mitosis
Pruning frees slots. Mitosis fills them. When a head is overutilized (high information contribution), it gets cloned into a pruned slot β€” each copy initialized at 50% gate value to maintain output continuity. After continued training, the clones **diverge and specialize**, just like cell differentiation after biological mitosis.
Experimentally: a cloned head diverged within 500 steps, with the clone achieving *higher* utilization than the parent in its new role. The model grows new specialized capacity exactly where it's needed.
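A minimal sketch of the split itself, assuming gating is a per-head scalar multiplier on each head's output (an assumption about the implementation):

```python
import torch

@torch.no_grad()
def mitose(gates: torch.Tensor, head_weights: torch.Tensor,
           parent: int, free_slot: int) -> None:
    # Clone the overutilized head's weights into the freed slot.
    head_weights[free_slot] = head_weights[parent].clone()
    # Split the parent's gate: parent + clone together produce exactly
    # the output the parent produced alone, so the forward pass is
    # unchanged at the moment of the split. Continued training then
    # lets the two copies diverge and specialize.
    half = gates[parent].item() / 2.0
    gates[parent] = half
    gates[free_slot] = half
```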
### Self-Directed Controller
The `AdaptivePlasticityController` observes the model and makes every decision β€” pruning ratio, strategy, training budget, stopping criteria. No hand-tuned hyperparameters required.
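For intuition only, here is a hypothetical decision rule of the kind such a controller might apply. None of these names or thresholds come from sentinel-ai:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    prune_ratio: float
    train_steps: int
    stop: bool

def decide(ppl_before: float, ppl_after: float, cycle: int) -> Decision:
    recovered = ppl_after <= ppl_before          # did retraining recover the damage?
    return Decision(
        prune_ratio=0.1 if recovered else 0.05,  # back off when recovery stalls
        train_steps=500 if recovered else 1000,  # spend more budget when struggling
        stop=(not recovered and cycle > 2),      # stop once pruning stops paying off
    )
```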
## Papers
- **[Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md)** β€” Scaling law, transfer function, self-directed controller, domain forging, continuous defrag
- **[Neural Plasticity in Transformers](https://github.com/CambrianTech/continuum/blob/main/docs/papers/SENTINEL-AI-NEURAL-PLASTICITY.md)** β€” Foundation paper with cross-architecture results
- **[Plasticity Compaction](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION-MOE.md)** β€” MoE expert pruning (67GB to 14GB)
## Links
- [sentinel-ai](https://github.com/CambrianTech/sentinel-ai) β€” Open source forge framework (MIT)
- [continuum](https://github.com/CambrianTech/continuum) β€” Distributed AI on consumer hardware
- [@cambrian](https://x.com/joelteply) β€” Updates and demos