---
title: continuum-ai
emoji: 🧬
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---
# continuum-ai
## SOTA models on your iPhone, MacBook, tiny robots, and virtually ANY GPU. No cloud required.
**Experiential Plasticity** — the model reshapes its own architecture through experience.
We don't quantize. We don't distill. We **structurally reshape** the model's architecture through [Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md) — iterative pruning and retraining that makes models smaller AND better. Like biological synaptic pruning during brain development: the connections that fire together wire together; the rest are removed.
**The result: models that were designed for datacenters, running on your phone.**
Built on the incredible open source work of the [Qwen team](https://huggingface.co/Qwen) and the broader open model community. Open weights make this possible — we compress and specialize what you generously share.
| What | Proof |
|------|-------|
| **2.6GB code model for iPhone** | [qwen3.5-4b-code-forged-GGUF](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged-GGUF) — HumanEval: **63/85 passing (74.1%)**, 70% on hard problems, benchmark still running |
| **Sonnet 4.6-level on MacBook** | [qwen3.5-27b-code-forged-mlx-4bit](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged-mlx-4bit) — 15GB, 9 tok/s on M1 32GB |
| **35B MoE in 1.8GB** | [qwen3.5-35b-a3b-compacted-GGUF](https://huggingface.co/continuum-ai/qwen3.5-35b-a3b-compacted-GGUF) — 256 experts pruned to 16 |
| **+24% better at code** | [qwen3.5-4b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged) — perplexity 3.04 to 2.31 after forging |
We target every device tier. Same technique, different compaction levels. **Be competitive at ANY size.**
### Device Targets
| Device | RAM | Our Model | Size |
|--------|-----|-----------|------|
| RTX 5090 | 32GB | qwen3.5-27b-code-forged (fp16) | 17GB |
| MacBook Pro 32GB | 32GB | qwen3.5-27b-code-forged-mlx-4bit | 15GB |
| RTX 3090 | 24GB | qwen3.5-27b-code-forged (4-bit) | 17GB |
| MacBook Air 16GB | 16GB | qwen3.5-4b-code-forged Q8_0 | 4.2GB |
| iPhone 17 / Android | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| MacBook Air 8GB | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| Raspberry Pi 5 | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| **Roomba j7+** | **8GB** | **qwen3.5-4b-code-forged Q4_K_M** | **2.6GB** |
Yes, really. The iRobot Roomba j7+ has a Qualcomm QCS6490 with 8GB RAM — the same memory budget as an iPhone 17. Our 2.6GB Q4_K_M model fits with room to spare. Any ARM SoC with 4GB+ RAM can run these models via [llama.cpp](https://github.com/ggml-org/llama.cpp).
## Published Models
### Qwen3.5 β Forged (Code Domain)
| Model | Base | Domain | Improvement | Size | Runs On |
|-------|------|--------|------------|------|---------|
| **[qwen3.5-27b-code-forged-mlx-4bit](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged-mlx-4bit)** | Qwen3.5-27B | Code | +3.5% | **15GB** | **MacBook Pro 32GB (9 tok/s)** |
| [qwen3.5-27b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged) | Qwen3.5-27B | Code | +3.5% | 17GB (4-bit) | RTX 3090/4090/5090 |
| [qwen3.5-27b-code-forged-defragged](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged-defragged) | Qwen3.5-27B | Code | +3.9% | Smaller | RTX 3090/4090/5090 |
| **[qwen3.5-4b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged)** | Qwen3.5-4B | Code | **+26.6%** | 8GB | **Any GPU / MacBook** |
| **[qwen3.5-4b-code-forged-GGUF](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged-GGUF)** | Qwen3.5-4B | Code | **+26.6%** | **2.6GB Q4** | **iPhone 17, MacBook Air 8GB** |
| [qwen3.5-4b-code-forged-defragged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged-defragged) | Qwen3.5-4B | Code | +33% | Smaller | Any GPU / MacBook |
### Qwen3.5 β Compacted (Expert Pruning)
| Model | Original | Method | Reduction | Runs On |
|-------|----------|--------|-----------|---------|
| **[qwen3.5-35b-a3b-compacted](https://huggingface.co/continuum-ai/qwen3.5-35b-a3b-compacted)** | Qwen3.5-35B-A3B (256 experts) | Expert pruning to 16 experts | **49GB to 11GB** | RTX 3090/4090/5090 |
| [qwen3.5-35b-a3b-compacted-GGUF](https://huggingface.co/continuum-ai/qwen3.5-35b-a3b-compacted-GGUF) | Same | GGUF Q2_K/Q4_K_M | **1.8GB / 2.7GB** | iPhone / MacBook Air |
### Qwen2.5 β Compacted (Head + Expert Pruning)
| Model | Original | Method | Reduction |
|-------|----------|--------|-----------|
| [qwen2.5-coder-32b-compacted](https://huggingface.co/continuum-ai/qwen2.5-coder-32b-compacted) | Qwen2.5-Coder-32B | Head pruning + mixed quant | 67GB to 14GB |
| [qwen2.5-coder-14b-compacted](https://huggingface.co/continuum-ai/qwen2.5-coder-14b-compacted) | Qwen2.5-Coder-14B | Head pruning + mixed quant | 27GB to 9GB |
### Scaling Law Experiments
| Model | Params | Improvement | Notes |
|-------|--------|------------|-------|
| [qwen2.5-0.5b-general-forged](https://huggingface.co/continuum-ai/qwen2.5-0.5b-general-forged) | 0.5B | -3.2% | Too small — already maximally compressed |
| [qwen2.5-1.5b-general-forged](https://huggingface.co/continuum-ai/qwen2.5-1.5b-general-forged) | 1.5B | +2.4% | Improvement begins |
| [qwen2.5-3b-general-forged](https://huggingface.co/continuum-ai/qwen2.5-3b-general-forged) | 3.1B | +0.4% | Marginal on generic text |
Larger models harbor more redundancy, and thus more room for plasticity to improve them. Domain-specific training (code) amplifies the effect dramatically compared with generic text.
## Run on MacBook (2 Commands)
```bash
pip install mlx-lm
```
```python
from mlx_lm import load, generate
# Sonnet 4.6-level model, 15GB, runs on any 32GB Mac
model, tokenizer = load("continuum-ai/qwen3.5-27b-code-forged-mlx-4bit")
print(generate(model, tokenizer, prompt="def merge_sort(arr):", max_tokens=200))
```
Works on MacBook Pro, MacBook Air (16GB+ for smaller models), Mac Mini, iMac. ~9 tok/s on M1 32GB. Faster on M2/M3/M4.
## Forge Your Own
Three commands. Any NVIDIA GPU with 8GB+ VRAM.
```bash
git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
source .venv/bin/activate
python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code
```
Auto-detects GPU, picks memory tier (fp16 / 4-bit), trains with LoRA + AMP, prunes, defrags, saves. Observable progress via `status.json`. Works on RTX 3090, 4090, 5090.
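The forge writes its progress to `status.json`, but the file's schema is not documented here, so a watcher shouldn't assume specific fields. A minimal, schema-agnostic reader:

```python
# Minimal watcher for forge progress: forge_model.py writes status.json as it
# runs. We don't assume any particular schema, so this reads the raw dict.
import json
import pathlib

def read_status(path="status.json"):
    """Return the forge's current status dict, or None if not written yet."""
    p = pathlib.Path(path)
    if not p.exists():
        return None
    return json.loads(p.read_text())

# Poll it however you like, e.g.:
#   while True: print(read_status()); time.sleep(5)
```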
## The Science
### Experiential Plasticity
Not compression. **Architectural optimization.** The model's structure co-evolves with its training:
1. Train on domain data (LoRA + AMP mixed precision)
2. Measure each attention head's information contribution
3. Prune heads that don't contribute to the domain
4. Retrain — surviving heads specialize and compensate
5. Defrag — structurally remove dead heads, free VRAM
6. Repeat — each cycle, the model improves
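The measure-prune-defrag core of the cycle can be sketched in a few lines of numpy. This is a toy stand-in, not the sentinel-ai implementation: random activations replace real training, and mean absolute output per head is an assumed proxy for "information contribution".

```python
# Toy sketch of steps 2, 3, and 5 of the plasticity cycle above.
import numpy as np

rng = np.random.default_rng(0)
n_heads, seq, head_dim = 8, 32, 16

# Per-head output activations for a batch (stand-in for a trained model's).
head_out = rng.normal(size=(n_heads, seq, head_dim))

# Step 2: measure each head's contribution (toy proxy: mean |activation|).
contribution = np.abs(head_out).mean(axis=(1, 2))

# Step 3: prune the weakest 25% of heads.
keep = np.argsort(contribution)[n_heads // 4:]

# Step 5: defrag -- physically slice the tensor so pruned heads free memory.
head_out = head_out[np.sort(keep)]

print(f"kept {head_out.shape[0]}/{n_heads} heads")
```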
### Transfer Function
Recovery from pruning follows a measurable exponential: `1.45 * exp(-0.18 * cycle) - 0.03`. This connects transformer optimization to classical control theory — the same math used in electrical engineering and robotics for decades.
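Evaluating the curve per cycle shows why the controller eventually stops: recovery decays exponentially and crosses zero once pruning no longer pays for itself.

```python
# The transfer function above, evaluated per cycle.
import math

def recovery(cycle: int) -> float:
    return 1.45 * math.exp(-0.18 * cycle) - 0.03

for c in range(1, 6):
    print(f"cycle {c}: {recovery(c):+.3f}")
# The -0.03 offset means recovery goes negative after roughly 21 cycles.
```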
### Continuous Defrag
Traditional pruning masks heads but keeps tensor sizes unchanged. Continuous defrag slices the actual Q/K/V/O weight matrices — the model gets physically smaller between cycles:
```
Cycle 1: 27B params, 17.9GB -> prune -> defrag -> freed 1.7GB
Cycle 2: 24.5B, 16.2GB, batch=2 -> prune -> defrag -> freed 1.7GB (2x faster)
Cycle 3: 22B, 14.5GB, batch=3 (2.8x faster)
```
40% faster total training. 33% smaller final model.
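The slicing itself can be sketched with toy numpy matrices (shapes and the pruned-head choice here are illustrative, not taken from a real run):

```python
# Defrag sketch: instead of zero-masking a pruned head, slice its columns out
# of the Q/K/V projections and the matching rows out of O, so the tensors
# (and the VRAM behind them) actually shrink.
import numpy as np

d_model, n_heads = 64, 8
head_dim = d_model // n_heads

rng = np.random.default_rng(0)
W_q = rng.normal(size=(d_model, d_model))   # same treatment for K and V
W_o = rng.normal(size=(d_model, d_model))

pruned = {2, 5}                              # heads flagged by the contribution metric
keep = [h for h in range(n_heads) if h not in pruned]
cols = np.concatenate([np.arange(h * head_dim, (h + 1) * head_dim) for h in keep])

W_q_small = W_q[:, cols]     # Q/K/V lose the pruned heads' output columns
W_o_small = W_o[cols, :]     # O loses the matching input rows

print(W_q.shape, "->", W_q_small.shape)   # (64, 64) -> (64, 48)
```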
### Head Mitosis
Pruning frees slots. Mitosis fills them. When a head is overutilized (high information contribution), it gets cloned into a pruned slot — each copy initialized at 50% gate value to maintain output continuity. After continued training, the clones **diverge and specialize**, just like cell differentiation after biological mitosis.
Experimentally: a cloned head diverged within 500 steps, with the clone achieving *higher* utilization than the parent in its new role. The model grows new specialized capacity exactly where it's needed.
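The continuity argument behind the 50% gates can be checked directly: at the instant of the split, the two half-gated copies sum to exactly the parent's old output. A toy numpy sketch (per-head output gates and shapes are illustrative):

```python
# Head mitosis sketch: clone an overutilized head into a pruned slot with
# both gates at 0.5, so the summed output is unchanged at the split.
import numpy as np

n_heads, head_dim = 4, 8
rng = np.random.default_rng(0)

W_v = rng.normal(size=(n_heads, head_dim, head_dim))
gates = np.ones(n_heads)           # per-head output scaling
gates[3] = 0.0                     # head 3 was pruned -> free slot

parent, slot = 1, 3                # head 1 is overutilized
W_v[slot] = W_v[parent].copy()     # clone weights into the free slot
gates[parent] = gates[slot] = 0.5  # 50% each

x = rng.normal(size=head_dim)
before = W_v[parent] @ x           # parent alone at its old gate of 1.0
after = gates[parent] * (W_v[parent] @ x) + gates[slot] * (W_v[slot] @ x)
assert np.allclose(before, after)  # output continuity holds
```

Continued training then breaks the symmetry: gradients differ between the two slots, so the copies diverge and specialize.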
### Self-Directed Controller
The `AdaptivePlasticityController` observes the model and makes all decisions β pruning ratio, strategy, training budget, stopping criteria. No human hyperparameters needed.
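The controller's real decision logic lives in sentinel-ai and is not shown here. As a purely hypothetical illustration of the observe-decide loop described above, one plausible policy maps observed recovery to the next pruning ratio and a stop signal:

```python
# Hypothetical controller policy (NOT the AdaptivePlasticityController API):
# prune more aggressively while recovery gains hold, stop when they vanish.
def decide(recovery_gain: float, prev_ratio: float = 0.25):
    """Return (next pruning ratio, stop?) from the last cycle's recovery."""
    if recovery_gain <= 0.0:
        return 0.0, True                                   # stopping criterion
    ratio = min(0.5, prev_ratio * (1.0 + recovery_gain))   # capped ramp-up
    return ratio, False
```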
## Papers
- **[Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md)** — Scaling law, transfer function, self-directed controller, domain forging, continuous defrag
- **[Neural Plasticity in Transformers](https://github.com/CambrianTech/continuum/blob/main/docs/papers/SENTINEL-AI-NEURAL-PLASTICITY.md)** — Foundation paper with cross-architecture results
- **[Plasticity Compaction](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION-MOE.md)** — MoE expert pruning (67GB to 14GB)
## Links
- [sentinel-ai](https://github.com/CambrianTech/sentinel-ai) — Open source forge framework (MIT)
- [continuum](https://github.com/CambrianTech/continuum) — Distributed AI on consumer hardware
- [@cambrian](https://x.com/joelteply) — Updates and demos