---
title: continuum-ai
emoji: 🧬
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---

# continuum-ai

## SOTA models on your iPhone, MacBook, tiny robots, and virtually ANY GPU. No cloud required.

**经ιͺŒε―ε‘‘ζ€§** (Experiential Plasticity) β€” ζ¨‘εž‹ι€šθΏ‡η»ιͺŒε‘‘ι€ θ‡ͺθΊ«ζžΆζž„

We don't quantize. We don't distill. We **structurally reshape** the model's architecture through [Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md) β€” iterative pruning and retraining that makes models smaller AND better. Like biological synaptic pruning during brain development: the connections that fire together wire together; the rest are removed.

**The result: models that were designed for datacenters, running on your phone.**

Built on the incredible open source work of the [Qwen team](https://huggingface.co/Qwen) and the broader open model community. Open weights make this possible β€” we compress and specialize what you generously share.

| What | Proof |
|------|-------|
| **2.6GB code model for iPhone** | [qwen3.5-4b-code-forged-GGUF](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged-GGUF) β€” HumanEval: **63/85 passing (74.1%)**, 70% on hard problems, benchmark still running |
| **Sonnet 4.6-level on MacBook** | [qwen3.5-27b-code-forged-mlx-4bit](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged-mlx-4bit) β€” 15GB, 9 tok/s on M1 32GB |
| **35B MoE in 1.8GB** | [qwen3.5-35b-a3b-compacted-GGUF](https://huggingface.co/continuum-ai/qwen3.5-35b-a3b-compacted-GGUF) β€” 256 experts pruned to 16 |
| **+24% better at code** | [qwen3.5-4b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged) β€” perplexity 3.04 to 2.31 after forging |

We target every device tier. Same technique, different compaction levels. **Be competitive at ANY size.**

### Device Targets

| Device | RAM | Our Model | Size |
|--------|-----|-----------|------|
| RTX 5090 | 32GB | qwen3.5-27b-code-forged (fp16) | 17GB |
| MacBook Pro 32GB | 32GB | qwen3.5-27b-code-forged-mlx-4bit | 15GB |
| RTX 3090 | 24GB | qwen3.5-27b-code-forged (4-bit) | 17GB |
| MacBook Air 16GB | 16GB | qwen3.5-4b-code-forged Q8_0 | 4.2GB |
| iPhone 17 / Android | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| MacBook Air 8GB | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| Raspberry Pi 5 | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| **Roomba j7+** | **8GB** | **qwen3.5-4b-code-forged Q4_K_M** | **2.6GB** |

Yes, really. The iRobot Roomba j7+ has a Qualcomm QCS6490 with 8GB RAM β€” the same memory budget as an iPhone 17. Our 2.6GB Q4_K_M model fits with room to spare. Any ARM SoC with 4GB+ RAM can run these models via [llama.cpp](https://github.com/ggml-org/llama.cpp).

## Published Models

### Qwen3.5 β€” Forged (Code Domain)

| Model | Base | Domain | Improvement | Size | Runs On |
|-------|------|--------|------------|------|---------|
| **[qwen3.5-27b-code-forged-mlx-4bit](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged-mlx-4bit)** | Qwen3.5-27B | Code | +3.5% | **15GB** | **MacBook Pro 32GB (9 tok/s)** |
| [qwen3.5-27b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged) | Qwen3.5-27B | Code | +3.5% | 17GB (4-bit) | RTX 3090/4090/5090 |
| [qwen3.5-27b-code-forged-defragged](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged-defragged) | Qwen3.5-27B | Code | +3.9% | Smaller | RTX 3090/4090/5090 |
| **[qwen3.5-4b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged)** | Qwen3.5-4B | Code | **+26.6%** | 8GB | **Any GPU / MacBook** |
| **[qwen3.5-4b-code-forged-GGUF](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged-GGUF)** | Qwen3.5-4B | Code | **+26.6%** | **2.6GB Q4** | **iPhone 17, MacBook Air 8GB** |
| [qwen3.5-4b-code-forged-defragged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged-defragged) | Qwen3.5-4B | Code | +33% | Smaller | Any GPU / MacBook |

### Qwen3.5 β€” Compacted (Expert Pruning)

| Model | Original | Method | Reduction | Runs On |
|-------|----------|--------|-----------|---------|
| **[qwen3.5-35b-a3b-compacted](https://huggingface.co/continuum-ai/qwen3.5-35b-a3b-compacted)** | Qwen3.5-35B-A3B (256 experts) | Expert pruning to 16 experts | **49GB to 11GB** | RTX 3090/4090/5090 |
| [qwen3.5-35b-a3b-compacted-GGUF](https://huggingface.co/continuum-ai/qwen3.5-35b-a3b-compacted-GGUF) | Same | GGUF Q2_K/Q4_K_M | **1.8GB / 2.7GB** | iPhone / MacBook Air |

### Qwen2.5 β€” Compacted (Head + Expert Pruning)

| Model | Original | Method | Reduction |
|-------|----------|--------|-----------|
| [qwen2.5-coder-32b-compacted](https://huggingface.co/continuum-ai/qwen2.5-coder-32b-compacted) | Qwen2.5-Coder-32B | Head pruning + mixed quant | 67GB to 14GB |
| [qwen2.5-coder-14b-compacted](https://huggingface.co/continuum-ai/qwen2.5-coder-14b-compacted) | Qwen2.5-Coder-14B | Head pruning + mixed quant | 27GB to 9GB |

### Scaling Law Experiments

| Model | Params | Improvement | Notes |
|-------|--------|------------|-------|
| [qwen2.5-0.5b-general-forged](https://huggingface.co/continuum-ai/qwen2.5-0.5b-general-forged) | 0.5B | -3.2% | Too small β€” already maximally compressed |
| [qwen2.5-1.5b-general-forged](https://huggingface.co/continuum-ai/qwen2.5-1.5b-general-forged) | 1.5B | +2.4% | Improvement begins |
| [qwen2.5-3b-general-forged](https://huggingface.co/continuum-ai/qwen2.5-3b-general-forged) | 3.1B | +0.4% | Marginal on generic text |

Larger models harbor more redundancy, and therefore more room for plasticity to improve them. Domain-specific training (code) amplifies the effect dramatically compared with generic text.

## Run on MacBook (2 Commands)

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Sonnet 4.6-level model, 15GB, runs on any 32GB Mac
model, tokenizer = load("continuum-ai/qwen3.5-27b-code-forged-mlx-4bit")
print(generate(model, tokenizer, prompt="def merge_sort(arr):", max_tokens=200))
```

Works on MacBook Pro, MacBook Air (16GB+ for smaller models), Mac Mini, iMac. ~9 tok/s on M1 32GB. Faster on M2/M3/M4.

## Forge Your Own

Three commands. Any NVIDIA GPU with 8GB+ VRAM.

```bash
git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
source .venv/bin/activate
python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code
```

Auto-detects GPU, picks memory tier (fp16 / 4-bit), trains with LoRA + AMP, prunes, defrags, saves. Observable progress via `status.json`. Works on RTX 3090, 4090, 5090.

## The Science

### Experiential Plasticity

Not compression. **Architectural optimization.** The model's structure co-evolves with its training:

1. Train on domain data (LoRA + AMP mixed precision)
2. Measure each attention head's information contribution
3. Prune heads that don't contribute to the domain
4. Retrain β€” surviving heads specialize and compensate
5. Defrag β€” structurally remove dead heads, free VRAM
6. Repeat β€” each cycle, the model improves
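The prune step above can be sketched in a few lines. This is a toy illustration, not the sentinel-ai implementation: the per-head importance scores here are hypothetical stand-ins for the information-contribution measurements the real pipeline collects during training.

```python
import numpy as np

def prune_heads(importance, prune_fraction=0.25):
    """Keep the most informative heads; return sorted indices of survivors."""
    n_heads = len(importance)
    n_keep = max(1, int(round(n_heads * (1 - prune_fraction))))
    # Rank heads by measured contribution, descending, and keep the top n_keep
    survivors = np.argsort(importance)[::-1][:n_keep]
    return np.sort(survivors)

# Toy example: 8 heads with hypothetical contribution scores
scores = np.array([0.9, 0.1, 0.7, 0.05, 0.8, 0.3, 0.6, 0.2])
print(prune_heads(scores, prune_fraction=0.5))  # prints [0 2 4 6]
```

The surviving indices then drive the retrain and defrag steps: only those heads remain trainable, and only their weight slices are kept.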

### Transfer Function

Recovery from pruning follows a measurable exponential: `1.45 * exp(-0.18 * cycle) - 0.03`. This connects transformer optimization to classical control theory β€” the same math used in electrical engineering and robotics for decades.
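The fitted curve can be evaluated directly; a minimal sketch, plugging successive cycle numbers into the stated formula:

```python
import math

def recovery(cycle):
    """Recovery after pruning, per the fitted transfer function above."""
    return 1.45 * math.exp(-0.18 * cycle) - 0.03

# The exponential decays toward the -0.03 floor: each cycle recovers
# less than the last, which is what makes the stopping point predictable.
for c in range(6):
    print(f"cycle {c}: {recovery(c):+.3f}")
```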

### Continuous Defrag

Traditional pruning masks heads but keeps tensor sizes unchanged. Continuous defrag slices the actual Q/K/V/O weight matrices β€” the model gets physically smaller between cycles:

```
Cycle 1: 27B params, 17.9GB -> prune -> defrag -> freed 1.7GB
Cycle 2: 24.5B, 16.2GB, batch=2 -> prune -> defrag -> freed 1.7GB (2x faster)
Cycle 3: 22B, 14.5GB, batch=3                                      (2.8x faster)
```

40% faster total training. 33% smaller final model.
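The slicing itself is mechanically simple. A minimal numpy sketch of defragging one fused projection matrix, assuming each head occupies a contiguous block of `head_dim` rows (the actual layout varies by implementation):

```python
import numpy as np

def defrag_proj(W, surviving_heads, head_dim):
    """Physically remove pruned heads from a (n_heads * head_dim, d_model) projection."""
    blocks = [W[h * head_dim:(h + 1) * head_dim] for h in surviving_heads]
    return np.concatenate(blocks, axis=0)

d_model, head_dim, n_heads = 64, 8, 8
W_q = np.random.randn(n_heads * head_dim, d_model)
# Keep heads 0, 2, 4, 6 β€” the tensor genuinely shrinks, unlike masking
W_small = defrag_proj(W_q, surviving_heads=[0, 2, 4, 6], head_dim=head_dim)
print(W_q.shape, "->", W_small.shape)  # prints (64, 64) -> (32, 64)
```

Because the tensor is smaller rather than masked, every subsequent forward and backward pass over it is cheaper β€” which is where the per-cycle speedups in the trace above come from.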

### Head Mitosis

Pruning frees slots. Mitosis fills them. When a head is overutilized (high information contribution), it gets cloned into a pruned slot β€” each copy initialized at 50% gate value to maintain output continuity. After continued training, the clones **diverge and specialize**, just like cell differentiation after biological mitosis.

Experimentally: a cloned head diverged within 500 steps, with the clone achieving *higher* utilization than the parent in its new role. The model grows new specialized capacity exactly where it's needed.
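Why the 50% gate split preserves output at clone time: if heads combine additively through scalar gates (a simplification of the real gated-attention setup), two identical copies at half gate sum to the original contribution exactly. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
head_out = rng.standard_normal(16)  # output of the overutilized head on some input

# Before mitosis: one head at gate 1.0
before = 1.0 * head_out
# After mitosis: parent and clone, identical weights, each at gate 0.5
after = 0.5 * head_out + 0.5 * head_out

# Output continuity holds at the moment of cloning; the clones only
# diverge once continued training pushes their weights apart.
assert np.allclose(before, after)
```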

### Self-Directed Controller

The `AdaptivePlasticityController` observes the model and makes all decisions β€” pruning ratio, strategy, training budget, stopping criteria. No human hyperparameters needed.

## Papers

- **[Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md)** β€” Scaling law, transfer function, self-directed controller, domain forging, continuous defrag
- **[Neural Plasticity in Transformers](https://github.com/CambrianTech/continuum/blob/main/docs/papers/SENTINEL-AI-NEURAL-PLASTICITY.md)** β€” Foundation paper with cross-architecture results
- **[Plasticity Compaction](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION-MOE.md)** β€” MoE expert pruning (67GB to 14GB)

## Links

- [sentinel-ai](https://github.com/CambrianTech/sentinel-ai) β€” Open source forge framework (MIT)
- [continuum](https://github.com/CambrianTech/continuum) β€” Distributed AI on consumer hardware
- [@cambrian](https://x.com/joelteply) β€” Updates and demos