---
title: continuum-ai
emoji: 🧬
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---
# continuum-ai

## SOTA models on your iPhone, MacBook, tiny robots, and virtually ANY GPU. No cloud required.

**Experiential Plasticity** — the model shapes its own architecture through experience.

We don't quantize. We don't distill. We **structurally reshape** the model's architecture through [Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md) — iterative pruning and retraining that makes models smaller AND better. Like biological synaptic pruning during brain development: the connections that fire together wire together; the rest are removed.

**The result: models that were designed for datacenters, running on your phone.**

Built on the incredible open source work of the [Qwen team](https://huggingface.co/Qwen) and the broader open model community. Open weights make this possible — we compress and specialize what you generously share.
| What | Proof |
|------|-------|
| **2.6GB code model for iPhone** | [qwen3.5-4b-code-forged-GGUF](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged-GGUF) — HumanEval: **63/85 passing (74.1%)**, 70% on hard problems, benchmark still running |
| **Sonnet 4.6-level on MacBook** | [qwen3.5-27b-code-forged-mlx-4bit](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged-mlx-4bit) — 15GB, 9 tok/s on M1 32GB |
| **35B MoE in 1.8GB** | [qwen3.5-35b-a3b-compacted-GGUF](https://huggingface.co/continuum-ai/qwen3.5-35b-a3b-compacted-GGUF) — 256 experts pruned to 16 |
| **+24% better at code** | [qwen3.5-4b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged) — perplexity 3.04 to 2.31 after forging |
We target every device tier. Same technique, different compaction levels. **Be competitive at ANY size.**

### Device Targets
| Device | RAM | Our Model | Size |
|--------|-----|-----------|------|
| RTX 5090 | 32GB | qwen3.5-27b-code-forged (fp16) | 17GB |
| MacBook Pro 32GB | 32GB | qwen3.5-27b-code-forged-mlx-4bit | 15GB |
| RTX 3090 | 24GB | qwen3.5-27b-code-forged (4-bit) | 17GB |
| MacBook Air 16GB | 16GB | qwen3.5-4b-code-forged Q8_0 | 4.2GB |
| iPhone 17 / Android | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| MacBook Air 8GB | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| Raspberry Pi 5 | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| **Roomba j7+** | **8GB** | **qwen3.5-4b-code-forged Q4_K_M** | **2.6GB** |
Yes, really. The iRobot Roomba j7+ has a Qualcomm QCS6490 with 8GB RAM — the same memory budget as an iPhone 17. Our 2.6GB Q4_K_M model fits with room to spare. Any ARM SoC with 4GB+ RAM can run these models via [llama.cpp](https://github.com/ggml-org/llama.cpp).
## Published Models

### Qwen3.5 — Forged (Code Domain)

| Model | Base | Domain | Improvement | Size | Runs On |
|-------|------|--------|-------------|------|---------|
| **[qwen3.5-27b-code-forged-mlx-4bit](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged-mlx-4bit)** | Qwen3.5-27B | Code | +3.5% | **15GB** | **MacBook Pro 32GB (9 tok/s)** |
| [qwen3.5-27b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged) | Qwen3.5-27B | Code | +3.5% | 17GB (4-bit) | RTX 3090/4090/5090 |
| [qwen3.5-27b-code-forged-defragged](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged-defragged) | Qwen3.5-27B | Code | +3.9% | Smaller | RTX 3090/4090/5090 |
| **[qwen3.5-4b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged)** | Qwen3.5-4B | Code | **+26.6%** | 8GB | **Any GPU / MacBook** |
| **[qwen3.5-4b-code-forged-GGUF](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged-GGUF)** | Qwen3.5-4B | Code | **+26.6%** | **2.6GB Q4** | **iPhone 17, MacBook Air 8GB** |
| [qwen3.5-4b-code-forged-defragged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged-defragged) | Qwen3.5-4B | Code | +33% | Smaller | Any GPU / MacBook |
### Qwen3.5 — Compacted (Expert Pruning)

| Model | Original | Method | Reduction | Runs On |
|-------|----------|--------|-----------|---------|
| **[qwen3.5-35b-a3b-compacted](https://huggingface.co/continuum-ai/qwen3.5-35b-a3b-compacted)** | Qwen3.5-35B-A3B (256 experts) | Expert pruning to 16 experts | **49GB to 11GB** | RTX 3090/4090/5090 |
| [qwen3.5-35b-a3b-compacted-GGUF](https://huggingface.co/continuum-ai/qwen3.5-35b-a3b-compacted-GGUF) | Same | GGUF Q2_K/Q4_K_M | **1.8GB / 2.7GB** | iPhone / MacBook Air |
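The expert-pruning idea can be sketched independently of the real pipeline. A minimal illustration, assuming a simple top-1 routing-frequency statistic over a calibration set; `prune_experts` and all shapes here are hypothetical, not the sentinel-ai implementation:

```python
import numpy as np

def prune_experts(router_logits, expert_weights, keep=16):
    """Rank experts by how often the router picks them, keep the top-k.

    router_logits: (tokens, num_experts) routing scores from calibration data.
    expert_weights: list of per-expert weight arrays.
    """
    top1 = router_logits.argmax(axis=1)                  # winning expert per token
    counts = np.bincount(top1, minlength=router_logits.shape[1])
    keep_ids = np.sort(np.argsort(counts)[::-1][:keep])  # most-used experts, original order
    return keep_ids, [expert_weights[i] for i in keep_ids]

# 256 toy experts scored on a toy calibration batch
rng = np.random.default_rng(0)
logits = rng.normal(size=(10_000, 256))
experts = [rng.normal(size=(8, 8)) for _ in range(256)]
keep_ids, kept = prune_experts(logits, experts, keep=16)
print(len(kept))  # 16 experts survive; the router would then be re-indexed to match
```

After selection, the retraining cycles described below let the surviving experts absorb the pruned capacity.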
### Qwen2.5 — Compacted (Head + Expert Pruning)

| Model | Original | Method | Reduction |
|-------|----------|--------|-----------|
| [qwen2.5-coder-32b-compacted](https://huggingface.co/continuum-ai/qwen2.5-coder-32b-compacted) | Qwen2.5-Coder-32B | Head pruning + mixed quant | 67GB to 14GB |
| [qwen2.5-coder-14b-compacted](https://huggingface.co/continuum-ai/qwen2.5-coder-14b-compacted) | Qwen2.5-Coder-14B | Head pruning + mixed quant | 27GB to 9GB |
### Scaling Law Experiments

| Model | Params | Improvement | Notes |
|-------|--------|-------------|-------|
| [qwen2.5-0.5b-general-forged](https://huggingface.co/continuum-ai/qwen2.5-0.5b-general-forged) | 0.5B | -3.2% | Too small — already maximally compressed |
| [qwen2.5-1.5b-general-forged](https://huggingface.co/continuum-ai/qwen2.5-1.5b-general-forged) | 1.5B | +2.4% | Improvement begins |
| [qwen2.5-3b-general-forged](https://huggingface.co/continuum-ai/qwen2.5-3b-general-forged) | 3.1B | +0.4% | Marginal on generic text |
Larger models harbor more redundancy, and thus more room for plasticity to improve them. Domain-specific training (code) amplifies the effect dramatically compared with generic text.
## Run on MacBook (2 Commands)

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Sonnet 4.6-level model, 15GB, runs on any 32GB Mac
model, tokenizer = load("continuum-ai/qwen3.5-27b-code-forged-mlx-4bit")
print(generate(model, tokenizer, prompt="def merge_sort(arr):", max_tokens=200))
```
Works on MacBook Pro, MacBook Air (16GB+ for smaller models), Mac Mini, iMac. ~9 tok/s on M1 32GB. Faster on M2/M3/M4.
## Forge Your Own

Three commands. Any NVIDIA GPU with 8GB+ VRAM.

```bash
git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
source .venv/bin/activate
python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code
```

Auto-detects GPU, picks memory tier (fp16 / 4-bit), trains with LoRA + AMP, prunes, defrags, saves. Observable progress via `status.json`. Works on RTX 3090, 4090, 5090.
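The tier-picking step might look roughly like this. The `pick_tier` helper and every threshold below are hypothetical, for illustration only; `forge_model.py` makes this decision automatically from the detected GPU:

```python
def pick_tier(vram_gb: float) -> dict:
    """Map detected VRAM to a training configuration (illustrative cutoffs)."""
    if vram_gb >= 24:
        return {"precision": "fp16", "quantize_base": False, "batch": 4}
    if vram_gb >= 12:
        # not enough room for a full fp16 base: 4-bit base + LoRA adapters
        return {"precision": "fp16", "quantize_base": True, "batch": 2}
    # 8GB floor: smallest batch, quantized base
    return {"precision": "fp16", "quantize_base": True, "batch": 1}

print(pick_tier(24.0))  # e.g. an RTX 3090
print(pick_tier(8.0))   # entry-level tier
```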
## The Science

### Experiential Plasticity

Not compression. **Architectural optimization.** The model's structure co-evolves with its training:

1. Train on domain data (LoRA + AMP mixed precision)
2. Measure each attention head's information contribution
3. Prune heads that don't contribute to the domain
4. Retrain — surviving heads specialize and compensate
5. Defrag — structurally remove dead heads, free VRAM
6. Repeat — each cycle, the model improves
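In miniature, one cycle of the loop above can be sketched like this, with toy gate values standing in for measured head contributions (a simplified illustration, not the sentinel-ai implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
gates = rng.uniform(0.1, 1.0, size=32)  # per-head gate values (toy stand-in)

def plasticity_cycle(gates, prune_frac=0.25, recovery=0.1):
    """One prune/retrain cycle: drop the lowest-contribution heads,
    then let survivors compensate (modeled here as a small gate boost)."""
    keep = gates >= np.quantile(gates, prune_frac)  # steps 2-3: measure, prune
    return gates[keep] * (1 + recovery)             # step 4: retraining compensates

for cycle in range(3):                              # step 6: repeat
    gates = plasticity_cycle(gates)
    print(f"cycle {cycle + 1}: {gates.size} heads remain")
```

Each pass leaves fewer, stronger heads; in the real pipeline the "boost" comes from actual retraining and the pruned slots are physically removed by defrag.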
### Transfer Function

Recovery from pruning follows a measurable exponential: `1.45 * exp(-0.18 * cycle) - 0.03`. This connects transformer optimization to classical control theory — the same math used in electrical engineering and robotics for decades.
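Evaluating the fitted curve shows the shape directly (a plain transcription of the formula above; units and normalization follow the paper):

```python
import math

def recovery(cycle: int) -> float:
    """Recovery term after a pruning cycle, per the fitted transfer function."""
    return 1.45 * math.exp(-0.18 * cycle) - 0.03

for c in range(6):
    print(f"cycle {c}: {recovery(c):.3f}")
```

The exponential decay means later cycles perturb the model less and less, settling toward the small negative asymptote.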
### Continuous Defrag

Traditional pruning masks heads but keeps tensor sizes unchanged. Continuous defrag slices the actual Q/K/V/O weight matrices — the model gets physically smaller between cycles:

```
Cycle 1: 27B params, 17.9GB -> prune -> defrag -> freed 1.7GB
Cycle 2: 24.5B, 16.2GB, batch=2 -> prune -> defrag -> freed 1.7GB (2x faster)
Cycle 3: 22B, 14.5GB, batch=3 (2.8x faster)
```

40% faster total training. 33% smaller final model.
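The slicing itself can be sketched with toy shapes (real checkpoints carry more bookkeeping, such as biases and fused QKV layouts; `defrag_qkvo` is a hypothetical name):

```python
import numpy as np

def defrag_qkvo(W, head_dim, keep_heads):
    """Physically slice a projection matrix down to its surviving heads.

    W: (num_heads * head_dim, d_model) weight for Q, K, or V
    (for the O projection the head axis runs along columns instead).
    """
    rows = np.concatenate([np.arange(h * head_dim, (h + 1) * head_dim)
                           for h in sorted(keep_heads)])
    return W[rows]                 # a genuinely smaller tensor, not a mask

W_q = np.zeros((8 * 64, 512))      # 8 heads of dim 64
W_small = defrag_qkvo(W_q, head_dim=64, keep_heads=[0, 2, 3, 5, 6])
print(W_small.shape)               # (320, 512): 5 heads actually remain
```

Because the tensor shrinks, subsequent forward and backward passes are cheaper, which is where the per-cycle speedups in the log above come from.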
### Head Mitosis

Pruning frees slots. Mitosis fills them. When a head is overutilized (high information contribution), it gets cloned into a pruned slot — each copy initialized at 50% gate value to maintain output continuity. After continued training, the clones **diverge and specialize**, just like cell differentiation after biological mitosis.

Experimentally: a cloned head diverged within 500 steps, with the clone achieving *higher* utilization than the parent in its new role. The model grows new specialized capacity exactly where it's needed.
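The continuity argument for the 50% gate initialization is easy to verify numerically (toy shapes, gates modeled as plain scalars):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 512))     # overutilized head's output weights (toy)
gate = 1.0

# Mitosis: clone the head into a freed slot, halve both gates.
W_parent, W_clone = W.copy(), W.copy()
g_parent = g_clone = gate / 2      # 50% each preserves the summed output

x = rng.normal(size=512)
before = gate * (W @ x)
after = g_parent * (W_parent @ x) + g_clone * (W_clone @ x)
print(np.allclose(before, after))  # True: the split is invisible to the output
```

From this identical starting point, gradient noise and continued training are what drive the two copies apart into specialized roles.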
### Self-Directed Controller

The `AdaptivePlasticityController` observes the model and makes all decisions — pruning ratio, strategy, training budget, stopping criteria. No human hyperparameters needed.
## Papers

- **[Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md)** — Scaling law, transfer function, self-directed controller, domain forging, continuous defrag
- **[Neural Plasticity in Transformers](https://github.com/CambrianTech/continuum/blob/main/docs/papers/SENTINEL-AI-NEURAL-PLASTICITY.md)** — Foundation paper with cross-architecture results
- **[Plasticity Compaction](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION-MOE.md)** — MoE expert pruning (67GB to 14GB)

## Links

- [sentinel-ai](https://github.com/CambrianTech/sentinel-ai) — Open source forge framework (MIT)
- [continuum](https://github.com/CambrianTech/continuum) — Distributed AI on consumer hardware
- [@cambrian](https://x.com/joelteply) — Updates and demos