Text Generation
MLX
Safetensors
Rust
qwen2
7b
agentic-coding
android
apple-silicon
attested
bash
c
chain-of-custody
chinese
code
code-completion
code-generation
code-infill
compacted
compensation-lora
consumer-gpu
cpp
cryptographically-verified
css
distillation
edge-inference
efficient
embedded
english
forge-alloy
function-calling
general
general-purpose
go
head-pruning
html
iphone
java
javascript
knowledge-distillation
kotlin
llama-cpp
lm-studio
local-inference
lora
macbook
mobile
multilingual
ollama
on-device
optimized
php
pruned
python
qwen
qwen-coder
qwen2.5
qwen2.5-coder
raspberry-pi
reproducible
ruby
sql
swift
teacher-student
typescript
validation-artifact
versatile
conversational
card: rename to qwen2.5-coder-7b-compacted source-aligned + org front-door footer
README.md
CHANGED
@@ -116,6 +116,41 @@ The Factory configurator lets you design and forge custom models visually — co
[GitHub](https://github.com/CambrianTech/continuum) · [All Models](https://huggingface.co/continuum-ai) · [Forge-Alloy](https://github.com/CambrianTech/forge-alloy)

---

## More from continuum-ai

`continuum-ai` ships **structurally compacted models** for hardware tiers nobody else targets. Every artifact is calibration-aware, hardware-anchored, and shipped with [ForgeAlloy](https://github.com/CambrianTech/forge-alloy) cryptographic provenance — the per-problem benchmark JSONLs are uploaded with sha256 hashes recorded in the alloy so anyone can re-score against the same anchor without trusting the producer's claim.
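
The re-scoring workflow this enables is simple to sketch. Note the `artifacts` dict layout below is an assumption for illustration, not ForgeAlloy's actual alloy schema:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_benchmark_artifacts(alloy: dict, root: Path) -> list:
    """Return the names of benchmark artifacts whose on-disk hash does not
    match the hash recorded in the alloy. An empty list means it is safe to
    re-score against the same anchor the producer used."""
    return [
        name
        for name, meta in alloy["artifacts"].items()
        if sha256_of(root / meta["path"]) != meta["sha256"]
    ]
```

Verifying before re-scoring means the producer's benchmark numbers can be checked against the exact per-problem records rather than a summary figure.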
### Currently shipped

| Model | Base | HumanEval (vs base) | Tier | What's new |
|---|---|---|---|---|
| [**qwen3-coder-30b-a3b-compacted-19b-256k**](https://huggingface.co/continuum-ai/qwen3-coder-30b-a3b-compacted-19b-256k) | Qwen3-Coder-30B-A3B-Instruct | **88.4** (base 92.1, Δ −3.7) | **12 GB Q4_K_M** | First 30B-class coder that fits a 12 GB consumer GPU. Calibration-aware MoE expert pruning (§4.1.3.4). 256K context. |
| [**qwen2.5-coder-7b-compacted**](https://huggingface.co/continuum-ai/qwen2.5-coder-7b-compacted) | Qwen2.5-Coder-7B | 61.0 (base 62.2, Δ −1.2) | 16 GB fp16 | Methodology validation artifact for §4.1.3.3 (compensation LoRA closes the dense-head pruning gap to within ±3pt of base). |
### Forge methodology in one paragraph

A prunable unit's importance MUST be derived from **task-conditioned activation profiling on a held-out corpus** that reflects the artifact's intended workload. Architectural-only metrics (router gate norms, weight norms, magnitudes) are first-pass shortcuts that systematically underperform task-specific activation metrics — empirically validated at two structurally distinct units (dense heads in §4.1.3.1, MoE experts in §4.1.3.4). When the metric is calibration-aware, the surviving subset of heads/experts maps to the workload, and the structural compaction lands close to the unmodified base in held-out benchmarks before any compensation training. When the metric is architectural-only, the surviving subset is task-misaligned and the gap is large enough that compensation LoRA becomes a hard prerequisite. **Get the metric right; the artifact follows.** Full methodology in [PLASTICITY-COMPACTION.md](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION.md).
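
The contrast between the two metric families can be made concrete with a toy numpy sketch. This is illustration only: the tensor shapes, hook placement, and keep ratio are assumptions, not the forge's actual profiling code.

```python
import numpy as np

def head_importance(attn_outputs: np.ndarray) -> np.ndarray:
    """Task-conditioned metric: mean |activation| per head, where
    attn_outputs (batch, heads, seq, head_dim) was captured by a forward
    hook while running a held-out, workload-representative corpus."""
    return np.abs(attn_outputs).mean(axis=(0, 2, 3))

def weight_norm_importance(o_proj: np.ndarray, n_heads: int) -> np.ndarray:
    """Architectural-only baseline: per-head L2 norm of the output
    projection. Corpus-free and cheap, but blind to which heads the
    workload actually exercises."""
    return np.linalg.norm(o_proj.reshape(n_heads, -1), axis=1)

def heads_to_prune(importance: np.ndarray, keep_ratio: float = 0.75) -> np.ndarray:
    """Indices of the lowest-importance heads to remove."""
    n_drop = int(len(importance) * (1 - keep_ratio))
    return np.argsort(importance)[:n_drop]
```

The point of the first function is that importance is a function of the calibration data, not just the weights; on a real model the activation tensor comes from hooks over the held-out corpus.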
### Roadmap

The structurally-pruned-MoE quadrant of HuggingFace is **empty for every frontier model**. Quantization is everywhere; structural pruning is nowhere. The next two artifacts target the empty room directly.

| Target | Base size | Projected | License | Headline |
|---|---|---|---|---|
| **Mixtral 8x22B Instruct v0.1** | 141B | ~70B post-prune → ~22 GB Q4_K_M | Apache-2.0 | First single-GPU Mixtral 8x22B (RTX 5090). 2-year-overdue Pareto win on the textbook MoE candidate nobody has ever expert-pruned. |
| **Qwen3-Coder-480B-A35B-Instruct** | 480B | ~150B post-prune → ~50 GB Q4_K_M | Apache-2.0 | First consumer-accessible 480B-class coder. Single Mac M3 Max 64 GB OR a 2× consumer GPU grid. The grid moonshot — same family as qwen3-coder-30b-a3b, methodology ports directly. |

**The hard prerequisite for both:** LiveCodeBench v6 anchor extension to `eval_with_calibration.py`. HumanEval is no longer reported on frontier model cards — Qwen3-Coder, DeepSeek-V3.1, and Mixtral 8x22B all use SWE-bench / LiveCodeBench / Aider-Polyglot. Without LCB v6 wired up, frontier targets are blocked at the §4.1.4.1 calibration discipline gate. ~1-2 days of mechanical pipeline work.
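
What "wiring up an anchor" amounts to can be sketched as a registry entry plus a scorer. Every name here, including `BenchmarkAnchor` and the stub scorer, is hypothetical and not the real `eval_with_calibration.py` API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class BenchmarkAnchor:
    """One scoring anchor: the per-problem records plus a pass/fail judge."""
    name: str
    problems_path: str                    # JSONL of per-problem records
    score: Callable[[dict, str], bool]    # (problem, completion) -> passed?

ANCHORS: dict = {}

def register(anchor: BenchmarkAnchor) -> None:
    ANCHORS[anchor.name] = anchor

def lcb_v6_score(problem: dict, completion: str) -> bool:
    """Stub: in a real pipeline this would execute the completion against
    the problem's hidden tests in a sandbox (the mechanical part)."""
    raise NotImplementedError

register(BenchmarkAnchor(
    name="livecodebench-v6",
    problems_path="anchors/livecodebench_v6.jsonl",
    score=lcb_v6_score,
))
```

Once registered, the same calibration and re-scoring machinery runs unchanged over the new benchmark's JSONL.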

**Compensation LoRA v2 of qwen3-coder-30b-a3b** (the dense-head §4.1.3.3 closure pattern, now applied at the MoE expert level to push 88.4 → projected 90+) is blocked on transformers' `caching_allocator_warmup` pre-allocating an fp16 buffer equal to the full model size before bnb 4-bit quantization takes effect, which exceeds total VRAM on a single 32 GB GPU. The architecturally correct fix — offline teacher-logit precomputation — is the next sentinel-ai sprint after LCB v6.
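
The offline-precomputation fix follows a standard pattern, sketched below as a minimal numpy illustration; this is not sentinel-ai code, and `k`, the shapes, and the zero-mass-outside-top-k approximation are assumptions:

```python
import numpy as np

def teacher_topk(logits: np.ndarray, k: int = 32):
    """One offline teacher pass: keep only the top-k logits per position.
    logits: (seq, vocab). Returns (indices, values), each (seq, k); that is
    all the student's KD loss needs, at roughly k/vocab of the storage cost,
    and the teacher never shares VRAM with the student afterwards."""
    idx = np.argpartition(logits, -k, axis=-1)[:, -k:]
    vals = np.take_along_axis(logits, idx, axis=-1)
    return idx, vals

def sparse_kd_targets(idx: np.ndarray, vals: np.ndarray, vocab: int,
                      temperature: float = 2.0) -> np.ndarray:
    """Rebuild (seq, vocab) soft targets from the stored top-k at train
    time. Probability mass outside the top-k is approximated as zero."""
    z = np.full((idx.shape[0], vocab), -np.inf)
    np.put_along_axis(z, idx, vals / temperature, axis=-1)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

The design choice is to pay disk for VRAM: the teacher runs once, its top-k logits are serialized, and the distillation job never instantiates the teacher at all, so the warmup buffer problem disappears.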
### What we DON'T target

The Llama 3.3 70B slot is saturated (six publishers, every quant level). We're not shipping a third compacted MoE in the middle tier. The lab's brand pitch is **models that no individual hardware tier can run, made runnable by structural compaction + grid distribution** — frontier headlines, not catalog filler. That's the intersection only continuum has, and it's where the empty room is.
## License

apache-2.0