EnricoFermi committed
Commit 00c09bc · verified · 1 Parent(s): 7e5e130

card: rename to qwen2.5-coder-7b-compacted source-aligned + org front-door footer

Files changed (1): README.md (+35 −0)
README.md CHANGED
@@ -116,6 +116,41 @@ The Factory configurator lets you design and forge custom models visually — co
 
 
 [GitHub](https://github.com/CambrianTech/continuum) · [All Models](https://huggingface.co/continuum-ai) · [Forge-Alloy](https://github.com/CambrianTech/forge-alloy)
 
+
+ ---
+
+ ## More from continuum-ai
+
+ `continuum-ai` ships **structurally compacted models** for hardware tiers nobody else targets. Every artifact is calibration-aware, hardware-anchored, and shipped with [ForgeAlloy](https://github.com/CambrianTech/forge-alloy) cryptographic provenance: the per-problem benchmark JSONLs are uploaded with sha256 hashes recorded in the alloy, so anyone can re-score against the same anchor without trusting the producer's claim.
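The re-scoring check described above amounts to a few lines of Python. This is a minimal sketch only: the manifest layout (`{"artifacts": {filename: digest}}`) is a hypothetical stand-in for the actual alloy schema, not the ForgeAlloy format itself.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large JSONLs never need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_anchor(manifest_path: Path, data_dir: Path) -> bool:
    """Re-check every benchmark JSONL against the digest recorded in the alloy.

    The manifest layout here is a hypothetical illustration of the idea,
    not the real ForgeAlloy schema.
    """
    manifest = json.loads(manifest_path.read_text())
    return all(
        sha256_of(data_dir / name) == digest
        for name, digest in manifest["artifacts"].items()
    )
```

Any mismatch, whether a tampered problem set or a silently updated benchmark file, makes the verification fail, which is what lets a third party re-score without trusting the producer.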
+
+ ### Currently shipped
+
+ | Model | Base | HumanEval (vs base) | Tier | What's new |
+ |---|---|---|---|---|
+ | [**qwen3-coder-30b-a3b-compacted-19b-256k**](https://huggingface.co/continuum-ai/qwen3-coder-30b-a3b-compacted-19b-256k) | Qwen3-Coder-30B-A3B-Instruct | **88.4** (base 92.1, Δ −3.7) | **12 GB Q4_K_M** | First 30B-class coder that fits a 12 GB consumer GPU. Calibration-aware MoE expert pruning (§4.1.3.4). 256K context. |
+ | [**qwen2.5-coder-7b-compacted**](https://huggingface.co/continuum-ai/qwen2.5-coder-7b-compacted) | Qwen2.5-Coder-7B | 61.0 (base 62.2, Δ −1.2) | 16 GB fp16 | Methodology validation artifact for §4.1.3.3 (compensation LoRA closes the dense-head pruning gap to within ±3 pt of base). |
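The 12 GB tier claim in the table is easy to sanity-check from first principles. A back-of-envelope sketch, assuming Q4_K_M averages roughly 4.85 bits per weight (the real figure varies per tensor, so treat this as an estimate, not the exact artifact size):

```python
def quantized_size_gb(n_params: float, bits_per_weight: float = 4.85) -> float:
    """Rough weight footprint of a quantized model, ignoring KV cache and
    runtime overhead. 4.85 bits/weight is a typical Q4_K_M average; the
    exact per-tensor mix differs, so this is an estimate only."""
    return n_params * bits_per_weight / 8 / 1e9


# A 19B post-prune model at Q4_K_M lands around 11-12 GB of weights,
# which is why it fits the 12 GB tier (before KV cache is counted).
```

The same arithmetic explains the fp16 row: a 7B model at 16 bits/weight is about 14 GB of weights, hence the 16 GB tier.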
+
+ ### Forge methodology in one paragraph
+
+ A prunable unit's importance MUST be derived from **task-conditioned activation profiling on a held-out corpus** that reflects the artifact's intended workload. Architectural-only metrics (router gate norms, weight norms, magnitudes) are first-pass shortcuts that systematically underperform task-specific activation metrics — empirically validated at two structurally distinct units (dense heads in §4.1.3.1, MoE experts in §4.1.3.4). When the metric is calibration-aware, the surviving subset of heads/experts maps to the workload, and the structural compaction lands close to the unmodified base on held-out benchmarks before any compensation training. When the metric is architectural-only, the surviving subset is task-misaligned and the gap is large enough that compensation LoRA becomes a hard prerequisite. **Get the metric right; the artifact follows.** Full methodology in [PLASTICITY-COMPACTION.md](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION.md).
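The metric distinction in the paragraph above can be made concrete in a few lines. This is a minimal sketch, not the ForgeAlloy implementation: the calibration-aware score ranks units by mean absolute activation over a task-matched corpus, while the architectural-only baseline ranks by weight norm with no task signal at all.

```python
import numpy as np


def activation_importance(acts: np.ndarray) -> np.ndarray:
    """Calibration-aware score: mean |activation| per prunable unit
    (head or expert), measured over a held-out corpus that matches the
    target workload. acts has shape [n_calibration_tokens, n_units]."""
    return np.abs(acts).mean(axis=0)


def weight_norm_importance(weights: np.ndarray) -> np.ndarray:
    """Architectural-only baseline: per-unit weight norm, no task signal.
    weights has shape [n_units, d]."""
    return np.linalg.norm(weights, axis=1)


def survivors(importance: np.ndarray, keep: int) -> np.ndarray:
    """Indices of the `keep` highest-importance units; the compacted
    model retains exactly these and drops the rest."""
    return np.argsort(importance)[::-1][:keep]
```

The two scores can disagree badly: a unit with large weights that the target workload never activates ranks high under the norm metric and low under the activation metric, which is exactly the task misalignment the paragraph describes.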
+
+ ### Roadmap
+
+ The structurally-pruned-MoE quadrant of HuggingFace is **empty for every frontier model**. Quantization is everywhere; structural pruning is nowhere. The next two artifacts target that empty room directly.
+
+ | Target | Base size | Projected | License | Headline |
+ |---|---|---|---|---|
+ | **Mixtral 8x22B Instruct v0.1** | 141B | ~70B post-prune → ~22 GB Q4_K_M | Apache-2.0 | First single-GPU Mixtral 8x22B (RTX 5090). A two-years-overdue Pareto win on the textbook MoE candidate nobody has ever expert-pruned. |
+ | **Qwen3-Coder-480B-A35B-Instruct** | 480B | ~150B post-prune → ~50 GB Q4_K_M | Apache-2.0 | First consumer-accessible 480B-class coder. A single Mac M3 Max 64 GB or a 2× consumer-GPU grid. The grid moonshot — same family as qwen3-coder-30b-a3b, so the methodology ports directly. |
+
+ **The hard prerequisite for both:** extending `eval_with_calibration.py` with a LiveCodeBench v6 anchor. HumanEval is no longer reported on frontier model cards — Qwen3-Coder, DeepSeek-V3.1, and Mixtral 8x22B all report SWE-bench / LiveCodeBench / Aider-Polyglot instead. Until LCB v6 is wired up, frontier targets are blocked at the §4.1.4.1 calibration-discipline gate. Roughly 1–2 days of mechanical pipeline work.
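What "wiring up" a new anchor amounts to can be sketched as a registry entry. Everything below (the `Anchor` dataclass, field names, the registry) is a hypothetical illustration of the shape of the work, not the actual `eval_with_calibration.py` interface:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Anchor:
    """One benchmark anchor: where the problems live, the digest that pins
    the exact problem set (ForgeAlloy-style), and how a sample is scored.
    Hypothetical sketch; field names are illustrative."""
    name: str
    jsonl: str                              # per-problem JSONL shipped with the model
    sha256: str                             # digest recorded in the alloy
    score_fn: Callable[[dict, str], bool]   # (problem, completion) -> pass/fail


ANCHORS: dict[str, Anchor] = {}


def register(anchor: Anchor) -> None:
    ANCHORS[anchor.name] = anchor


# Adding LCB v6 would then be one registration plus its scorer; the
# scorer here is a placeholder, not a real LiveCodeBench harness.
register(Anchor(
    name="livecodebench-v6",
    jsonl="lcb_v6.jsonl",
    sha256="<recorded in alloy>",
    score_fn=lambda problem, completion: False,  # placeholder
))
```

The mechanical part is the scorer and the JSONL export; the registry entry itself is a one-liner, which is consistent with the 1–2 day estimate.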
+
+ **Compensation LoRA v2 of qwen3-coder-30b-a3b** (the dense-head §4.1.3.3 closure pattern, applied at the MoE expert level to push 88.4 to a projected 90+) is blocked: transformers' `caching_allocator_warmup` pre-allocates an fp16 buffer equal to the full model size before bnb 4-bit quantization takes effect, exceeding total VRAM on a single 32 GB GPU. The architecturally correct fix, offline teacher-logit precomputation, is the next sentinel-ai sprint after LCB v6.
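The offline fix is conceptually simple: run the unpruned teacher once, cache only the top-k logits per token, and train the compensation LoRA against the cached targets so the teacher never occupies VRAM during training. A sketch of the caching side, with illustrative names and shapes (not the sentinel-ai code):

```python
import numpy as np


def topk_logits(logits: np.ndarray, k: int = 8):
    """Compress one sequence of teacher logits [seq, vocab] to per-token
    top-k (indices, values). For a large vocab this keeps well under 1%
    of the full logit tensor, so the whole corpus can be cached on disk
    and the teacher only runs once, offline."""
    idx = np.argsort(logits, axis=-1)[:, -k:]        # [seq, k] top-k token ids
    val = np.take_along_axis(logits, idx, axis=-1)   # [seq, k] their logits
    return idx.astype(np.int32), val.astype(np.float16)


def distill_targets(idx: np.ndarray, val: np.ndarray, temperature: float = 1.0):
    """Sparse soft targets for the student's distillation loss, rebuilt
    from the cache at training time (no teacher forward pass needed)."""
    z = val.astype(np.float32) / temperature
    z -= z.max(axis=-1, keepdims=True)               # stable softmax over the k slots
    p = np.exp(z)
    return idx, p / p.sum(axis=-1, keepdims=True)
```

Restricting the softmax to the cached k slots is a standard distillation approximation; the trade-off is losing the tail of the teacher distribution in exchange for never co-residing teacher and student in VRAM.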
+
+ ### What we DON'T target
+
+ The Llama 3.3 70B slot is saturated (six publishers, every quant level), so we're not shipping another compacted model into the middle tier. The lab's brand pitch is **models that no individual hardware tier can run, made runnable by structural compaction + grid distribution** — frontier headlines, not catalog filler. That's the intersection only continuum occupies, and it's where the empty room is.
+
 
 ## License
 
 apache-2.0