EnricoFermi committed
Commit 00c09bc · verified · 1 Parent(s): 7e5e130

card: rename to qwen2.5-coder-7b-compacted source-aligned + org front-door footer

Files changed (1): README.md (+35 −0)
README.md CHANGED
@@ -116,6 +116,41 @@ The Factory configurator lets you design and forge custom models visually — co
 
 
 [GitHub](https://github.com/CambrianTech/continuum) · [All Models](https://huggingface.co/continuum-ai) · [Forge-Alloy](https://github.com/CambrianTech/forge-alloy)
 
+
+ ---
+
+ ## More from continuum-ai
+
+ `continuum-ai` ships **structurally compacted models** for hardware tiers nobody else targets. Every artifact is calibration-aware, hardware-anchored, and shipped with [ForgeAlloy](https://github.com/CambrianTech/forge-alloy) cryptographic provenance: the per-problem benchmark JSONLs are uploaded with sha256 hashes recorded in the alloy, so anyone can re-score against the same anchor without trusting the producer's claim.
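The re-scoring check described above amounts to a few lines of Python. This is a minimal sketch only: the manifest layout (`{"artifacts": {filename: digest}}`) is a hypothetical stand-in for the actual alloy schema, not the ForgeAlloy format itself.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large JSONLs never need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_anchor(manifest_path: Path, data_dir: Path) -> bool:
    """Re-check every benchmark JSONL against the digest recorded in the alloy.

    The manifest layout here is a hypothetical illustration of the idea,
    not the real ForgeAlloy schema.
    """
    manifest = json.loads(manifest_path.read_text())
    return all(
        sha256_of(data_dir / name) == digest
        for name, digest in manifest["artifacts"].items()
    )
```

Any mismatch, whether a tampered problem set or a silently updated benchmark file, makes the verification fail, which is what lets a third party re-score without trusting the producer.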
+
+ ### Currently shipped
+
+ | Model | Base | HumanEval (vs base) | Tier | What's new |
+ |---|---|---|---|---|
+ | [**qwen3-coder-30b-a3b-compacted-19b-256k**](https://huggingface.co/continuum-ai/qwen3-coder-30b-a3b-compacted-19b-256k) | Qwen3-Coder-30B-A3B-Instruct | **88.4** (base 92.1, Δ −3.7) | **12 GB Q4_K_M** | First 30B-class coder that fits a 12 GB consumer GPU. Calibration-aware MoE expert pruning (§4.1.3.4). 256K context. |
+ | [**qwen2.5-coder-7b-compacted**](https://huggingface.co/continuum-ai/qwen2.5-coder-7b-compacted) | Qwen2.5-Coder-7B | 61.0 (base 62.2, Δ −1.2) | 16 GB fp16 | Methodology validation artifact for §4.1.3.3 (compensation LoRA closes the dense-head pruning gap to within ±3 pt of base). |
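The 12 GB tier claim in the table is easy to sanity-check from first principles. A back-of-envelope sketch, assuming Q4_K_M averages roughly 4.85 bits per weight (the real figure varies per tensor, so treat this as an estimate, not the exact artifact size):

```python
def quantized_size_gb(n_params: float, bits_per_weight: float = 4.85) -> float:
    """Rough weight footprint of a quantized model, ignoring KV cache and
    runtime overhead. 4.85 bits/weight is a typical Q4_K_M average; the
    exact per-tensor mix differs, so this is an estimate only."""
    return n_params * bits_per_weight / 8 / 1e9


# A 19B post-prune model at Q4_K_M lands around 11-12 GB of weights,
# which is why it fits the 12 GB tier (before KV cache is counted).
```

The same arithmetic explains the fp16 row: a 7B model at 16 bits/weight is about 14 GB of weights, hence the 16 GB tier.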
+
+ ### Forge methodology in one paragraph
+
+ A prunable unit's importance MUST be derived from **task-conditioned activation profiling on a held-out corpus** that reflects the artifact's intended workload. Architectural-only metrics (router gate norms, weight norms, magnitudes) are first-pass shortcuts that systematically underperform task-specific activation metrics — empirically validated at two structurally distinct units (dense heads in §4.1.3.1, MoE experts in §4.1.3.4). When the metric is calibration-aware, the surviving subset of heads/experts maps to the workload, and the structural compaction lands close to the unmodified base on held-out benchmarks before any compensation training. When the metric is architectural-only, the surviving subset is task-misaligned and the gap is large enough that compensation LoRA becomes a hard prerequisite. **Get the metric right; the artifact follows.** Full methodology in [PLASTICITY-COMPACTION.md](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION.md).
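The metric distinction in the paragraph above can be made concrete in a few lines. This is a minimal sketch, not the ForgeAlloy implementation: the calibration-aware score ranks units by mean absolute activation over a task-matched corpus, while the architectural-only baseline ranks by weight norm with no task signal at all.

```python
import numpy as np


def activation_importance(acts: np.ndarray) -> np.ndarray:
    """Calibration-aware score: mean |activation| per prunable unit
    (head or expert), measured over a held-out corpus that matches the
    target workload. acts has shape [n_calibration_tokens, n_units]."""
    return np.abs(acts).mean(axis=0)


def weight_norm_importance(weights: np.ndarray) -> np.ndarray:
    """Architectural-only baseline: per-unit weight norm, no task signal.
    weights has shape [n_units, d]."""
    return np.linalg.norm(weights, axis=1)


def survivors(importance: np.ndarray, keep: int) -> np.ndarray:
    """Indices of the `keep` highest-importance units; the compacted
    model retains exactly these and drops the rest."""
    return np.argsort(importance)[::-1][:keep]
```

The two scores can disagree badly: a unit with large weights that the target workload never activates ranks high under the norm metric and low under the activation metric, which is exactly the task misalignment the paragraph describes.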
+
+ ### Roadmap
+
+ The structurally-pruned-MoE quadrant of HuggingFace is **empty for every frontier model**. Quantization is everywhere; structural pruning is nowhere. The next two artifacts target that empty room directly.
+
+ | Target | Base size | Projected | License | Headline |
+ |---|---|---|---|---|
+ | **Mixtral 8x22B Instruct v0.1** | 141B | ~70B post-prune → ~22 GB Q4_K_M | Apache-2.0 | First single-GPU Mixtral 8x22B (RTX 5090). A two-years-overdue Pareto win on the textbook MoE candidate nobody has ever expert-pruned. |
+ | **Qwen3-Coder-480B-A35B-Instruct** | 480B | ~150B post-prune → ~50 GB Q4_K_M | Apache-2.0 | First consumer-accessible 480B-class coder. A single Mac M3 Max 64 GB or a 2× consumer-GPU grid. The grid moonshot — same family as qwen3-coder-30b-a3b, so the methodology ports directly. |
+
+ **The hard prerequisite for both:** extending `eval_with_calibration.py` with a LiveCodeBench v6 anchor. HumanEval is no longer reported on frontier model cards — Qwen3-Coder, DeepSeek-V3.1, and Mixtral 8x22B all report SWE-bench / LiveCodeBench / Aider-Polyglot instead. Until LCB v6 is wired up, frontier targets are blocked at the §4.1.4.1 calibration-discipline gate. Roughly 1–2 days of mechanical pipeline work.
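What "wiring up" a new anchor amounts to can be sketched as a registry entry. Everything below (the `Anchor` dataclass, field names, the registry) is a hypothetical illustration of the shape of the work, not the actual `eval_with_calibration.py` interface:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Anchor:
    """One benchmark anchor: where the problems live, the digest that pins
    the exact problem set (ForgeAlloy-style), and how a sample is scored.
    Hypothetical sketch; field names are illustrative."""
    name: str
    jsonl: str                              # per-problem JSONL shipped with the model
    sha256: str                             # digest recorded in the alloy
    score_fn: Callable[[dict, str], bool]   # (problem, completion) -> pass/fail


ANCHORS: dict[str, Anchor] = {}


def register(anchor: Anchor) -> None:
    ANCHORS[anchor.name] = anchor


# Adding LCB v6 would then be one registration plus its scorer; the
# scorer here is a placeholder, not a real LiveCodeBench harness.
register(Anchor(
    name="livecodebench-v6",
    jsonl="lcb_v6.jsonl",
    sha256="<recorded in alloy>",
    score_fn=lambda problem, completion: False,  # placeholder
))
```

The mechanical part is the scorer and the JSONL export; the registry entry itself is a one-liner, which is consistent with the 1–2 day estimate.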
+
+ **Compensation LoRA v2 of qwen3-coder-30b-a3b** (the dense-head §4.1.3.3 closure pattern, applied at the MoE expert level to push 88.4 to a projected 90+) is blocked: transformers' `caching_allocator_warmup` pre-allocates an fp16 buffer equal to the full model size before bnb 4-bit quantization takes effect, exceeding total VRAM on a single 32 GB GPU. The architecturally correct fix, offline teacher-logit precomputation, is the next sentinel-ai sprint after LCB v6.
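The offline fix is conceptually simple: run the unpruned teacher once, cache only the top-k logits per token, and train the compensation LoRA against the cached targets so the teacher never occupies VRAM during training. A sketch of the caching side, with illustrative names and shapes (not the sentinel-ai code):

```python
import numpy as np


def topk_logits(logits: np.ndarray, k: int = 8):
    """Compress one sequence of teacher logits [seq, vocab] to per-token
    top-k (indices, values). For a large vocab this keeps well under 1%
    of the full logit tensor, so the whole corpus can be cached on disk
    and the teacher only runs once, offline."""
    idx = np.argsort(logits, axis=-1)[:, -k:]        # [seq, k] top-k token ids
    val = np.take_along_axis(logits, idx, axis=-1)   # [seq, k] their logits
    return idx.astype(np.int32), val.astype(np.float16)


def distill_targets(idx: np.ndarray, val: np.ndarray, temperature: float = 1.0):
    """Sparse soft targets for the student's distillation loss, rebuilt
    from the cache at training time (no teacher forward pass needed)."""
    z = val.astype(np.float32) / temperature
    z -= z.max(axis=-1, keepdims=True)               # stable softmax over the k slots
    p = np.exp(z)
    return idx, p / p.sum(axis=-1, keepdims=True)
```

Restricting the softmax to the cached k slots is a standard distillation approximation; the trade-off is losing the tail of the teacher distribution in exchange for never co-residing teacher and student in VRAM.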
+
+ ### What we DON'T target
+
+ The Llama 3.3 70B slot is saturated (six publishers, every quant level), so we're not shipping another compacted model into the middle tier. The lab's brand pitch is **models that no individual hardware tier can run, made runnable by structural compaction + grid distribution** — frontier headlines, not catalog filler. That's the intersection only continuum occupies, and it's where the empty room is.
+
 
 ## License
 
 apache-2.0