EnricoFermi committed on
Commit f8e8bd4 · verified · 1 Parent(s): 00c09bc

card: refresh footer with comprehensive forge queue (10 architectures)

Files changed (1)
  1. README.md +39 -13
README.md CHANGED
@@ -116,40 +116,66 @@ The Factory configurator lets you design and forge custom models visually – co
 
  [GitHub](https://github.com/CambrianTech/continuum) · [All Models](https://huggingface.co/continuum-ai) · [Forge-Alloy](https://github.com/CambrianTech/forge-alloy)

-
  ---

  ## More from continuum-ai

- `continuum-ai` ships **structurally compacted models** for hardware tiers nobody else targets. Every artifact is calibration-aware, hardware-anchored, and shipped with [ForgeAlloy](https://github.com/CambrianTech/forge-alloy) cryptographic provenance – the per-problem benchmark JSONLs are uploaded with sha256 hashes recorded in the alloy so anyone can re-score against the same anchor without trusting the producer's claim.

  ### Currently shipped
 
  | Model | Base | HumanEval (vs base) | Tier | What's new |
  |---|---|---|---|---|
  | [**qwen3-coder-30b-a3b-compacted-19b-256k**](https://huggingface.co/continuum-ai/qwen3-coder-30b-a3b-compacted-19b-256k) | Qwen3-Coder-30B-A3B-Instruct | **88.4** (base 92.1, Δ −3.7) | **12 GB Q4_K_M** | First 30B-class coder that fits a 12 GB consumer GPU. Calibration-aware MoE expert pruning (§4.1.3.4). 256K context. |
- | [**qwen2.5-coder-7b-compacted**](https://huggingface.co/continuum-ai/qwen2.5-coder-7b-compacted) | Qwen2.5-Coder-7B | 61.0 (base 62.2, Δ −1.2) | 16 GB fp16 | Methodology validation artifact for §4.1.3.3 (compensation LoRA closes the dense-head pruning gap to within ±3pt of base). |
 
  ### Forge methodology in one paragraph

- A prunable unit's importance MUST be derived from **task-conditioned activation profiling on a held-out corpus** that reflects the artifact's intended workload. Architectural-only metrics (router gate norms, weight norms, magnitudes) are first-pass shortcuts that systematically underperform task-specific activation metrics – empirically validated at two structurally distinct units (dense heads in §4.1.3.1, MoE experts in §4.1.3.4). When the metric is calibration-aware, the surviving subset of heads/experts maps to the workload, and the structural compaction lands close to the unmodified base in held-out benchmarks before any compensation training. When the metric is architectural-only, the surviving subset is task-misaligned and the gap is large enough that compensation LoRA becomes a hard prerequisite. **Get the metric right; the artifact follows.** Full methodology in [PLASTICITY-COMPACTION.md](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION.md).
 
- ### Roadmap

- The structurally-pruned-MoE quadrant of HuggingFace is **empty for every frontier model**. Quantization is everywhere; structural pruning is nowhere. The next two artifacts target the empty room directly.

- | Target | Base size | Projected | License | Headline |
- |---|---|---|---|---|
- | **Mixtral 8x22B Instruct v0.1** | 141B | ~70B post-prune → ~22 GB Q4_K_M | Apache-2.0 | First single-GPU Mixtral 8x22B (RTX 5090). 2-year-overdue Pareto win on the textbook MoE candidate nobody has ever expert-pruned. |
- | **Qwen3-Coder-480B-A35B-Instruct** | 480B | ~150B post-prune → ~50 GB Q4_K_M | Apache-2.0 | First consumer-accessible 480B-class coder. Single Mac M3 Max 64 GB OR a 2× consumer GPU grid. The grid moonshot – same family as qwen3-coder-30b-a3b, methodology ports directly. |
 
- **The hard prerequisite for both:** LiveCodeBench v6 anchor extension to `eval_with_calibration.py`. HumanEval is no longer reported on frontier model cards – Qwen3-Coder, DeepSeek-V3.1, and Mixtral 8x22B all use SWE-bench / LiveCodeBench / Aider-Polyglot. Without LCB v6 wired up, frontier targets are blocked at the §4.1.4.1 calibration discipline gate. ~1-2 days of mechanical pipeline work.

- **Compensation LoRA v2 of qwen3-coder-30b-a3b** (the dense-head §4.1.3.3 closure pattern, now applied at the MoE expert level to push 88.4 → projected 90+) is blocked on transformers' `caching_allocator_warmup` pre-allocating an fp16 buffer equal to full model size before bnb 4-bit takes effect, exceeding total VRAM on a single 32 GB GPU. The architecturally correct fix – offline teacher-logit precomputation – is the next sentinel-ai sprint after LCB v6.
 
  ### What we DON'T target

- The Llama 3.3 70B slot is saturated (six publishers, every quant level). We're not shipping a third compacted MoE in the middle tier. The lab's brand pitch is **models that no individual hardware tier can run, made runnable by structural compaction + grid distribution** – frontier headlines, not catalog filler. That's the intersection only continuum has, and it's where the empty room is.

  ## License
 
 
 
  [GitHub](https://github.com/CambrianTech/continuum) · [All Models](https://huggingface.co/continuum-ai) · [Forge-Alloy](https://github.com/CambrianTech/forge-alloy)

  ---

  ## More from continuum-ai

+ `continuum-ai` ships **structurally compacted models for hardware tiers nobody else targets**. Every artifact is calibration-aware, hardware-anchored, and shipped with [ForgeAlloy](https://github.com/CambrianTech/forge-alloy) cryptographic provenance – the per-problem benchmark JSONLs are uploaded with sha256 hashes recorded in the alloy so anyone can re-score against the same anchor without trusting the producer's claim.
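The provenance claim in the paragraph above boils down to comparing a file's SHA-256 digest with a recorded value. The sketch below shows that check using only the Python standard library. The function names and the idea of passing the recorded digest as a plain hex string are assumptions made for illustration; the actual ForgeAlloy record format is not described in this diff.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, recorded_hex):
    """Compare a local benchmark JSONL against a recorded digest.

    `recorded_hex` stands in for whatever the alloy stores (hypothetical);
    whitespace and letter case are normalised before comparing.
    """
    return sha256_of_file(path) == recorded_hex.strip().lower()
```

A reader who downloads a per-problem JSONL could call `matches_recorded_hash("results.jsonl", recorded)` before re-scoring; a `False` result means the file differs from the one the producer hashed.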
 
  ### Currently shipped

  | Model | Base | HumanEval (vs base) | Tier | What's new |
  |---|---|---|---|---|
  | [**qwen3-coder-30b-a3b-compacted-19b-256k**](https://huggingface.co/continuum-ai/qwen3-coder-30b-a3b-compacted-19b-256k) | Qwen3-Coder-30B-A3B-Instruct | **88.4** (base 92.1, Δ −3.7) | **12 GB Q4_K_M** | First 30B-class coder that fits a 12 GB consumer GPU. Calibration-aware MoE expert pruning (§4.1.3.4). 256K context. |
+ | [**qwen2.5-coder-7b-compacted**](https://huggingface.co/continuum-ai/qwen2.5-coder-7b-compacted) | Qwen2.5-Coder-7B | 61.0 (base 62.2, Δ −1.2) | 16 GB fp16 | Methodology validation artifact for §4.1.3.3 – compensation LoRA closes the dense-head pruning gap to within ±3pt of base. |
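The Tier column can be sanity-checked with a back-of-envelope estimate of weight size: parameters times bits per weight, divided by eight. The sketch below assumes that the "19b" in the model name means roughly 19 billion parameters and that Q4_K_M averages about 4.85 bits per weight. Both figures are assumptions for illustration, not numbers taken from the card, and the estimate covers weights only (no KV cache or runtime overhead).

```python
def quantized_weight_gb(num_params, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumed average for Q4_K_M; the real figure depends on the tensor mix.
Q4_K_M_BITS = 4.85

if __name__ == "__main__":
    # Parameter counts are rough readings of the model names (assumption).
    for name, params in [("19B compacted", 19e9), ("30B base", 30e9)]:
        size = quantized_weight_gb(params, Q4_K_M_BITS)
        print(f"{name}: ~{size:.1f} GB of weights at Q4_K_M")
```

Under these assumptions the compacted model comes out near 11.5 GB of weights and the unpruned base near 18.2 GB, which is in line with the card placing only the compacted artifact in the 12 GB tier.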
 
  ### Forge methodology in one paragraph

+ A prunable unit's importance MUST be derived from **task-conditioned activation profiling on a held-out corpus** that reflects the artifact's intended workload. Architectural-only metrics (router gate norms, weight norms, magnitudes) are first-pass shortcuts that systematically underperform task-specific activation metrics – empirically validated at two structurally distinct units (dense heads in §4.1.3.1, MoE experts in §4.1.3.4) with a +9.7 HumanEval swing on the same prune budget. **Get the metric right; the artifact follows.** Full methodology in [PLASTICITY-COMPACTION.md](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION.md).
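To make "task-conditioned activation profiling" concrete, here is a toy sketch for the MoE case: accumulate the router gate mass each expert receives on a calibration corpus, then keep the highest-scoring experts. This is not the metric defined in PLASTICITY-COMPACTION.md. The scoring rule, the function names, and the input shape (one list of `(expert_id, gate_weight)` pairs per token) are all assumptions chosen to illustrate the idea.

```python
def expert_importance(routing_records, num_experts):
    """Sum the gate weight routed to each expert over a calibration corpus.

    `routing_records` is an iterable with one entry per token; each entry is
    a list of (expert_id, gate_weight) pairs for that token's top-k experts.
    """
    scores = [0.0] * num_experts
    for token_routes in routing_records:
        for expert_id, gate_weight in token_routes:
            scores[expert_id] += gate_weight
    return scores


def select_experts(scores, keep):
    """Return the ids of the `keep` highest-scoring experts, in id order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    # Three tokens routed top-2 across four experts (made-up numbers).
    records = [
        [(0, 0.7), (2, 0.3)],
        [(2, 0.6), (3, 0.4)],
        [(0, 0.5), (2, 0.5)],
    ]
    scores = expert_importance(records, num_experts=4)
    print(select_experts(scores, keep=2))
```

The contrast with an architectural-only metric is that nothing here looks at weight norms: an expert that the calibration workload never routes to scores zero regardless of how large its weights are.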
 
+ ### The empty-quadrant frontier

+ A live HuggingFace audit (April 2026) confirmed that **the entire structurally-pruned-MoE quadrant is empty for every frontier model except Llama 3.3 70B**. Quantization is everywhere; structural pruning is nowhere. The forge methodology validated on `qwen3-coder-30b-a3b` ports directly to every other MoE family. The forge queue below is the comprehensive map of empty quadrants we are claiming, one architecture at a time.

+ ### Forge queue – comprehensive new-architecture coverage
+
+ | # | Target | Arch | License | Total/Active | Tier post-prune | Status |
+ |---|---|---|---|---|---|---|
+ | 1 | [allenai/OLMoE-1B-7B-0924-Instruct](https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct) | `OlmoeForCausalLM` | Apache-2.0 | 7B/1.3B (64e/top-8) | Phone / 4 GB | **Downloading now.** Smallest serious MoE on HF, fully-open (data + checkpoints), zero pruned variants. |
+ | 2 | [ibm-granite/granite-3.1-3b-a800m-instruct](https://huggingface.co/ibm-granite/granite-3.1-3b-a800m-instruct) | `GraniteMoeForCausalLM` | Apache-2.0 | 3.3B/800M (40e/top-8) | Edge tier | **Downloading now.** IBM enterprise brand, ultra-rare tiny-MoE niche, zero pruned variants. |
+ | 3 | [deepseek-ai/DeepSeek-V2-Lite-Chat](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite-Chat) | `DeepseekV2ForCausalLM` | DeepSeek (commercial OK) | 15.7B/2.4B | Single GPU | **Downloading now.** The forgotten DeepSeek sibling – DeepSeek brand without 670 GB of VRAM. |
+ | 4 | [microsoft/Phi-3.5-MoE-instruct](https://huggingface.co/microsoft/Phi-3.5-MoE-instruct) | `PhiMoEForCausalLM` | **MIT** | 42B/6.6B (16e/top-2) | Single 5090 Q4 | Queued. MIT-licensed Microsoft MoE that nobody runs because 42B is the awkward middle tier – until you prune to 12 experts. |
+ | 5 | [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1) | `MixtralForCausalLM` | Apache-2.0 | 141B/39B (8e/top-2) | Single 5090 Q4 | Queued. Two-year overdue Pareto win – the textbook MoE that nobody has ever calibration-pruned. |
+ | 6 | [Qwen/Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) | `Qwen3MoeForCausalLM` | Apache-2.0 | 235B/22B (128e/top-8) | Single 5090 Q4 | Queued. Same family as our shipped 30B-A3B → methodology ports trivially. |
+ | 7 | [Qwen/Qwen3-Coder-480B-A35B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct) | `Qwen3MoeForCausalLM` | Apache-2.0 | 480B/35B (160e/top-8) | **Grid moonshot** (4×24GB) | Queued. First consumer-accessible 480B coder. |
+ | 8 | [deepseek-ai/DeepSeek-Coder-V2-Instruct](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct) | `DeepseekV2ForCausalLM` | DeepSeek | 236B/21B | Grid | Queued. Direct methodology replay at higher tier. |
+ | 9 | [Snowflake/snowflake-arctic-instruct](https://huggingface.co/Snowflake/snowflake-arctic-instruct) | `ArcticForCausalLM` | Apache-2.0 | 480B/17B (128e/top-2) | Grid | Queued. The forgotten Apache frontier MoE – dense+sparse hybrid arch is a novel research contribution by itself. |
+ | 10 | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | `DeepseekV3ForCausalLM` | **MIT** | 671B/37B | **Grid moonshot** | Queued. The viral king. First non-distill R1 compaction. |
+
+ **8 distinct architecture classes** covered across **5 hardware tiers** (edge → phone → single GPU → 5090 → grid). When the queue completes, the calibration-aware-importance metric has been validated on `Qwen3MoeForCausalLM`, `OlmoeForCausalLM`, `GraniteMoeForCausalLM`, `DeepseekV2ForCausalLM`, `PhiMoEForCausalLM`, `MixtralForCausalLM`, `ArcticForCausalLM`, and `DeepseekV3ForCausalLM` – the cross-family invariance claim becomes empirical, not theoretical.
 
+ ### Hard prerequisites being built in parallel

+ - **LiveCodeBench v6 anchor extension** for `eval_with_calibration.py` – HumanEval is no longer reported on frontier model cards (Qwen3-Coder, DeepSeek-V3.1, Mixtral 8x22B all use SWE-bench / LiveCodeBench / Aider-Polyglot). Without LCB v6 wired up, frontier targets are blocked at the §4.1.4.1 calibration discipline gate. ~1-2 days of mechanical pipeline work.
+ - **Offline teacher-logit precomputation** for `compensation_lora.py` – at 30B+ class, transformers' `caching_allocator_warmup` pre-allocates an fp16 buffer equal to full model size before bnb 4-bit takes effect, exceeding total VRAM on a single 32 GB GPU. The architecturally correct fix is phase-1-load-teacher / phase-2-unload / phase-3-load-student-and-train-against-on-disk-logits. Prerequisite for compensation v2 of every artifact ≥30B.
+ - **Grid expert sharding** for the 480B+ moonshots – `cpu_expert_prune_v2.py`'s streaming pruner already handles shards bigger than any single GPU, but distributed inference + cross-machine activation profiling for the calibration-aware metric needs the grid layer. This is the §4.1.3.5 distributed forge methodology paper section.
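The three-phase pattern named in the teacher-logit bullet above (load teacher and write logits, unload, then train the student against the on-disk logits) can be sketched without any ML framework. The version below is a toy under loud assumptions: models are plain Python callables returning one logit vector per batch, logits are stored as JSON lines, and the "training" step only computes a KL distillation loss. None of the names come from `compensation_lora.py`.

```python
import json
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def precompute_teacher_logits(teacher_fn, batches, path):
    """Phase 1: run the teacher once and stream its logits to disk."""
    with open(path, "w", encoding="utf-8") as fh:
        for batch in batches:
            fh.write(json.dumps(teacher_fn(batch)) + "\n")


def iter_saved_logits(path):
    """Yield the stored teacher logits one batch at a time."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)


def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the two softmax distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_loss(student_fn, batches, path, temperature=1.0):
    """Phase 3: mean KL of the student against on-disk teacher logits.

    The teacher is not needed here; only its saved outputs are read.
    """
    losses = [
        kl_divergence(saved, student_fn(batch), temperature)
        for batch, saved in zip(batches, iter_saved_logits(path))
    ]
    return sum(losses) / len(losses)
```

Phase 2 is whatever frees the teacher between the two calls, for example dropping the last reference to it. The point of the design is that the teacher and the student never have to be resident at the same time, at the cost of disk space for the stored logits.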
+
+ ### Sensory bridge stack (separate from the LLM forge queue)
+
+ For Continuum's own sensory architecture (vision/audio/embedding bridges), the right targets are not forge candidates – they're curated bridge components used as-is:
+
+ | Component | Model | Use |
+ |---|---|---|
+ | Vision encoder | [`google/siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384) | Image embeddings for the vision bridge |
+ | Vision describer | [`microsoft/Phi-3.5-vision-instruct`](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) | Small VLM that generates text descriptions consumed by text-only LLMs |
+ | STT | [`openai/whisper-large-v3`](https://huggingface.co/openai/whisper-large-v3) | Speech transcription for audio bridge |
+ | Multilingual embedding | [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3) | Sensory cache embeddings |
+ | Avatar diffusion | [`black-forest-labs/FLUX.1-schnell`](https://huggingface.co/black-forest-labs/FLUX.1-schnell) | Apache-licensed avatar generation for Continuum universes |
 
  ### What we DON'T target

+ The Llama 3.3 70B slot is saturated (six publishers, every quant level). We're not shipping a third compacted MoE in the middle tier. The lab's brand pitch is **models that no individual hardware tier can run, made runnable by structural compaction + grid distribution** – empty-quadrant headlines, not catalog filler. That's the intersection only continuum has, and the forge queue above is the map.
+

  ## License