# Outlier-10B-V2
> ⚠️ **Legacy model** — superseded by Outlier-10B-V3.2. This is an early Outlier release, kept publicly available for historical reference and reproducibility. New users should use the V3.2 model instead.
## What this was
Outlier-10B-V2 was an early Outlier release built on Qwen2.5-7B-Instruct, with 7.6B total parameters (the legacy V2 dense configuration). It has been superseded by the V3.2 architecture, which improves both training methodology and runtime efficiency.
## What's new in V3.2
- Zero-delta expert initialization (faster convergence)
- CAKLD distillation training
- Three-tier paged runtime
- Cross-layer expert prefetch
- Alpha-only TTT for personalization
See Outlier-10B-V3.2 for the latest.
## Architecture
Outlier uses a shared expert + ternary delta expert architecture:
- Shared expert: The full base model serves as a shared dense expert
- Ternary delta experts: Additional experts stored at 1.58 bits/weight using ternary quantization ({-1, 0, +1})
- Dense-Sparse-Dense (DSD) layer pattern: Alternating dense and sparse layers for efficient compute
- Zero-delta initialization: Experts initialized to zero so training begins from the base model
- Top-2 routing: Each token activates the shared expert plus the top-2 ternary delta experts
- Three-tier paged runtime: GPU → CPU → disk paging for consumer hardware deployment
- Cross-layer expert prefetch: Prefetches next-layer experts during current-layer compute
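The shared-expert + ternary-delta design above can be sketched in a few lines. This is a hypothetical NumPy illustration, not the actual Outlier code: the layer sizes, ReLU activations, router, and `ternarize` threshold are all assumptions. The key property it demonstrates is zero-delta initialization: because the delta experts start at all-zero ternary weights, the layer's output is exactly the shared (base-model) expert's output, so training begins from the base model.

```python
import numpy as np

def ternarize(w, thresh=0.5):
    """Quantize real weights to {-1, 0, +1} (~1.58 bits/weight)."""
    t = np.zeros_like(w)
    t[w > thresh] = 1.0
    t[w < -thresh] = -1.0
    return t

class TernaryDeltaExpert:
    """Delta expert with ternary weights times a learned scale (sketch)."""
    def __init__(self, dim, hidden):
        # Zero-delta initialization: all-zero weights mean this expert
        # contributes nothing at first, so the model starts as the base model.
        self.w_in = np.zeros((hidden, dim))
        self.w_out = np.zeros((dim, hidden))
        self.scale = 1.0

    def __call__(self, x):
        h = np.maximum(ternarize(self.w_in) @ x, 0.0)
        return self.scale * (ternarize(self.w_out) @ h)

class SharedPlusDeltaLayer:
    """Shared dense expert plus top-2 ternary delta experts per token."""
    def __init__(self, dim, hidden, n_experts=8, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        # Shared expert: stands in for the full-precision base-model FFN.
        self.w1 = rng.standard_normal((hidden, dim)) / np.sqrt(dim)
        self.w2 = rng.standard_normal((dim, hidden)) / np.sqrt(hidden)
        self.router = rng.standard_normal((n_experts, dim)) / np.sqrt(dim)
        self.experts = [TernaryDeltaExpert(dim, hidden) for _ in range(n_experts)]
        self.top_k = top_k

    def __call__(self, x):
        shared = self.w2 @ np.maximum(self.w1 @ x, 0.0)  # always active
        logits = self.router @ x
        top = np.argsort(logits)[-self.top_k:]           # top-2 routing
        gates = np.exp(logits[top] - logits[top].max())
        gates /= gates.sum()
        return shared + sum(g * self.experts[i](x) for g, i in zip(gates, top))
```

With zero-delta initialization, the routed deltas vanish and the layer reduces to the shared expert, which is what makes training converge from the base model rather than from scratch.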
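The three-tier paging and cross-layer prefetch can likewise be sketched with two LRU caches in front of a disk store. This is a minimal illustration under assumed capacities and a plain-dict "disk"; the real runtime's eviction policy, async I/O, and overlap of prefetch with compute are not modeled here.

```python
from collections import OrderedDict

class PagedExpertStore:
    """Three-tier paging sketch: GPU (tier 1) -> CPU (tier 2) -> disk (tier 3).
    Capacities and the disk store are illustrative assumptions."""
    def __init__(self, disk, gpu_slots=4, cpu_slots=8):
        self.disk = disk                 # tier 3: holds every expert
        self.cpu = OrderedDict()         # tier 2: RAM cache
        self.gpu = OrderedDict()         # tier 1: VRAM cache
        self.gpu_slots, self.cpu_slots = gpu_slots, cpu_slots

    def _evict(self, cache, slots, lower):
        while len(cache) > slots:
            k, v = cache.popitem(last=False)  # evict LRU entry
            if lower is not None:
                lower[k] = v                  # demote to the lower tier

    def fetch(self, key):
        """Promote an expert to the GPU tier, faulting through CPU/disk."""
        if key in self.gpu:
            self.gpu.move_to_end(key)         # refresh LRU position
            return self.gpu[key]
        v = self.cpu.pop(key, None)
        if v is None:
            v = self.disk[key]                # slowest path: disk read
        self.gpu[key] = v
        self._evict(self.gpu, self.gpu_slots, self.cpu)
        self._evict(self.cpu, self.cpu_slots, None)
        return v

    def prefetch(self, keys):
        """Cross-layer prefetch: pull the next layer's routed experts into
        VRAM while (in the real runtime) the current layer is computing."""
        for k in keys:
            self.fetch(k)

# Usage sketch: while computing layer l, prefetch layer l+1's experts.
disk = {(l, e): f"expert_{l}_{e}" for l in range(3) for e in range(8)}
store = PagedExpertStore(disk)
store.prefetch([(1, 2), (1, 5)])  # next layer's top-2 experts
```

In the real runtime the prefetch would be asynchronous, so the disk and PCIe transfers hide behind the current layer's compute instead of blocking it.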
## License
Apache 2.0. The base model (Qwen2.5-7B-Instruct) was created by Alibaba Cloud and is used under its original license terms.
## Built by
Matt Kerr · Kerr & Company LLC · Grand Rapids, MI