# Outlier
Ternary Mixture-of-Experts language models on consumer hardware.
Outlier is an open-source ternary-quantized Mixture-of-Experts runtime and a small family of models trained by one founder on a single-developer pipeline. Routed experts are stored in {-1, 0, +1} ternary at ~1.6 bits per weight. A frozen full-precision Qwen2.5 base acts as the shared expert — our models are overlays that attach to an unmodified base, not standalone checkpoints. Top-2 routing per MoE layer. Apache 2.0.
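The ~1.6 bits-per-weight figure falls out of base-3 packing: five ternary values fit in one byte (3^5 = 243 ≤ 256), i.e. 8/5 = 1.6 bits each. Outlier's actual on-disk format isn't specified on this card, so the following is only an illustrative sketch of one such packing scheme, not the project's loader:

```python
import numpy as np

def pack_ternary(w: np.ndarray) -> tuple[np.ndarray, int]:
    """Pack an int8 array of {-1, 0, +1} into bytes, 5 trits per byte."""
    t = w.astype(np.int64) + 1               # map {-1, 0, +1} -> {0, 1, 2}
    pad = (-t.size) % 5                      # pad to a multiple of 5 trits
    t = np.concatenate([t, np.zeros(pad, dtype=np.int64)]).reshape(-1, 5)
    powers = 3 ** np.arange(5)               # base-3 digit weights
    return (t @ powers).astype(np.uint8), w.size

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_ternary: recover the first n ternary weights."""
    digits = packed.astype(np.int64)[:, None] // (3 ** np.arange(5)) % 3
    return (digits.reshape(-1)[:n] - 1).astype(np.int8)
```

Five trits per byte gives exactly 8/5 = 1.6 bits per stored weight; real runtimes typically add per-group scale factors on top of the raw trits.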
Built solo in 14 days on a Mac Studio M1 Ultra plus spot B200 GPUs. Total compute spend under $1,000. Three U.S. provisional patents filed.
## Honest benchmark status
Every number here has a sample size, a stderr, a harness version, and a status. When a number is preliminary, we say so. When we can't reproduce it from a saved source file, we say that too.
Harness: lm-evaluation-harness v0.4.9.1 (150B, 70B V3.3) · v0.4.11 (40B V3.2). MMLU 5-shot, bfloat16, full sample n = 14,042 unless noted.
| Model | MMLU (acc) | Stderr | n | Harness | Status |
|---|---|---|---|---|---|
| Outlier-150B V3.2 | 84.46% | 0.29% | 14,042 | 0.4.9.1 | Verified — day 13 re-measurement |
| Outlier-70B V3.3 (alpha-fixed) | 83.10% | 0.30% | 14,042 | 0.4.9.1 | Verified — day 13 full run |
| Outlier-40B V3.2 | 77.80% | 0.33% | 14,042 | 0.4.11 | Verified — day 12 full run |
| Outlier-10B V3.2 | 76.19% | — | limited | 0.4.11 | Unverified — source file is a smoke test, re-running |
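As a rough sanity check on the stderr column, the binomial approximation sqrt(p(1 − p)/n) can be computed for any row. Note that the harness aggregates stderr across MMLU's 57 subtasks, so its reported value differs slightly from this back-of-envelope figure:

```python
import math

def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of a proportion under the binomial approximation."""
    return math.sqrt(acc * (1.0 - acc) / n)

# 150B row: 84.46% accuracy over 14,042 samples
se = binomial_stderr(0.8446, 14042)
print(f"{se:.4f}")  # prints 0.0031 — same ballpark as the reported 0.29%
```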
Two caveats we're documenting publicly:
1. **10B is not defended.** Its day 12 source file used `--limit 570` per subtask and does not meet our standard for a headline number. A full-sample re-run is scheduled for the current cluster sprint. The 76.19% stays in the table as a placeholder, clearly flagged.
2. **Harness version drift.** The 150B re-measurement landed 1.30 pp higher on v0.4.9.1 than an earlier v0.4.11 run on the same weights. We don't yet know whether the drift is systematic. We're locking v0.4.9.1 as our harness going forward and documenting both numbers in our ground-truth file so reviewers can reproduce either.
For context: Llama 3.1 70B lands around 83.1% MMLU on full sample. Outlier-70B V3.3 alpha-fixed is in that neighborhood, on a model family trained solo on consumer hardware plus spot GPUs for under $1,000 total compute spend.
## What's running right now
As of Day 14 (April 14, 2026), a 2×B200 cluster sprint is executing in parallel:
- Alpha-fix reruns on 10B, 40B, and 150B (the technique that gave 70B V3.3 its +1.61pp recovery)
- Full secondary benchmarks on all scales: GSM8K, HellaSwag, ARC-C, ARC-E, Winogrande, TruthfulQA, HumanEval, MMLU-Pro
- Speed experiments: EAGLE3 speculative decoding, SWIFT self-speculation, dead-expert pruning, FP4 Blackwell inference, verified paged-runtime tok/s benchmarks
- Long context: LongRoPE swap targeting 256K on 70B and 150B (replacing the current YaRN 4x config that reaches 128K)
- Safety tier: Llama Guard 3 wrapper integration, red team evaluation, DPO safety fine-tune on 70B Instruct
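The dead-expert pruning experiment above presupposes per-expert usage counts under top-2 routing. A minimal sketch of how such counts could be gathered (function names and the threshold are hypothetical, not Outlier's implementation):

```python
import numpy as np

def top2_route(logits: np.ndarray) -> np.ndarray:
    """Indices of the top-2 experts per token.

    logits: (n_tokens, n_experts) router scores.
    """
    return np.argsort(logits, axis=-1)[:, -2:]

def dead_experts(logits: np.ndarray, min_share: float = 0.001) -> np.ndarray:
    """Experts whose share of routed slots falls below min_share."""
    n_tokens, n_experts = logits.shape
    picks = top2_route(logits)
    counts = np.bincount(picks.reshape(-1), minlength=n_experts)
    share = counts / (2 * n_tokens)      # two routing slots per token
    return np.flatnonzero(share < min_share)
```

Pruning would then drop those experts' ternary blocks from the checkpoint; whether the threshold is a share, an absolute count, or a gradient statistic is a design choice the sprint presumably explores.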
No verified-before-published numbers appear on this page. New numbers land in the changelog when their source files land on disk.
## What actually exists today
- Open-source engine at github.com/Outlier-host/outlier — Apache 2.0. Ternary MoE loader, three-tier paged cache, MPS + CPU backends, lm-eval compatible, alpha overlay loader for post-training recovery.
- First-party inference verified on Apple Silicon. 10B in paged mode on a 64 GB Mac Studio M1 Ultra. Non-paged 10B runs at ~13.5 tok/s on the same hardware. Full 70B V3.3 paged-runtime tok/s is being benched in the current sprint.
- GPU-resident expert dequantization. A patched modeling file materializes ternary experts to bf16 at load time, ~56× speedup vs the original CPU→GPU path on a single B200. Port to 70B and 40B in progress.
- Alpha-fix technique. 280 per-expert scalar gates trained in 18 minutes on one B200 recovered +1.61pp MMLU on 70B where a 68M-parameter LoRA fine-tune regressed. Overlay file is 15 KB — 250,000× fewer trainable parameters than the LoRA approach it outperformed. We believe this is a novel post-training recovery primitive for quantized MoE; a prior art search is running.
- Three U.S. provisional patents filed: #64/026,886 (April 3) · #64/030,368 (April 6) · #64/034,028 (April 9). A fourth covering the alpha-fix technique is under novelty review.
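The alpha-fix overlay described above amounts to one learned scalar per routed expert, multiplied into that expert's output. Its training objective and file format aren't published yet, so this is only an illustrative sketch of applying such an overlay at inference time (all names hypothetical):

```python
import numpy as np

def apply_alpha_overlay(expert_out: np.ndarray,
                        expert_ids: np.ndarray,
                        alpha: np.ndarray) -> np.ndarray:
    """Scale each routed expert's output by its learned scalar gate.

    expert_out: (n_tokens, k, d_model) outputs of the k routed experts
    expert_ids: (n_tokens, k) which expert filled each routing slot
    alpha:      (n_experts,) per-expert scalar gates (the tiny overlay)
    """
    return expert_out * alpha[expert_ids][..., None]
```

280 fp32 scalars is about 1.1 KB of raw gate data, so the 15 KB overlay file presumably carries metadata alongside; either way it is vastly smaller than a 68M-parameter LoRA.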
## What we're not claiming
- We do not match Kimi K2.5, GLM-5, Claude Opus 4.6, Gemini 3 Pro, or GPT-5 on pure MMLU.
- We are not the first ternary MoE. Microsoft + Apple's MoTE (arXiv:2506.14435, June 2025) published a shared-FP + ternary-expert architecture for vision-language models. Our contribution is the text-LLM variant, the overlay-on-frozen-base deployment artifact, and the alpha-fix recovery primitive.
- We are not shipping models trained on trillions of tokens. Our distillation pipeline uses DeepSeek V3 as teacher and touches a fraction of the tokens a Llama-class pretrain does. The comparison we care about is quality per dollar of training, not parameter count or token count.
- We are not claiming production-ready inference on sub-32GB devices yet. The 10B story is "runs on a Mac Studio." The 70B story is "runs on a Mac Studio, better on a Mac Studio Ultra." The 150B story is "cloud or high-end workstation only."
## Model naming transition (V3.3 launch)
Our current 10B / 40B / 70B / 150B labels count routed-expert parameters and understate the real model sizes. For V3.3 we are moving to the industry-standard `{total}B-A{active}B` convention (total parameters first, then parameters active per token) that DeepSeek, Mixtral, and Llama 4 use:
| Current name | V3.3 name | Total params | Active params |
|---|---|---|---|
| Outlier-10B | Outlier-13B-A7B | 13B | 7B |
| Outlier-40B | Outlier-30B-A14B | 30B | 14B |
| Outlier-70B | Outlier-68B-A32B | 68B | 32B |
| Outlier-150B | Outlier-150B-A70B | 150B | 70B |
V3.2 repos under the old names remain available, marked as superseded, pointing to the renamed V3.3 repos.
## Two-variant launch
At public launch, each scale ships as two distinct SKUs:
- `-Base` — research weights, no safety training, clearly labeled "research use only." Apache 2.0. This is what the founder uses.
- `-Instruct` — same weights plus a Llama Guard 3 wrapper and (on 68B-A32B) a DPO safety LoRA adapter. Apache 2.0, with the safety layer distributed under its own compatible license. This is the default download and the one the Pro desktop app ships with.
Both variants publish identical benchmark numbers. The difference is the safety layer and the stated intended use.
## Status
Pre-launch. V3.3 public release targets late Day 17 or Day 18 (approximately April 17–18, 2026), immediately after the current cluster sprint closes and the Pro desktop app is signed and notarized. Engine is already public.
## Links
- Website: outlier.host
- Engine: github.com/Outlier-host/outlier
- Paper: `outlier_ternary_moe_2026.pdf` (v6 lands with the launch)
- Responsible use reports: abuse@outlier.host
- Contact: matt@outlier.host
- Built by: Matt Kerr · Kerr & Company LLC · Grand Rapids, MI
## Changelog
- April 14, 2026: Day 14 cluster sprint running — alpha-fix reruns on 10B/40B/150B, speed experiments (EAGLE3, SWIFT, dead-expert pruning, FP4), LongRoPE 256K context, Llama Guard 3 + DPO safety tier. Org card updated with verified Day 13 numbers and V3.3 naming transition.
- April 13, 2026: Day 13 cluster sprint closed with five wins. 70B V3.3 alpha-fixed at 83.10% MMLU (n=14,042, stderr 0.30%). 150B V3.2 re-measured at 84.46% MMLU under lm_eval 0.4.9.1 (supersedes the earlier 83.16% value). YaRN 4x config validated for 128K context on 70B and 150B. V4 HESTIA + LoRA approach killed after regression, fully recovered via 280-scalar alpha-fix on a 15 KB overlay.
- April 11, 2026: Removed an unverified four-row MMLU table that relied on a decommissioned cluster's unsaved source files. The forensic cleanup led directly to the Day 9 provenance rules that now govern every number on this card.