# Outlier-150B-V3.2
Status: production. MMLU re-verified Day 13 at 84.46%. Flagship Outlier release.
The largest Outlier model. Built on Qwen/Qwen2.5-72B-Instruct with the ReXMoE architecture — cross-layer expert sharing, where experts are shared across groups of layers via PSR (per-scale-residual) variants. 88 unique experts shared via 44 routers across 11 groups × 4 PSR variants.
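The expert arithmetic above implies a simple indexing scheme. A minimal sketch, assuming MoE layers are split evenly into the 11 groups and each group owns 8 of the 88 unique experts plus one router per PSR variant (11 × 4 = 44 routers); the function names and the layer count used below are illustrative, not from the release:

```python
# Illustrative sketch of cross-layer expert sharing, NOT the actual implementation.
NUM_GROUPS = 11
EXPERTS_PER_GROUP = 8   # 11 * 8 = 88 unique experts
PSR_VARIANTS = 4        # 11 * 4 = 44 routers

def expert_pool_for_layer(layer_idx: int, num_moe_layers: int) -> range:
    """Indices of the shared expert pool a given MoE layer routes into."""
    layers_per_group = num_moe_layers // NUM_GROUPS
    group = min(layer_idx // layers_per_group, NUM_GROUPS - 1)
    start = group * EXPERTS_PER_GROUP
    return range(start, start + EXPERTS_PER_GROUP)

def router_index(layer_idx: int, num_moe_layers: int) -> int:
    """Each group has one router per PSR variant; layers cycle through variants."""
    layers_per_group = num_moe_layers // NUM_GROUPS
    group = min(layer_idx // layers_per_group, NUM_GROUPS - 1)
    variant = layer_idx % PSR_VARIANTS
    return group * PSR_VARIANTS + variant
```

The point of the sketch is the sharing: consecutive layers in a group reuse the same 8 experts, so parameter count grows with groups, not with layers.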
## Model summary
| Field | Value |
|---|---|
| Base model | Qwen/Qwen2.5-72B-Instruct |
| Architecture | Outlier ReXMoE (cross-layer expert sharing) |
| Parameters | ~150B effective |
| Context length | 32,768 tokens |
| MoE layers | (see config.json) |
| Unique experts | 88 (shared across 44 routers) |
| Expert groups | 11 |
| PSR variants | 4 |
| Expert quantization | Ternary (int8 + per-row fp16 scale) |
| MMLU (full sample, Day 13) | 84.46% ± 0.29% |
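The expert-quantization row can be made concrete. A minimal sketch of ternary quantization with a per-row fp16 scale, stored in int8; the threshold rule and the choice of mean-absolute-value as the scale statistic are illustrative assumptions, not the release's actual recipe:

```python
import numpy as np

def ternary_quantize(w: np.ndarray, thresh: float = 0.05):
    """Quantize a weight matrix to {-1, 0, +1} stored as int8,
    with one fp16 scale per row. Illustrative sketch only."""
    scale = np.abs(w).mean(axis=1, keepdims=True).astype(np.float16)
    q = np.zeros(w.shape, dtype=np.int8)
    q[w > thresh * scale] = 1
    q[w < -thresh * scale] = -1
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an fp32 approximation: each row is scale * {-1, 0, +1}."""
    return q.astype(np.float32) * scale.astype(np.float32)
```

Storage per expert weight is thus roughly 1 byte plus 2 bytes per row, versus 2 bytes per weight at bf16.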
## Provenance
| Metric | Value |
|---|---|
| MMLU | 84.46% ± 0.29% |
| Sample size (n) | 14,042 |
| Stderr | ±0.0029 |
| Harness | lm_eval 0.4.9.1 |
| Date measured | 2026-04-14 (Day 13 cluster sprint) |
| Hardware | 2× NVIDIA B200 SXM6 |
| Source file | phase8_upgraded_150b_full.json |
| Source SHA256 | 5db066e5574e6bc1e3f1dec452098aa6d1be44333e7ea32f9561288babb3b228 |
Full provenance in OUTLIER_GROUND_TRUTH_v10.md §2.4.
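The SHA256 lets you confirm you are reading the same results file before quoting the number. A generic verification helper (the filename and digest are the ones from the table; run it wherever the artifact lives):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# expected = "5db066e5574e6bc1e3f1dec452098aa6d1be44333e7ea32f9561288babb3b228"
# assert sha256_of("phase8_upgraded_150b_full.json") == expected
```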
## Day 12 → Day 13 measurement drift (documented, unresolved)
| Day | Harness | MMLU |
|---|---|---|
| Day 12 | lm_eval 0.4.11 | 83.16% ± 0.31% |
| Day 13 | lm_eval 0.4.9.1 | 84.46% ± 0.29% |
| Drift | — | +1.30pp |
Day 13 is accepted as canonical. The 1.30pp drift between harness versions has not been root-caused — it could be an lm_eval version difference (known small differences in MMLU prompt formatting across versions), a transformers point-release difference, or a Day 12 measurement artifact. We did not rerun Day 12's exact pipeline to pin down the cause. See v10 §2.4 for the methodology notes.
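A quick sanity check on how far outside sampling noise the drift sits, combining the two stderrs. This treats the runs as independent; since both use the same 14,042 questions the sampling errors are correlated, so the check is conservative:

```python
import math

# Day 12 -> Day 13 drift vs. combined standard error of the two runs.
d = 0.8446 - 0.8316                    # +1.30pp
se = math.sqrt(0.0029**2 + 0.0031**2)  # combined stderr, independence assumed
z = d / se                             # roughly 3 standard errors
```

At roughly 3 combined standard errors, the drift is unlikely to be sampling noise alone, which is consistent with a systematic harness or library difference as suggested above.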
## V3.3 status
A 150B alpha-fix overlay has not yet been trained. The recipe that worked for 70B (280 trained alpha scalars, ~30 minutes of cloud compute) is expected to transfer; it is slated for a future sprint. Until then, the production 150B is V3.2 with the Day 13 number above.
A 128K-context variant (150B + YaRN 4×) is pending separate release — the YaRN config patch is verified safe on 70B; the same patch should work on 150B identically but has not been published as a separate repo yet.
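For reference, a 4× YaRN patch of the sort described is a `rope_scaling` block in `config.json`. A sketch using the Hugging Face field names for Qwen2-family models; the exact keys and values for the unpublished 150B repo are assumptions extrapolated from the 70B patch (older transformers versions use `"type"` instead of `"rope_type"`):

```python
# Hypothetical config.json fragment for a 150B + YaRN 4x variant.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
max_position_embeddings = 131072  # 32768 * 4 = 128K
```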
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Outlier-Ai/Outlier-150B-V3.2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Outlier-Ai/Outlier-150B-V3.2",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",  # 150B requires multi-GPU; ~280 GB bf16
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0]))
```
Hardware notes: 150B at bf16 needs ~280 GB of VRAM. We've successfully run it on 2× NVIDIA B200 (180 GB each, 360 GB total). Single-GPU bf16 inference is not currently feasible without further quantization.
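The ~280 GB figure is straightforward arithmetic, assuming 2 bytes per parameter at bf16 and ignoring activation and KV-cache overhead:

```python
# Back-of-envelope VRAM estimate for 150B parameters at bf16.
params = 150e9
total_gib = params * 2 / 2**30   # ~279 GiB
per_gpu_gib = total_gib / 2      # ~140 GiB per GPU when split across 2x B200
```

Splitting across two GPUs leaves headroom on each 180 GB B200 for activations and KV cache, which is why the 2× B200 setup works and a single GPU does not.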
## Limitations
- Secondary benchmarks (HellaSwag, ARC, TruthfulQA, WinoGrande) are [UNVERIFIED]. The Day 13 sprint was trimmed to MMLU-only due to throughput constraints.
- 128K context via YaRN 4× is verified on 70B but not yet packaged as a separate 150B release.
- The V3.3 alpha-fix overlay has not been trained for 150B yet.
## License
Apache 2.0