# Outlier-70B-V3.3
70B ternary MoE, 83.10% MMLU, 128K context, 15 KB upgrade over V3.2.
Status: production. Replaces V3.2 as the recommended 70B variant.
## TL;DR
V3.3 is V3.2's weights + a 15 KB alpha overlay + a YaRN config patch. Same dense weights as V3.2; the upgrade lives entirely in `alpha_overlay.pt` (280 trained scalars) and a 3-line `rope_scaling` block in `config.json`.
| Metric | V3.2 | V3.3 | Δ |
|---|---|---|---|
| MMLU | 81.49% | 83.10% ± 0.30% | +1.61pp |
| Context length | 32,768 | 131,072 | +4× |
| Files added | — | alpha_overlay.pt (15 KB), patched config.json |
## How V3.3 differs from V3.2
V3.2 is a Qwen2.5-32B base + ternary MoE delta experts. V3.3 keeps all V3.2 weights byte-for-byte identical and adds two things:
1. **`alpha_overlay.pt`** — a torch-pickled dict containing 280 trained alpha scalars (35 MoE layers × 8 experts each). At load time, the runtime reads this file and replaces the per-expert `alpha_values` buffers in the model with these trained values. The original V3.2 alpha values are preserved at `originals` inside the file in case you want to revert.
2. **`config.json` `rope_scaling` block** — adds `{"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768}` and bumps `max_position_embeddings` from 32768 → 131072. This is a config-only change; the underlying RoPE base frequency is unchanged.
That's it. No new weights. No retraining. No new model architecture.
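The overlay-application step described above can be sketched in a few lines. This is a minimal illustration, assuming the overlay dict maps buffer names to tensors; the real file layout and loader may differ:

```python
import torch
import torch.nn as nn


def apply_alpha_overlay(model: nn.Module, overlay_path: str) -> nn.Module:
    """Replace per-expert alpha_values buffers with trained scalars.

    Sketch only: assumes the overlay is a dict keyed by buffer name.
    """
    overlay = torch.load(overlay_path, map_location="cpu")
    for name, buf in model.named_buffers():
        # Swap in the trained value for every per-expert alpha buffer.
        if name.endswith("alpha_values") and name in overlay:
            buf.copy_(overlay[name].to(buf.dtype))
    return model
```

Because only buffers are touched, the operation is in-place and reversible as long as the original values are kept somewhere (which the shipped file does under `originals`).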
## Loading
The Outlier runtime ships `outlier.runtime.alpha_loader.load_alpha_overlay()`, which finds the right overlay in priority order:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from outlier.runtime.alpha_loader import load_alpha_overlay

tokenizer = AutoTokenizer.from_pretrained("Outlier-Ai/Outlier-70B-V3.3", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Outlier-Ai/Outlier-70B-V3.3",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)

# Apply the V3.3 alpha overlay
load_alpha_overlay(model, model_tag="70b_v3.3")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0]))
```
If you don't have the Outlier runtime installed, use the standalone snippet shipped in this repo (`alpha_loader_snippet.py`):
```python
from alpha_loader_snippet import apply_v3_3_overlay

apply_v3_3_overlay(model, "alpha_overlay.pt")  # same effect, no Outlier dependency
```
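Since the overlay file preserves the V3.2 values under `originals`, reverting is symmetric. A hedged sketch, assuming `originals` is itself a dict of buffer names to tensors (the exact layout inside the shipped file may differ):

```python
import torch
import torch.nn as nn


def revert_alpha_overlay(model: nn.Module, overlay_path: str) -> nn.Module:
    """Restore the pre-overlay V3.2 alpha values kept in the overlay file."""
    originals = torch.load(overlay_path, map_location="cpu")["originals"]
    for name, buf in model.named_buffers():
        if name.endswith("alpha_values") and name in originals:
            buf.copy_(originals[name].to(buf.dtype))
    return model
```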
## Provenance
| Metric | Value |
|---|---|
| MMLU | 83.10% ± 0.30% |
| Sample size (n) | 14,042 |
| Stderr | ±0.0030 |
| Harness | lm_eval 0.4.9.1 |
| Date measured | 2026-04-14 (Day 13 cluster sprint) |
| Hardware | 2× NVIDIA B200 SXM6 |
| Source file | `verify_70b_alpha_fixed_mmlu.json` |
| Source SHA256 | `6c6ea3f89426a95dd4d11883b79718f2ddcf632f8793e10b9fb665c728386398` |

Full provenance chain in `OUTLIER_GROUND_TRUTH_v10.md` §2.3a.
## Combined run (alpha + YaRN 4x context)
The MMLU number above was measured with the alpha overlay only at 32K context. We separately verified that YaRN 4x is lossless on short-context MMLU:
| Variant | MMLU | Context |
|---|---|---|
| 70B + alpha overlay (32K context) | 83.10% ± 0.30% | 32,768 |
| 70B + alpha overlay + YaRN 4x | 83.10% ± 0.30% | 131,072 |
The two scores are identical at the reported precision → YaRN 4x extends the context for free without touching short-context behavior. The shipped V3.3 includes both.
## YaRN 4x verification
| Metric | Value |
|---|---|
| short-context perplexity (32K) | 9.10 |
| long-context perplexity, factor=2 (64K) | 6.79 |
| long-context perplexity, factor=4 (128K) | 6.45 |
Both YaRN gates pass: `long_ppl ≤ 1.15 × short_ppl` and `long_ppl ≤ 1.30 × short_ppl`. Source: `phase4_yarn.json`, SHA256 `b73262cc17c7b6c144966b115c2847ac5509a0a8f65487d854a080779592ff69`.
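The two gates can be checked mechanically. Note that passing the 1.15× gate implies passing the 1.30× gate, so the 1.15× ratio is the binding constraint:

```python
def yarn_gates_pass(short_ppl: float, long_ppl: float) -> bool:
    # Both gates from the verification above: long-context perplexity
    # must stay within 1.15x and 1.30x of short-context perplexity.
    return long_ppl <= 1.15 * short_ppl and long_ppl <= 1.30 * short_ppl


# Values from the table: short-context (32K) vs long-context perplexity.
assert yarn_gates_pass(9.10, 6.79)  # factor=2, 64K
assert yarn_gates_pass(9.10, 6.45)  # factor=4, 128K
```

Here the long-context perplexities are actually *below* the short-context value, so both gates pass with room to spare.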
## How the alpha overlay was trained
| Field | Value |
|---|---|
| Trainable parameters | 280 (35 layers × 8 experts) |
| Frozen parameters | rest of the model (~70B) |
| Steps | 100 |
| Batch size | 1 (effective 4 with grad accumulation) |
| Sequence length | 512 |
| Optimizer | AdamW, lr=5e-4, weight_decay=0 |
| Loss | Causal LM (cross entropy on next-token prediction) |
| Calibration data | 1,000 prompts from `combined.jsonl` (held out from MMLU) |
| Held-out perplexity | 116.96 → 102.52 (drop 14.44, 12% relative improvement) |
| Wall clock | ~18 minutes on 2× B200 |
| Cost | ~$2 of cloud compute |
The training script is in the Outlier repo at `cluster_scripts/v4_alpha_fix.py` from the Day 13 cluster sprint.
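A minimal sketch of that recipe: freeze everything, promote the 280 alpha buffers to trainable parameters, and optimize only those. The function name below is hypothetical, and the real script may differ in detail:

```python
import torch
import torch.nn as nn


def make_alphas_trainable(model: nn.Module) -> list:
    """Freeze all weights, then expose only per-expert alpha scalars."""
    # Freeze the ~70B dense/MoE weights.
    for p in model.parameters():
        p.requires_grad_(False)
    alphas = []
    for mod in model.modules():
        if hasattr(mod, "alpha_values"):
            # Promote the buffer to a trainable parameter; PyTorch drops
            # the old buffer registration automatically on assignment.
            mod.alpha_values = nn.Parameter(mod.alpha_values.detach().clone())
            alphas.append(mod.alpha_values)
    return alphas


# Hyperparameters from the table above would then be:
# optimizer = torch.optim.AdamW(alphas, lr=5e-4, weight_decay=0.0)
```

With only these scalars in the optimizer state, a 100-step causal-LM run over the calibration prompts is enough to move them off their initialization.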
## Why this works (brief)
The Outlier MoE architecture stores per-expert alpha contribution scalars (the `alpha_values` buffer on each MoE MLP module) that gate how strongly each expert's delta is added to the shared base FFN output. V3.2 initializes these from the original training pipeline.
We previously tried V4 HESTIA + LoRA — applying LoRA adapters (~68M trainable params) to the shared MLP and attention layers. That approach regressed MMLU by -1.34pp on 70B. The diagnosis: LoRA was training the wrong parameters. The actual specialization knob is the per-expert alpha contribution. Train those directly (just 280 scalars, 250,000× fewer params than V4's LoRA), and the regression doesn't just disappear — the model exceeds the V3.2 baseline by +1.61pp.
This is also a publishable negative-result-turned-positive: full-fledged LoRA on the wrong layers (V4) is worse than a targeted 280-scalar tweak (V3.3) by 2.95 percentage points on the same benchmark, with roughly 250,000× fewer trainable parameters.
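The parameter-count ratio quoted above is straightforward arithmetic (68M is the approximate V4 LoRA count from this section):

```python
lora_params = 68_000_000  # ~68M trainable params in the V4 LoRA attempt
alpha_params = 280        # trained scalars in the V3.3 overlay

ratio = lora_params / alpha_params
print(f"{ratio:,.0f}x")   # roughly 243,000x, i.e. the ~250,000x quoted above
```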
## Limitations
- Alpha overlay verified on MMLU only. HellaSwag, ARC, TruthfulQA, and WinoGrande are not yet verified — see `OUTLIER_GROUND_TRUTH_v10.md` §2.6 for the verification queue.
- The alpha overlay is trained on causal LM perplexity (a proxy task). It transfers to MMLU as shown, but transfer to all downstream tasks is not guaranteed.
- 128K context via YaRN extends the position encoding only — KV cache memory still scales linearly with context length, so very long contexts may need the optional INT8 KV cache (work in progress; see v10 §5.2).
- 150B does not yet have a V3.3 release. The same recipe is expected to transfer; planned for a future sprint.
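To make the linear KV-cache scaling concrete, a back-of-envelope sketch; the layer and head counts here are placeholders, not the actual Outlier-70B geometry:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elt: int = 2) -> int:
    # One K and one V tensor per layer, bf16 (2 bytes/element) by default.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt


# Placeholder geometry (NOT the real model config): 80 layers, 8 KV heads,
# head_dim 128. At the full 131,072-token context this works out to
# 40 GiB per sequence at bf16, halving to 20 GiB with an INT8 KV cache.
print(kv_cache_bytes(131_072, 80, 8, 128) / 2**30)
```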
## Citation
If you use V3.3 in research, please cite:
```bibtex
@misc{outlier2026v33,
  title={Outlier-70B-V3.3: A 280-Scalar Alpha Overlay Recovers a 1.34 Percentage Point MMLU Regression on Ternary MoE Models},
  author={Matt Kerr},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Outlier-Ai/Outlier-70B-V3.3}
}
```
Patents (filed):
- Per-channel ternary scale recalibration
- Cross-layer expert sharing (ReXMoE, used in 150B)
- Alpha contribution overlay (this release)
## License
Apache 2.0
## Acknowledgments
Day 13 cluster sprint ran on 2× NVIDIA B200 via DataCrunch. Built in 14 days on $900 and a Mac Studio.
## Model tree

Base model: Qwen/Qwen2.5-32B