Outlier-70B-V3.3

70B ternary MoE, 83.10% MMLU, 128K context, 15 KB upgrade over V3.2.

Status: production. Replaces V3.2 as the recommended 70B variant.

TL;DR

V3.3 is V3.2's weights + a 15 KB alpha overlay + a YaRN config patch. Same dense weights as V3.2; the upgrade lives entirely in alpha_overlay.pt (280 trained scalars) and a 3-line rope_scaling block in config.json.

| Metric | V3.2 | V3.3 | Δ |
|---|---|---|---|
| MMLU | 81.49% | 83.10% ± 0.30% | +1.61pp |
| Context length | 32,768 | 131,072 | 4× |
| Files added | — | alpha_overlay.pt (15 KB), patched config.json | — |

How V3.3 differs from V3.2

V3.2 is a Qwen2.5-32B base + ternary MoE delta experts. V3.3 keeps all V3.2 weights byte-for-byte identical and adds two things:

  1. alpha_overlay.pt — a torch-pickled dict containing 280 trained alpha scalars (35 MoE layers × 8 experts each). At load time, the runtime reads this file and replaces the per-expert alpha_values buffers in the model with these trained values. The original V3.2 alpha values are preserved under an originals key inside the file in case you want to revert.

  2. config.json rope_scaling block — adds {"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768} and bumps max_position_embeddings from 32768 → 131072. This is a config-only change; the underlying RoPE base frequency is unchanged.
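As a concrete sketch, the YaRN patch amounts to the following (using a stand-in dict in place of the real config.json; the rope_scaling values are quoted from this card):

```python
import json

# Stand-in for the V3.2 config -- not the full config.json.
cfg = {"max_position_embeddings": 32768}  # V3.2 baseline

# The V3.3 config-only patch: add the rope_scaling block and bump the limit.
cfg["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
cfg["max_position_embeddings"] = 131072  # 4.0 x 32768

print(json.dumps(cfg, indent=2))
```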

That's it. No new weights. No retraining. No new model architecture.
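A minimal sketch of the overlay mechanics, using plain Python dicts in place of torch buffers. The key names here (alphas, originals) and the dict layout are illustrative, not the actual file format:

```python
N_LAYERS, N_EXPERTS = 35, 8  # 280 scalars total, as in V3.3

def apply_overlay(model_alphas, overlay):
    """Swap the trained alpha scalars into the model's state in place."""
    model_alphas.update(overlay["alphas"])

def revert_overlay(model_alphas, overlay):
    """Restore the original V3.2 alphas preserved inside the overlay."""
    model_alphas.update(overlay["originals"])

# Toy state: one alpha scalar per (layer, expert) pair.
current = {(l, e): 1.0 for l in range(N_LAYERS) for e in range(N_EXPERTS)}
trained = {k: v + 0.01 for k, v in current.items()}
overlay = {"alphas": trained, "originals": dict(current)}

apply_overlay(current, overlay)
assert current[(0, 0)] == 1.01   # trained value now active
revert_overlay(current, overlay)
assert current[(0, 0)] == 1.0    # revert path works
```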

Loading

The Outlier runtime ships outlier.runtime.alpha_loader.load_alpha_overlay() which finds the right overlay in priority order:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from outlier.runtime.alpha_loader import load_alpha_overlay

tokenizer = AutoTokenizer.from_pretrained("Outlier-Ai/Outlier-70B-V3.3", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Outlier-Ai/Outlier-70B-V3.3",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
# Apply the V3.3 alpha overlay
load_alpha_overlay(model, model_tag="70b_v3.3")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0]))
```

If you don't have the Outlier runtime installed, use the standalone snippet shipped in this repo (alpha_loader_snippet.py):

```python
from alpha_loader_snippet import apply_v3_3_overlay

apply_v3_3_overlay(model, "alpha_overlay.pt")  # same effect, no Outlier dependency
```

Provenance

| Metric | Value |
|---|---|
| MMLU | 83.10% ± 0.30% |
| Sample size (n) | 14,042 |
| Stderr | ±0.0030 |
| Harness | lm_eval 0.4.9.1 |
| Date measured | 2026-04-14 (Day 13 cluster sprint) |
| Hardware | 2× NVIDIA B200 SXM6 |
| Source file | verify_70b_alpha_fixed_mmlu.json |
| Source SHA256 | 6c6ea3f89426a95dd4d11883b79718f2ddcf632f8793e10b9fb665c728386398 |

Full provenance chain in OUTLIER_GROUND_TRUTH_v10.md §2.3a.

Combined run (alpha + YaRN 4x context)

The MMLU number above was measured with the alpha overlay only at 32K context. We separately verified that YaRN 4x is lossless on short-context MMLU:

| Variant | MMLU | Context |
|---|---|---|
| 70B + alpha overlay (32K context) | 83.10% ± 0.30% | 32,768 |
| 70B + alpha overlay + YaRN 4x | 83.10% ± 0.30% | 131,072 |

The scores are identical to four decimal places, so YaRN 4x extends the context for free without touching short-context behavior. The shipped V3.3 includes both.

YaRN 4x verification

| Metric | Value |
|---|---|
| Short-context perplexity (32K) | 9.10 |
| Long-context perplexity, factor=2 (64K) | 6.79 |
| Long-context perplexity, factor=4 (128K) | 6.45 |

Both YaRN gates pass: long-context perplexity is in fact lower than short-context perplexity, clearing both the long_ppl ≤ 1.15× short_ppl and the long_ppl ≤ 1.30× short_ppl thresholds with room to spare. Source: phase4_yarn.json, SHA256 b73262cc17c7b6c144966b115c2847ac5509a0a8f65487d854a080779592ff69.
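The gate arithmetic from the table above, spelled out (values as quoted in this card):

```python
short_ppl = 9.10     # 32K context
long_ppl_2x = 6.79   # 64K, factor=2
long_ppl_4x = 6.45   # 128K, factor=4

# Both long-context perplexities are below the short-context value,
# so both gates pass with large margin.
for long_ppl in (long_ppl_2x, long_ppl_4x):
    assert long_ppl <= 1.15 * short_ppl
    assert long_ppl <= 1.30 * short_ppl

print(round(long_ppl_2x / short_ppl, 3), round(long_ppl_4x / short_ppl, 3))
```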

How the alpha overlay was trained

| Field | Value |
|---|---|
| Trainable parameters | 280 (35 layers × 8 experts) |
| Frozen parameters | rest of the model (~70B) |
| Steps | 100 |
| Batch size | 1 (effective 4 with gradient accumulation) |
| Sequence length | 512 |
| Optimizer | AdamW, lr=5e-4, weight_decay=0 |
| Loss | Causal LM (cross-entropy on next-token prediction) |
| Calibration data | 1,000 prompts from combined.jsonl (held out from MMLU) |
| Held-out perplexity | 116.96 → 102.52 (drop of 14.44, a 12% relative improvement) |
| Wall clock | ~18 minutes on 2× B200 |
| Cost | ~$2 of cloud compute |
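A quick consistency check on the numbers above (the ~68M LoRA parameter count comes from the V4 comparison later in this card):

```python
trainable = 35 * 8                 # per-expert alpha scalars
assert trainable == 280

ppl_before, ppl_after = 116.96, 102.52
drop = ppl_before - ppl_after      # 14.44
rel = drop / ppl_before            # ~12.3% relative improvement
assert abs(drop - 14.44) < 1e-6
assert 0.12 < rel < 0.13

lora_params = 68_000_000           # V4 HESTIA + LoRA, per this card
print(f"{lora_params / trainable:,.0f}x fewer trainable params")
```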

The training script is in the Outlier repo at cluster_scripts/v4_alpha_fix.py from the Day 13 cluster sprint.

Why this works (brief)

The Outlier MoE architecture stores per-expert alpha contribution scalars (alpha_values buffer on each MoE MLP module) that gate how strongly each expert's delta is added to the shared base FFN output. V3.2 initializes these from the original training pipeline.
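A toy illustration of the gating role alpha plays (pure Python; not the actual Outlier module, whose routing and expert shapes are more involved):

```python
def moe_output(x, base_ffn, experts, router_weights, alphas):
    """Shared-base MoE: each expert contributes a delta to the base FFN
    output, scaled by its router weight and its alpha contribution scalar."""
    y = base_ffn(x)
    for w, expert, alpha in zip(router_weights, experts, alphas):
        y += alpha * w * expert(x)
    return y

base = lambda x: 2.0 * x
experts = [lambda x: 0.5 * x, lambda x: -0.5 * x]

# With symmetric alphas the expert deltas cancel; retuning a single alpha
# shifts the output -- this is the knob the 280-scalar overlay trains.
out_a = moe_output(1.0, base, experts, [1.0, 1.0], [1.0, 1.0])
out_b = moe_output(1.0, base, experts, [1.0, 1.0], [1.5, 1.0])
assert out_a == 2.0
assert out_b == 2.25
```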

We previously tried V4 HESTIA + LoRA, applying LoRA adapters (~68M trainable params) to the shared MLP and attention layers. That approach regressed MMLU by 1.34pp on 70B. The diagnosis: LoRA was training the wrong parameters. The actual specialization knob is the per-expert alpha contribution. Train those directly (just 280 scalars, 250,000× fewer params than V4's LoRA), and the regression doesn't just disappear: the model exceeds the V3.2 baseline by +1.61pp.

This is also a publishable negative-result-turned-positive: full-fledged LoRA on the wrong layers (V4) is worse than a targeted 280-scalar tweak (V3.3) by 2.95 percentage points on the same benchmark, with 250,000× fewer trainable parameters.

Limitations

  • Alpha overlay verified on MMLU only. HellaSwag, ARC, TruthfulQA, WinoGrande are not yet verified — see OUTLIER_GROUND_TRUTH_v10.md §2.6 for the verification queue.
  • The alpha overlay is trained on causal LM perplexity (a proxy task). It transfers to MMLU as shown, but transfer to all downstream tasks is not guaranteed.
  • 128K context via YaRN extends the position encoding only — KV cache memory still scales linearly with context length, so very long contexts may need the optional INT8 KV cache (work-in-progress, see v10 §5.2).
  • 150B does not yet have a V3.3 release. The same recipe is expected to transfer; future sprint.
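To make the linear KV-cache scaling concrete, here is a back-of-envelope sizing helper. The layer/head dimensions below are illustrative placeholders, not this model's actual config:

```python
def kv_cache_bytes(seq_len, n_layers=64, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):  # bf16
    # K and V caches each hold n_layers x n_kv_heads x head_dim x seq_len
    # values, so total bytes scale linearly with sequence length.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

gib = 1024 ** 3
print(f"32K:  {kv_cache_bytes(32_768) / gib:.1f} GiB")
print(f"128K: {kv_cache_bytes(131_072) / gib:.1f} GiB")  # exactly 4x the 32K cost

assert kv_cache_bytes(131_072) == 4 * kv_cache_bytes(32_768)
```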

Citation

If you use V3.3 in research, please cite:

@misc{outlier2026v33,
  title={Outlier-70B-V3.3: A 280-Scalar Alpha Overlay Recovers a 1.34 Percentage Point MMLU Regression on Ternary MoE Models},
  author={Matt Kerr},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Outlier-Ai/Outlier-70B-V3.3}
}

Patents (filed):

  • Per-channel ternary scale recalibration
  • Cross-layer expert sharing (ReXMoE, used in 150B)
  • Alpha contribution overlay (this release)

License

Apache 2.0

Acknowledgments

Day 13 cluster sprint ran on 2× NVIDIA B200 via DataCrunch. Built in 14 days on $900 and a Mac Studio.


Model tree for Outlier-Ai/Outlier-70B-V3.3

Base model: Qwen/Qwen2.5-32B, finetuned into this model.