Outlier-70B-V3.3

70B ternary MoE, 83.10% MMLU, 128K context, 15 KB upgrade over V3.2.

Status: production. Replaces V3.2 as the recommended 70B variant.

TL;DR

V3.3 is V3.2's weights + a 15 KB alpha overlay + a YaRN config patch. Same dense weights as V3.2; the upgrade lives entirely in alpha_overlay.pt (280 trained scalars) and a 3-line rope_scaling block in config.json.

| Metric | V3.2 | V3.3 | Δ |
|---|---|---|---|
| MMLU | 81.49% | 83.10% ± 0.30% | +1.61pp |
| Context length | 32,768 | 131,072 | 4× |
| Files added | — | alpha_overlay.pt (15 KB), patched config.json | — |

How V3.3 differs from V3.2

V3.2 is a Qwen2.5-32B base + ternary MoE delta experts. V3.3 keeps all V3.2 weights byte-for-byte identical and adds two things:

  1. alpha_overlay.pt — a torch-pickled dict containing 280 trained alpha scalars (35 MoE layers × 8 experts each). At load time, the runtime reads this file and replaces the per-expert alpha_values buffers in the model with these trained values. The original V3.2 alpha values are preserved under an originals key inside the file in case you want to revert.

  2. config.json rope_scaling block — adds {"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768} and bumps max_position_embeddings from 32768 → 131072. This is a config-only change; the underlying RoPE base frequency is unchanged.
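As a concrete sketch, the YaRN patch amounts to the following (using a stand-in dict in place of the real config.json; the rope_scaling values are quoted from this card):

```python
import json

# Stand-in for the V3.2 config -- not the full config.json.
cfg = {"max_position_embeddings": 32768}  # V3.2 baseline

# The V3.3 config-only patch: add the rope_scaling block and bump the limit.
cfg["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
cfg["max_position_embeddings"] = 131072  # 4.0 x 32768

print(json.dumps(cfg, indent=2))
```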

That's it. No new weights. No retraining. No new model architecture.
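A minimal sketch of the overlay mechanics, using plain Python dicts in place of torch buffers. The key names here (alphas, originals) and the dict layout are illustrative, not the actual file format:

```python
N_LAYERS, N_EXPERTS = 35, 8  # 280 scalars total, as in V3.3

def apply_overlay(model_alphas, overlay):
    """Swap the trained alpha scalars into the model's state in place."""
    model_alphas.update(overlay["alphas"])

def revert_overlay(model_alphas, overlay):
    """Restore the original V3.2 alphas preserved inside the overlay."""
    model_alphas.update(overlay["originals"])

# Toy state: one alpha scalar per (layer, expert) pair.
current = {(l, e): 1.0 for l in range(N_LAYERS) for e in range(N_EXPERTS)}
trained = {k: v + 0.01 for k, v in current.items()}
overlay = {"alphas": trained, "originals": dict(current)}

apply_overlay(current, overlay)
assert current[(0, 0)] == 1.01   # trained value now active
revert_overlay(current, overlay)
assert current[(0, 0)] == 1.0    # revert path works
```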

Loading

The Outlier runtime ships outlier.runtime.alpha_loader.load_alpha_overlay() which finds the right overlay in priority order:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from outlier.runtime.alpha_loader import load_alpha_overlay

tokenizer = AutoTokenizer.from_pretrained("Outlier-Ai/Outlier-70B-V3.3", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Outlier-Ai/Outlier-70B-V3.3",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
# Apply the V3.3 alpha overlay
load_alpha_overlay(model, model_tag="70b_v3.3")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0]))
```

If you don't have the Outlier runtime installed, use the standalone snippet shipped in this repo (alpha_loader_snippet.py):

```python
from alpha_loader_snippet import apply_v3_3_overlay

apply_v3_3_overlay(model, "alpha_overlay.pt")  # same effect, no Outlier dependency
```

Provenance

| Metric | Value |
|---|---|
| MMLU | 83.10% ± 0.30% |
| Sample size (n) | 14,042 |
| Stderr | ±0.0030 |
| Harness | lm_eval 0.4.9.1 |
| Date measured | 2026-04-14 (Day 13 cluster sprint) |
| Hardware | 2× NVIDIA B200 SXM6 |
| Source file | verify_70b_alpha_fixed_mmlu.json |
| Source SHA256 | 6c6ea3f89426a95dd4d11883b79718f2ddcf632f8793e10b9fb665c728386398 |

Full provenance chain in OUTLIER_GROUND_TRUTH_v10.md §2.3a.

Combined run (alpha + YaRN 4x context)

The MMLU number above was measured with the alpha overlay only at 32K context. We separately verified that YaRN 4x is lossless on short-context MMLU:

| Variant | MMLU | Context |
|---|---|---|
| 70B + alpha overlay (32K context) | 83.10% ± 0.30% | 32,768 |
| 70B + alpha overlay + YaRN 4x | 83.10% ± 0.30% | 131,072 |

The scores are identical to four decimal places, so YaRN 4x extends the context for free without touching short-context behavior. The shipped V3.3 includes both.

YaRN 4x verification

| Metric | Value |
|---|---|
| Short-context perplexity (32K) | 9.10 |
| Long-context perplexity, factor=2 (64K) | 6.79 |
| Long-context perplexity, factor=4 (128K) | 6.45 |

Both YaRN gates pass: long-context perplexity is in fact lower than short-context perplexity, clearing both the long_ppl ≤ 1.15× short_ppl and the long_ppl ≤ 1.30× short_ppl thresholds with room to spare. Source: phase4_yarn.json, SHA256 b73262cc17c7b6c144966b115c2847ac5509a0a8f65487d854a080779592ff69.
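The gate arithmetic from the table above, spelled out (values as quoted in this card):

```python
short_ppl = 9.10     # 32K context
long_ppl_2x = 6.79   # 64K, factor=2
long_ppl_4x = 6.45   # 128K, factor=4

# Both long-context perplexities are below the short-context value,
# so both gates pass with large margin.
for long_ppl in (long_ppl_2x, long_ppl_4x):
    assert long_ppl <= 1.15 * short_ppl
    assert long_ppl <= 1.30 * short_ppl

print(round(long_ppl_2x / short_ppl, 3), round(long_ppl_4x / short_ppl, 3))
```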

How the alpha overlay was trained

| Field | Value |
|---|---|
| Trainable parameters | 280 (35 layers × 8 experts) |
| Frozen parameters | rest of the model (~70B) |
| Steps | 100 |
| Batch size | 1 (effective 4 with gradient accumulation) |
| Sequence length | 512 |
| Optimizer | AdamW, lr=5e-4, weight_decay=0 |
| Loss | Causal LM (cross-entropy on next-token prediction) |
| Calibration data | 1,000 prompts from combined.jsonl (held out from MMLU) |
| Held-out perplexity | 116.96 → 102.52 (drop of 14.44, a 12% relative improvement) |
| Wall clock | ~18 minutes on 2× B200 |
| Cost | ~$2 of cloud compute |
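A quick consistency check on the numbers above (the ~68M LoRA parameter count comes from the V4 comparison later in this card):

```python
trainable = 35 * 8                 # per-expert alpha scalars
assert trainable == 280

ppl_before, ppl_after = 116.96, 102.52
drop = ppl_before - ppl_after      # 14.44
rel = drop / ppl_before            # ~12.3% relative improvement
assert abs(drop - 14.44) < 1e-6
assert 0.12 < rel < 0.13

lora_params = 68_000_000           # V4 HESTIA + LoRA, per this card
print(f"{lora_params / trainable:,.0f}x fewer trainable params")
```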

The training script is in the Outlier repo at cluster_scripts/v4_alpha_fix.py from the Day 13 cluster sprint.

Why this works (brief)

The Outlier MoE architecture stores per-expert alpha contribution scalars (alpha_values buffer on each MoE MLP module) that gate how strongly each expert's delta is added to the shared base FFN output. V3.2 initializes these from the original training pipeline.
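A toy illustration of the gating role alpha plays (pure Python; not the actual Outlier module, whose routing and expert shapes are more involved):

```python
def moe_output(x, base_ffn, experts, router_weights, alphas):
    """Shared-base MoE: each expert contributes a delta to the base FFN
    output, scaled by its router weight and its alpha contribution scalar."""
    y = base_ffn(x)
    for w, expert, alpha in zip(router_weights, experts, alphas):
        y += alpha * w * expert(x)
    return y

base = lambda x: 2.0 * x
experts = [lambda x: 0.5 * x, lambda x: -0.5 * x]

# With symmetric alphas the expert deltas cancel; retuning a single alpha
# shifts the output -- this is the knob the 280-scalar overlay trains.
out_a = moe_output(1.0, base, experts, [1.0, 1.0], [1.0, 1.0])
out_b = moe_output(1.0, base, experts, [1.0, 1.0], [1.5, 1.0])
assert out_a == 2.0
assert out_b == 2.25
```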

We previously tried V4 HESTIA + LoRA, applying LoRA adapters (~68M trainable params) to the shared MLP and attention layers. That approach regressed MMLU by 1.34pp on 70B. The diagnosis: LoRA was training the wrong parameters. The actual specialization knob is the per-expert alpha contribution. Train those directly (just 280 scalars, 250,000× fewer params than V4's LoRA), and the regression doesn't just disappear: the model exceeds the V3.2 baseline by +1.61pp.

This is also a publishable negative-result-turned-positive: full-fledged LoRA on the wrong layers (V4) is worse than a targeted 280-scalar tweak (V3.3) by 2.95 percentage points on the same benchmark, with 250,000× fewer trainable parameters.

Limitations

  • Alpha overlay verified on MMLU only. HellaSwag, ARC, TruthfulQA, WinoGrande are not yet verified — see OUTLIER_GROUND_TRUTH_v10.md §2.6 for the verification queue.
  • The alpha overlay is trained on causal LM perplexity (a proxy task). It transfers to MMLU as shown, but transfer to all downstream tasks is not guaranteed.
  • 128K context via YaRN extends the position encoding only — KV cache memory still scales linearly with context length, so very long contexts may need the optional INT8 KV cache (work-in-progress, see v10 §5.2).
  • 150B does not yet have a V3.3 release. The same recipe is expected to transfer; future sprint.
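To make the linear KV-cache scaling concrete, here is a back-of-envelope sizing helper. The layer/head dimensions below are illustrative placeholders, not this model's actual config:

```python
def kv_cache_bytes(seq_len, n_layers=64, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):  # bf16
    # K and V caches each hold n_layers x n_kv_heads x head_dim x seq_len
    # values, so total bytes scale linearly with sequence length.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

gib = 1024 ** 3
print(f"32K:  {kv_cache_bytes(32_768) / gib:.1f} GiB")
print(f"128K: {kv_cache_bytes(131_072) / gib:.1f} GiB")  # exactly 4x the 32K cost

assert kv_cache_bytes(131_072) == 4 * kv_cache_bytes(32_768)
```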

Citation

If you use V3.3 in research, please cite:

@misc{outlier2026v33,
  title={Outlier-70B-V3.3: A 280-Scalar Alpha Overlay Recovers a 1.34 Percentage Point MMLU Regression on Ternary MoE Models},
  author={Matt Kerr},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Outlier-Ai/Outlier-70B-V3.3}
}

Patents (filed):

  • Per-channel ternary scale recalibration
  • Cross-layer expert sharing (ReXMoE, used in 150B)
  • Alpha contribution overlay (this release)

License

Apache 2.0

Acknowledgments

Day 13 cluster sprint ran on 2× NVIDIA B200 via DataCrunch. Built in 14 days on $900 and a Mac Studio.


Model tree for Outlier-Ai/Outlier-70B-V3.3

Base model: Qwen/Qwen2.5-32B, finetuned into this model.