AETHER-166B-A8B (Phase 0.5: Oheng Calibrated)

166B-parameter Mixture-of-Experts model with a Wu Xing (Five Elements; Korean: oheng) expert architecture.

Model Details

  • Parameters: 166B total, ~8B active per token
  • Architecture: 25 layers, 5x5 MoE per layer, GDN+Mamba2+FullAttn hybrid
  • Parent: Qwen3.5-397B-A17B (weight transplant + oheng calibration)
  • Oheng: GenerateBoost (alpha[5]) + OvercomeGate
  • Calibration: v11, 3000 steps, sparse KLD, best_kld=14.28 @ step 1489
  • Precision: bfloat16
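
The card names GenerateBoost (alpha[5]) and OvercomeGate but does not specify their math. Below is a minimal, hypothetical sketch of how the five expert groups per layer might be coupled through the classical Wu Xing generating (sheng) and overcoming (ke) cycles: each element's router score boosts the element it generates and suppresses the element it overcomes. The `alpha` vector matches the card's `alpha[5]`; the `beta` gate strength and the additive form are assumptions.

```python
ELEMENTS = ["wood", "fire", "earth", "metal", "water"]

# Generating (sheng) cycle: each element feeds the next one.
GENERATES = {i: (i + 1) % 5 for i in range(5)}
# Overcoming (ke) cycle: each element suppresses the one two steps ahead.
OVERCOMES = {i: (i + 2) % 5 for i in range(5)}

def oheng_adjust(group_scores, alpha, beta=0.1):
    """Adjust the 5 expert-group router scores (hypothetical form).

    group_scores: raw router scores for the 5 element groups.
    alpha: learned per-element boost strengths (the card's alpha[5]).
    beta: overcome-gate strength (assumed scalar).
    """
    adjusted = list(group_scores)
    for src in range(5):
        adjusted[GENERATES[src]] += alpha[src] * group_scores[src]  # GenerateBoost
        adjusted[OVERCOMES[src]] -= beta * group_scores[src]        # OvercomeGate
    return adjusted
```

With a single active "wood" group, the boost flows to "fire" and the gate suppresses "earth", mirroring the classical cycle.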

Training

  • Phase 0: Weight transplant from Qwen3.5-397B → AETHER-166B
  • Phase 0.5: Oheng calibration (offline sparse KLD distillation, 0.41% params)
  • Phase 1: Full distillation (planned)
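
Phase 0.5 uses offline sparse KLD distillation. The card does not describe the loss in detail; a common construction, sketched here under that assumption, stores only the teacher's top-k probabilities per position offline, then computes KL(teacher || student) with both distributions renormalized over that sparse support. Function and argument names are illustrative, not from the card.

```python
import math

def sparse_kld(teacher_topk, student_logits, indices):
    """KL(teacher || student) restricted to the teacher's stored top-k support.

    teacher_topk: teacher probabilities for the stored top-k token ids.
    student_logits: full student logit vector for one position.
    indices: the top-k token ids the teacher probabilities refer to.
    """
    # Renormalize the stored teacher mass over the sparse support.
    t_sum = sum(teacher_topk)
    t = [p / t_sum for p in teacher_topk]
    # Student softmax restricted to the same support (numerically stable).
    s_logits = [student_logits[i] for i in indices]
    m = max(s_logits)
    exps = [math.exp(x - m) for x in s_logits]
    z = sum(exps)
    s = [e / z for e in exps]
    return sum(p * math.log(p / q) for p, q in zip(t, s) if p > 0)
```

Storing only top-k teacher probabilities is what makes the distillation cheap enough to run offline against a 397B teacher.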

Status

This is a Phase 0.5 checkpoint. The model is NOT yet suitable for inference. Full distillation (Phase 1) is required for usable generation quality.

Hardware

Trained on 8x NVIDIA H200 (143GB each) with pipeline parallel.
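
The card states pipeline parallelism over 8 GPUs but not the stage assignment. A near-even split of the 25 layers, shown here purely as an assumed scheme, would give one stage 4 layers and the rest 3:

```python
def pipeline_stages(n_layers=25, n_gpus=8):
    """Return a near-even per-GPU layer count for pipeline parallelism
    (hypothetical split; the actual assignment is not documented)."""
    base, extra = divmod(n_layers, n_gpus)
    # The first `extra` stages each take one additional layer.
    return [base + (1 if i < extra else 0) for i in range(n_gpus)]
```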
