A frontier-class Mixture-of-Experts language model — competitive with leading closed-source models at a fraction of the training cost. Fully open-source and commercially licensed.
Mixture-of-Experts
671B total params
37B activated per token
128K context
FP8 training
MIT License
huggingface.co/teamzero/astrox
Model
AstroX
Instruction-tuned chat model, post-trained with reinforcement learning and distillation of long-chain-of-thought reasoning. The only model currently available in the AstroX family.
MoE
671B / 37B active
128K context
FP8 weights
| Spec | Detail |
| --- | --- |
| Architecture | MoE + Multi-head Latent Attention |
| Experts | 256 total · 8 active |
| Pre-training data | 14.8T tokens |
| License | MIT + Model Agreement |
Architecture highlights
Attention
Multi-head Latent Attention (MLA)
Reduces KV cache memory footprint significantly vs. standard MHA, enabling practical long-context inference.
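As a rough illustration of why latent compression matters, the sketch below compares per-token KV cache sizes for standard MHA against an MLA-style compressed cache. The dimensions are hypothetical placeholders, not AstroX's actual configuration.

```python
# Hypothetical dimensions for illustration only -- not AstroX's real config.

def mha_kv_elements(n_heads: int, head_dim: int) -> int:
    """Standard MHA caches a full key and value vector for every head."""
    return 2 * n_heads * head_dim

def mla_kv_elements(latent_dim: int, rope_dim: int) -> int:
    """MLA caches one compressed KV latent plus a small decoupled RoPE key."""
    return latent_dim + rope_dim

mha = mha_kv_elements(n_heads=128, head_dim=128)    # 32768 elements per token per layer
mla = mla_kv_elements(latent_dim=512, rope_dim=64)  # 576 elements per token per layer
print(f"reduction: {mha / mla:.1f}x")               # ~56.9x under these assumptions
```

The cache shrinks by the ratio of the full per-head KV width to the shared latent width, which is what makes 128K-token contexts practical to serve.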
Load balancing
Auxiliary-loss-free strategy
Balances expert load without the performance penalty of traditional auxiliary loss terms.
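A minimal sketch of one common bias-based formulation of this idea: each expert carries a bias that is added to its routing score only when choosing the top-k experts, and the bias is nudged after each step based on observed load. Function names and the update rule are illustrative, not AstroX's exact implementation.

```python
def select_experts(scores, bias, k):
    """Pick top-k experts by biased score; the bias steers *selection* only,
    so the gating weights used downstream still come from the raw scores."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e])
    return ranked[-k:]

def update_bias(bias, load, gamma=0.01):
    """Lower the bias of overloaded experts and raise it for underloaded
    ones, steering future routing toward balance with no auxiliary loss."""
    mean = sum(load) / len(load)
    sign = lambda x: (x > 0) - (x < 0)
    return [b - gamma * sign(l - mean) for b, l in zip(bias, load)]

bias = update_bias([0.0] * 4, [10.0, 2.0, 2.0, 2.0])
print(bias)  # [-0.01, 0.01, 0.01, 0.01] -- overloaded expert 0 is penalized
```

Because the correction lives outside the loss function, no gradient from a balancing term interferes with the language-modeling objective.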
Training objective
Multi-Token Prediction (MTP)
Predicts multiple future tokens simultaneously, boosting performance and enabling speculative decoding.
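Concretely, at each position the model is trained to predict not just the next token but the next D tokens. A toy sketch of how the extra targets line up (the helper name is illustrative):

```python
def mtp_targets(tokens: list[int], depth: int) -> list[list[int]]:
    """For prediction depth d, the target at position i is token i + d,
    so each extra depth is just a further-shifted copy of the sequence."""
    return [tokens[d:] for d in range(1, depth + 1)]

tokens = [101, 7, 42, 13, 99]
print(mtp_targets(tokens, depth=2))
# depth 1 targets: [7, 42, 13, 99]; depth 2 targets: [42, 13, 99]
```

The training loss is then averaged over all depths, giving the model a denser signal per sequence than next-token prediction alone.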
Post-training
Reasoning distillation
Verification and reflection patterns distilled from a long-CoT model, keeping output style and length controlled.
Key innovations
FP8 mixed precision
The first validated large-scale FP8 training run, cutting compute cost without measurable quality loss.
Zero training instability
No irrecoverable loss spikes and no rollbacks throughout the entire pre-training run.
Full comm/compute overlap
Co-designed algorithms and hardware nearly eliminate cross-node MoE communication bottlenecks.
Speculative decoding ready
The MTP module doubles as a draft head for inference acceleration out of the box.
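A minimal sketch of the acceptance loop that makes a draft head useful: the draft proposes a few tokens cheaply, the main model verifies them in a single forward pass, and tokens are kept up to the first disagreement. This shows greedy acceptance only; names are illustrative.

```python
def accept_tokens(draft: list[int], verified: list[int]) -> list[int]:
    """Keep draft tokens while they match the main model's greedy choices;
    on the first mismatch, emit the verified token instead and stop."""
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            accepted.append(v)  # correction from the main model
            break
        accepted.append(d)
    return accepted

# Draft head guessed 3 tokens; the main model agrees on the first 2.
print(accept_tokens([5, 9, 2], [5, 9, 7]))  # -> [5, 9, 7]
```

Every accepted draft token is one decoding step the main model did not have to run autoregressively, which is where the speedup comes from.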
Benchmark performance — math & reasoning
| Benchmark | GPT-4o | Claude 3.5 Sonnet | AstroX |
| --- | --- | --- | --- |
| AIME 2024 (Pass@1) | 9.3 | 16.0 | 39.2 |
| MATH-500 (EM) | 74.6 | 78.3 | 90.2 |
| CNMO 2024 (Pass@1) | 10.8 | 13.1 | 43.2 |
| GSM8K (EM) | — | — | 89.3 |
Benchmark performance — code
| Benchmark | GPT-4o | Claude 3.5 Sonnet | AstroX |
| --- | --- | --- | --- |
| LiveCodeBench (Pass@1) | 34.2 | 32.8 | 37.6 |
| Codeforces (Percentile) | 23.6 | 20.3 | 51.6 |
| Aider-Polyglot (Acc.) | 16.0 | 45.3 | 49.6 |
| HumanEval-Mul (Pass@1) | 80.5 | 81.7 | 82.6 |
Benchmark performance — general
| Benchmark | GPT-4o | Claude 3.5 Sonnet | AstroX |
| --- | --- | --- | --- |
| MMLU (EM) | 87.2 | 88.3 | 88.5 |
| Arena-Hard | 80.4 | 85.2 | 85.5 |
| AlpacaEval 2.0 | 51.1 | 52.0 | 70.0 |
| DROP (3-shot F1) | 83.7 | 88.3 | 91.6 |
Supported inference frameworks
SGLang
Recommended · FP8 + BF16 · NVIDIA + AMD
vLLM
FP8 + BF16 · pipeline parallelism
LMDeploy
Offline + online · PyTorch-native
TensorRT-LLM
BF16 · INT4/INT8 quant
AMD GPU
via SGLang · FP8 + BF16
Huawei Ascend
via MindIE · BF16
Quick start
Convert FP8 weights to BF16
```shell
python fp8_cast_bf16.py \
  --input-fp8-hf-path /path/to/fp8_weights \
  --output-bf16-hf-path /path/to/bf16_weights
```
Run interactive inference (2 nodes · 8 GPUs each)
```shell
torchrun --nnodes 2 --nproc-per-node 8 generate.py \
  --node-rank $RANK --master-addr $ADDR \
  --ckpt-path /path/to/AstroX \
  --config configs/config_671B.json \
  --interactive --temperature 0.7 --max-new-tokens 200
```
Code license: MIT · Model license: Model Agreement · Commercial use supported