A frontier-class Mixture-of-Experts language model — competitive with leading closed-source models at a fraction of the training cost. Fully open-source and commercially licensed.
Mixture-of-Experts
671B total params
37B activated per token
128K context
FP8 training
MIT License
huggingface.co/teamzero/astrox
Model
AstroX
Instruction-tuned chat model, post-trained with reinforcement learning and distillation of long-chain-of-thought reasoning. The only model currently available in the AstroX family.
MoE
671B / 37B active
128K context
FP8 weights
| Spec | Detail |
| --- | --- |
| Architecture | MoE + Multi-head Latent Attention |
| Experts | 256 total · 8 active |
| Pre-training data | 14.8T tokens |
| License | MIT + Model Agreement |
Architecture highlights
Attention
Multi-head Latent Attention (MLA)
Reduces KV cache memory footprint significantly vs. standard MHA, enabling practical long-context inference.
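As a rough illustration of why latent compression matters, the sketch below compares per-token KV cache sizes for standard MHA against an MLA-style compressed cache. The dimensions are hypothetical placeholders, not AstroX's actual configuration.

```python
# Hypothetical dimensions for illustration only -- not AstroX's real config.

def mha_kv_elements(n_heads: int, head_dim: int) -> int:
    """Standard MHA caches a full key and value vector for every head."""
    return 2 * n_heads * head_dim

def mla_kv_elements(latent_dim: int, rope_dim: int) -> int:
    """MLA caches one compressed KV latent plus a small decoupled RoPE key."""
    return latent_dim + rope_dim

mha = mha_kv_elements(n_heads=128, head_dim=128)    # 32768 elements per token per layer
mla = mla_kv_elements(latent_dim=512, rope_dim=64)  # 576 elements per token per layer
print(f"reduction: {mha / mla:.1f}x")               # ~56.9x under these assumptions
```

The cache shrinks by the ratio of the full per-head KV width to the shared latent width, which is what makes 128K-token contexts practical to serve.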
Load balancing
Auxiliary-loss-free strategy
Balances expert load without the performance penalty of traditional auxiliary loss terms.
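A minimal sketch of one common bias-based formulation of this idea: each expert carries a bias that is added to its routing score only when choosing the top-k experts, and the bias is nudged after each step based on observed load. Function names and the update rule are illustrative, not AstroX's exact implementation.

```python
def select_experts(scores, bias, k):
    """Pick top-k experts by biased score; the bias steers *selection* only,
    so the gating weights used downstream still come from the raw scores."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e])
    return ranked[-k:]

def update_bias(bias, load, gamma=0.01):
    """Lower the bias of overloaded experts and raise it for underloaded
    ones, steering future routing toward balance with no auxiliary loss."""
    mean = sum(load) / len(load)
    sign = lambda x: (x > 0) - (x < 0)
    return [b - gamma * sign(l - mean) for b, l in zip(bias, load)]

bias = update_bias([0.0] * 4, [10.0, 2.0, 2.0, 2.0])
print(bias)  # [-0.01, 0.01, 0.01, 0.01] -- overloaded expert 0 is penalized
```

Because the correction lives outside the loss function, no gradient from a balancing term interferes with the language-modeling objective.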
Training objective
Multi-Token Prediction (MTP)
Predicts multiple future tokens simultaneously, boosting performance and enabling speculative decoding.
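Concretely, at each position the model is trained to predict not just the next token but the next D tokens. A toy sketch of how the extra targets line up (the helper name is illustrative):

```python
def mtp_targets(tokens: list[int], depth: int) -> list[list[int]]:
    """For prediction depth d, the target at position i is token i + d,
    so each extra depth is just a further-shifted copy of the sequence."""
    return [tokens[d:] for d in range(1, depth + 1)]

tokens = [101, 7, 42, 13, 99]
print(mtp_targets(tokens, depth=2))
# depth 1 targets: [7, 42, 13, 99]; depth 2 targets: [42, 13, 99]
```

The training loss is then averaged over all depths, giving the model a denser signal per sequence than next-token prediction alone.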
Post-training
Reasoning distillation
Verification and reflection patterns distilled from a long-CoT model, keeping output style and length controlled.
Key innovations
FP8 mixed precision
The first validated large-scale FP8 training run, cutting compute cost without measurable quality loss.
Zero training instability
No irrecoverable loss spikes and no rollbacks throughout the entire pre-training run.
Full comm/compute overlap
Co-designed algorithms and hardware nearly eliminate cross-node MoE communication bottlenecks.
Speculative decoding ready
The MTP module doubles as a draft head for inference acceleration out of the box.
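A minimal sketch of the acceptance loop that makes a draft head useful: the draft proposes a few tokens cheaply, the main model verifies them in a single forward pass, and tokens are kept up to the first disagreement. This shows greedy acceptance only; names are illustrative.

```python
def accept_tokens(draft: list[int], verified: list[int]) -> list[int]:
    """Keep draft tokens while they match the main model's greedy choices;
    on the first mismatch, emit the verified token instead and stop."""
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            accepted.append(v)  # correction from the main model
            break
        accepted.append(d)
    return accepted

# Draft head guessed 3 tokens; the main model agrees on the first 2.
print(accept_tokens([5, 9, 2], [5, 9, 7]))  # -> [5, 9, 7]
```

Every accepted draft token is one decoding step the main model did not have to run autoregressively, which is where the speedup comes from.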
Benchmark performance — math & reasoning
| Benchmark | GPT-4o | Claude 3.5 Sonnet | AstroX |
| --- | --- | --- | --- |
| AIME 2024 (Pass@1) | 9.3 | 16.0 | 39.2 |
| MATH-500 (EM) | 74.6 | 78.3 | 90.2 |
| CNMO 2024 (Pass@1) | 10.8 | 13.1 | 43.2 |
| GSM8K (EM) | — | — | 89.3 |
Benchmark performance — code
| Benchmark | GPT-4o | Claude 3.5 Sonnet | AstroX |
| --- | --- | --- | --- |
| LiveCodeBench (Pass@1) | 34.2 | 32.8 | 37.6 |
| Codeforces (Percentile) | 23.6 | 20.3 | 51.6 |
| Aider-Polyglot (Acc.) | 16.0 | 45.3 | 49.6 |
| HumanEval-Mul (Pass@1) | 80.5 | 81.7 | 82.6 |
Benchmark performance — general
| Benchmark | GPT-4o | Claude 3.5 Sonnet | AstroX |
| --- | --- | --- | --- |
| MMLU (EM) | 87.2 | 88.3 | 88.5 |
| Arena-Hard | 80.4 | 85.2 | 85.5 |
| AlpacaEval 2.0 | 51.1 | 52.0 | 70.0 |
| DROP (3-shot F1) | 83.7 | 88.3 | 91.6 |
Supported inference frameworks
SGLang
Recommended · FP8 + BF16 · NVIDIA + AMD
vLLM
FP8 + BF16 · pipeline parallelism
LMDeploy
Offline + online · PyTorch-native
TensorRT-LLM
BF16 · INT4/INT8 quant
AMD GPU
via SGLang · FP8 + BF16
Huawei Ascend
via MindIE · BF16
Quick start
Convert FP8 weights to BF16
```shell
python fp8_cast_bf16.py \
  --input-fp8-hf-path /path/to/fp8_weights \
  --output-bf16-hf-path /path/to/bf16_weights
```
Run interactive inference (2 nodes · 8 GPUs each)
```shell
torchrun --nnodes 2 --nproc-per-node 8 generate.py \
  --node-rank $RANK --master-addr $ADDR \
  --ckpt-path /path/to/AstroX \
  --config configs/config_671B.json \
  --interactive --temperature 0.7 --max-new-tokens 200
```
Code license: MIT · Model license: Model Agreement · Commercial use supported