# AstroX

A frontier-class Mixture-of-Experts language model — competitive with leading closed-source models at a fraction of the training cost. Fully open source and commercially licensed.

**Mixture-of-Experts · 671B total params · 37B activated per token · 128K context · FP8 training · MIT License**

Model page: huggingface.co/teamzero/astrox
| Total params | Active per token | Context window | Training compute |
|---|---|---|---|
| 671B | 37B | 128K | 2.79M H800 GPU hours |
## Model

**AstroX** is an instruction-tuned chat model trained with reinforcement learning and distilled long-chain-of-thought reasoning. It is the only model currently available in the AstroX family.

| Spec | Value |
|---|---|
| Type | Mixture-of-Experts (671B total / 37B active) |
| Architecture | MoE + Multi-head Latent Attention |
| Experts | 256 total · 8 active per token |
| Context window | 128K tokens |
| Released weights | FP8 |
| Pre-training data | 14.8T tokens |
| License | MIT (code) + Model Agreement (weights) |
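The sparse-activation ratio follows directly from the numbers above: only 8 of the 256 routed experts (plus the shared, always-on components) run for each token, so the per-token compute is a small fraction of the full parameter count:

```python
# Arithmetic from the spec table: sparse activation means only a small
# slice of the 671B parameters participates in any single forward pass.
total_params, active_params = 671e9, 37e9
print(f"Active per token: {active_params / total_params:.1%}")  # ~5.5%
```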
## Architecture highlights

**Attention: Multi-head Latent Attention (MLA).** Significantly reduces the KV-cache memory footprint relative to standard multi-head attention, enabling practical long-context inference.
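The scale of the saving is easy to see with a back-of-envelope calculation. The sketch below uses purely illustrative layer, head, and latent dimensions (not AstroX's actual configuration) and assumes 2-byte cache entries:

```python
# Back-of-envelope KV-cache comparison. All sizes are illustrative and
# are NOT the model's actual configuration.
def kv_cache_bytes_mha(layers, heads, head_dim, seq_len, bytes_per_elem=2):
    # Standard MHA caches one key and one value vector per head,
    # per layer, per token.
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem

def kv_cache_bytes_mla(layers, latent_dim, seq_len, bytes_per_elem=2):
    # MLA caches a single compressed latent vector per layer, per token,
    # from which keys and values are reconstructed at attention time.
    return layers * latent_dim * seq_len * bytes_per_elem

layers, heads, head_dim, latent = 60, 128, 128, 512  # hypothetical sizes
seq = 128 * 1024                                     # 128K-token context
mha = kv_cache_bytes_mha(layers, heads, head_dim, seq)
mla = kv_cache_bytes_mla(layers, latent, seq)
print(f"MHA: {mha / 2**30:.1f} GiB, MLA: {mla / 2**30:.1f} GiB, "
      f"ratio: {mha / mla:.0f}x")
```

With these toy dimensions the cache shrinks by 2 × heads × head_dim / latent, i.e. 64×; the real ratio depends on the model's actual dimensions.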
**Load balancing: auxiliary-loss-free strategy.** Balances expert load without the performance penalty of traditional auxiliary-loss terms.
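One common way to realize such a strategy, sketched below with illustrative names and sizes, keeps a per-expert bias that steers top-k selection only (it never enters the gate weights) and is nudged after each batch toward uniform load, with no gradient term. This is a minimal sketch of the general idea, not the model's actual routing code:

```python
import numpy as np

rng = np.random.default_rng(0)

def route(scores, bias, k):
    # Top-k selection uses biased scores; the bias only steers routing
    # and does not change the gate weights applied to expert outputs.
    return np.argsort(scores + bias, axis=-1)[:, -k:]

def update_bias(bias, chosen, n_experts, gamma=0.01):
    # Nudge biases toward uniform load: penalize overloaded experts,
    # boost underloaded ones. No auxiliary loss, no gradients.
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    target = chosen.size / n_experts
    return bias - gamma * np.sign(load - target)

n_tokens, n_experts, k = 4096, 16, 2
bias = np.zeros(n_experts)
for _ in range(200):
    # Skewed affinities: without correction, high-index experts win.
    scores = rng.normal(size=(n_tokens, n_experts)) + np.linspace(0, 1, n_experts)
    chosen = route(scores, bias, k)
    bias = update_bias(bias, chosen, n_experts)
load = np.bincount(chosen.ravel(), minlength=n_experts) / chosen.size
print(load)  # loads pulled toward the uniform 1/16
```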
**Training objective: Multi-Token Prediction (MTP).** Predicts multiple future tokens simultaneously, improving performance and enabling speculative decoding.
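Schematically, MTP gives each extra prediction head a further-shifted next-token target. A toy illustration (the head layout here is illustrative, not the model's actual design):

```python
# Schematic MTP target construction: the main head at position i
# predicts token i+1; extra head d predicts token i+1+d.
def mtp_targets(tokens, extra_heads):
    # One target sequence per head, truncated to positions that still
    # have a label: head d trains against the input shifted by d+1.
    return [tokens[1 + d:] for d in range(extra_heads + 1)]

toks = [10, 11, 12, 13, 14]
for d, tgt in enumerate(mtp_targets(toks, 2)):
    print(f"head {d}: targets {tgt}")
```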
**Post-training: reasoning distillation.** Verification and reflection patterns are distilled from a long-chain-of-thought teacher model while keeping output style and length under control.
## Key innovations

**FP8 mixed precision.** First validated large-scale FP8 training run; cuts compute cost without quality loss.
**Zero training instability.** No irrecoverable loss spikes and no rollbacks throughout the entire pre-training run.
**Full communication/compute overlap.** Co-designed algorithms and hardware nearly eliminate cross-node MoE communication bottlenecks.
**Speculative decoding ready.** The MTP module doubles as a draft head for out-of-the-box inference acceleration.
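The draft-and-verify pattern this enables can be sketched with toy stand-in models. This is a greedy-acceptance sketch only; in real speculative decoding the verifier scores the whole draft block in a single batched forward pass, which is where the speedup comes from:

```python
# Toy draft-and-verify loop: a cheap draft head proposes k tokens, the
# full model checks them, and the longest agreeing prefix is kept plus
# one correction token. The "models" below are stand-in functions, not
# the actual AstroX heads.
def speculative_step(prefix, draft_next, target_next, k=4):
    draft, ctx = [], list(prefix)
    for _ in range(k):                 # autoregressive drafting
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)   # greedy verification
    for t in draft:
        expected = target_next(ctx)
        accepted.append(expected)      # the target's token is always usable
        if expected != t:              # first mismatch: stop accepting
            break
        ctx.append(t)
    return accepted

# Stand-ins: the target counts up mod 10; the draft agrees except when
# the context length is a multiple of 3.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: (ctx[-1] + 1) % 10 if len(ctx) % 3 else (ctx[-1] + 2) % 10
seq = [0]
while len(seq) < 10:
    seq.extend(speculative_step(seq, draft, target))
print(seq)  # counts 0..9; each step accepts several tokens per target pass
```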
## Benchmark performance — math & reasoning
| Benchmark | GPT-4o | Claude 3.5 Sonnet | AstroX |
|---|---|---|---|
| AIME 2024 (Pass@1) | 9.3 | 16.0 | 39.2 |
| MATH-500 (EM) | 74.6 | 78.3 | 90.2 |
| CNMO 2024 (Pass@1) | 10.8 | 13.1 | 43.2 |
| GSM8K (EM) | — | — | 89.3 |
## Benchmark performance — code
| Benchmark | GPT-4o | Claude 3.5 Sonnet | AstroX |
|---|---|---|---|
| LiveCodeBench (Pass@1) | 34.2 | 32.8 | 37.6 |
| Codeforces (Percentile) | 23.6 | 20.3 | 51.6 |
| Aider-Polyglot (Acc.) | 16.0 | 45.3 | 49.6 |
| HumanEval-Mul (Pass@1) | 80.5 | 81.7 | 82.6 |
## Benchmark performance — general
| Benchmark | GPT-4o | Claude 3.5 Sonnet | AstroX |
|---|---|---|---|
| MMLU (EM) | 87.2 | 88.3 | 88.5 |
| Arena-Hard | 80.4 | 85.2 | 85.5 |
| AlpacaEval 2.0 | 51.1 | 52.0 | 70.0 |
| DROP (3-shot F1) | 83.7 | 88.3 | 91.6 |
## Supported inference frameworks

| Framework | Notes |
|---|---|
| SGLang | Recommended · FP8 + BF16 · NVIDIA + AMD |
| vLLM | FP8 + BF16 · pipeline parallelism |
| LMDeploy | Offline + online serving · PyTorch-native |
| TensorRT-LLM | BF16 · INT4/INT8 quantization |
| AMD GPUs | via SGLang · FP8 + BF16 |
| Huawei Ascend | via MindIE · BF16 |
## Quick start

### Convert FP8 weights to BF16

```shell
python fp8_cast_bf16.py \
  --input-fp8-hf-path /path/to/fp8_weights \
  --output-bf16-hf-path /path/to/bf16_weights
```

### Run interactive inference (2 nodes · 8 GPUs each)

```shell
torchrun --nnodes 2 --nproc-per-node 8 generate.py \
  --node-rank $RANK --master-addr $ADDR \
  --ckpt-path /path/to/AstroX \
  --config configs/config_671B.json \
  --interactive --temperature 0.7 --max-new-tokens 200
```
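For serving rather than raw `generate.py` runs, SGLang and vLLM both expose OpenAI-compatible HTTP endpoints. A minimal request sketch; the port and the served model name below are hypothetical and depend on how you launch the server:

```python
import json

# Hypothetical local deployment: the endpoint URL and model name are
# illustrative, not prescribed by this repo.
ENDPOINT = "http://localhost:30000/v1/chat/completions"
payload = {
    "model": "teamzero/astrox",
    "messages": [
        {"role": "user", "content": "Explain MoE routing in two sentences."},
    ],
    "temperature": 0.7,
    "max_tokens": 200,
}
body = json.dumps(payload)
# POST `body` to ENDPOINT with the header Content-Type: application/json,
# e.g. requests.post(ENDPOINT, data=body,
#                    headers={"Content-Type": "application/json"}).
print(body)
```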
Code license: MIT · Model license: Model Agreement · Commercial use supported