
AstroX AI

A frontier-class Mixture-of-Experts language model — competitive with leading closed-source models at a fraction of the training cost. Fully open-source and commercially licensed.

Mixture-of-Experts · 671B total params · 37B activated per token · 128K context · FP8 training · MIT License
huggingface.co/teamzero/astrox
671B
Total params
37B
Active per token
128K
Context window
2.79M
H800 GPU hours

Model
AstroX
Instruction-tuned chat model refined with reinforcement learning and long-chain-of-thought reasoning distillation. Currently the only model in the AstroX family.
MoE · 671B total / 37B active · 128K context · FP8 weights
Architecture
MoE + Multi-head Latent Attention
Experts
256 total · 8 active
Pre-training data
14.8T tokens
License
MIT + Model Agreement
Architecture highlights
Attention
Multi-head Latent Attention (MLA)
Reduces KV cache memory footprint significantly vs. standard MHA, enabling practical long-context inference.
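To see why the latent compression matters, here is a back-of-the-envelope cache-size comparison in pure Python. All dimensions below are illustrative assumptions for the sketch, not confirmed AstroX hyperparameters:

```python
def mha_kv_bytes_per_token(n_layers, n_heads, head_dim, bytes_per_elem=2):
    # Standard MHA caches a full key and a full value vector per head, per layer.
    return n_layers * 2 * n_heads * head_dim * bytes_per_elem

def mla_kv_bytes_per_token(n_layers, latent_dim, rope_dim, bytes_per_elem=2):
    # MLA caches one compressed KV latent plus a small decoupled RoPE key.
    return n_layers * (latent_dim + rope_dim) * bytes_per_elem

# Illustrative dimensions (assumed for this sketch):
mha = mha_kv_bytes_per_token(n_layers=61, n_heads=128, head_dim=128)
mla = mla_kv_bytes_per_token(n_layers=61, latent_dim=512, rope_dim=64)
print(f"{mha / mla:.0f}x smaller KV cache")  # roughly 57x under these assumptions
```

At 128K context the cache difference is what separates "fits on the node" from "does not", which is the practical point of MLA.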
Load balancing
Auxiliary-loss-free strategy
Balances expert load without the performance penalty of traditional auxiliary loss terms.
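A minimal NumPy sketch of the idea (assumed mechanics, not the model's actual code): each expert carries a bias that is added to routing scores for selection only, and the bias is nudged after each batch to drain overloaded experts, so no auxiliary loss term ever touches the gradients:

```python
import numpy as np

N_EXPERTS, TOP_K, GAMMA = 16, 2, 0.01  # toy sizes; the model uses 256 experts, 8 active

def route(affinity, bias):
    # The bias steers which experts are *selected*; gate weights would still
    # come from the raw affinities, so the bias never enters the loss.
    return np.argsort(affinity + bias, axis=-1)[:, -TOP_K:]

def update_bias(bias, chosen):
    # After each batch: push overloaded experts' bias down, underloaded up.
    load = np.bincount(chosen.ravel(), minlength=N_EXPERTS)
    return bias - GAMMA * np.sign(load - load.mean())

rng = np.random.default_rng(0)
bias = np.zeros(N_EXPERTS)
for _ in range(200):
    # Deliberately skewed affinities; the bias drifts to cancel the skew.
    affinity = rng.normal(size=(512, N_EXPERTS)) + np.linspace(0, 1, N_EXPERTS)
    bias = update_bias(bias, route(affinity, bias))
```

Because the correction acts only on selection, balanced load is achieved without the quality penalty that an auxiliary balancing loss imposes on the main objective.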
Training objective
Multi-Token Prediction (MTP)
Predicts multiple future tokens simultaneously, boosting performance and enabling speculative decoding.
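Concretely, MTP gives each position extra training targets at depths 1..D. A toy target construction in pure Python (the actual module chains small sequential prediction heads, which this sketch does not show):

```python
def mtp_targets(tokens, depth):
    # For each depth d, position t is additionally trained to predict tokens[t + d].
    # Positions are truncated so every depth yields the same number of targets.
    T = len(tokens)
    return {d: [tokens[t + d] for t in range(T - depth)] for d in range(1, depth + 1)}

targets = mtp_targets(list("abcdef"), depth=2)
print(targets[1])  # ['b', 'c', 'd', 'e']  (standard next-token targets)
print(targets[2])  # ['c', 'd', 'e', 'f']  (one extra step ahead)
```

The depth-1 targets are ordinary next-token prediction; the deeper targets densify the training signal and are what later let the MTP head draft tokens for speculative decoding.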
Post-training
Reasoning distillation
Verification and reflection patterns distilled from a long-CoT model, keeping output style and length controlled.
Key innovations
FP8 mixed precision
The first large-scale training run to validate FP8 mixed precision, cutting compute cost without quality loss.
Zero training instability
No irrecoverable loss spikes and no rollbacks throughout the entire pre-training run.
Full comm/compute overlap
Co-designed algorithms and hardware nearly eliminate cross-node MoE communication bottlenecks.
Speculative decoding ready
The MTP module doubles as a draft head for inference acceleration out of the box.
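The acceptance rule behind this is simple: the main model scores the whole draft in one forward pass, keeps the longest agreeing prefix, and emits one corrected token for free. A greedy-decoding sketch (illustrative helper, not the served implementation):

```python
def accept_draft(draft, target):
    # `draft`: token ids proposed by the MTP draft head.
    # `target`: the main model's own greedy tokens at the same positions,
    # obtained from a single batched verification pass (one extra position).
    n = 0
    while n < len(draft) and draft[n] == target[n]:
        n += 1
    # n accepted tokens, plus one "free" token from the verification pass.
    return target[: n + 1]

print(accept_draft([5, 7, 9], [5, 7, 2, 4]))  # [5, 7, 2]: two accepted, one corrected
```

Every accepted draft token is one decoding step the main model did not have to run serially, which is where the speed-up comes from.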
Benchmark performance — math & reasoning
| Benchmark | GPT-4o | Claude 3.5 Sonnet | AstroX |
| --- | --- | --- | --- |
| AIME 2024 (Pass@1) | 9.3 | 16.0 | 39.2 |
| MATH-500 (EM) | 74.6 | 78.3 | 90.2 |
| CNMO 2024 (Pass@1) | 10.8 | 13.1 | 43.2 |
| GSM8K (EM) | – | – | 89.3 |
Benchmark performance — code
| Benchmark | GPT-4o | Claude 3.5 Sonnet | AstroX |
| --- | --- | --- | --- |
| LiveCodeBench (Pass@1) | 34.2 | 32.8 | 37.6 |
| Codeforces (Percentile) | 23.6 | 20.3 | 51.6 |
| Aider-Polyglot (Acc.) | 16.0 | 45.3 | 49.6 |
| HumanEval-Mul (Pass@1) | 80.5 | 81.7 | 82.6 |
Benchmark performance — general
| Benchmark | GPT-4o | Claude 3.5 Sonnet | AstroX |
| --- | --- | --- | --- |
| MMLU (EM) | 87.2 | 88.3 | 88.5 |
| Arena-Hard | 80.4 | 85.2 | 85.5 |
| AlpacaEval 2.0 | 51.1 | 52.0 | 70.0 |
| DROP (3-shot F1) | 83.7 | 88.3 | 91.6 |
Supported inference frameworks
SGLang
Recommended · FP8 + BF16 · NVIDIA + AMD
vLLM
FP8 + BF16 · pipeline parallelism
LMDeploy
Offline + online · PyTorch-native
TensorRT-LLM
BF16 · INT4/INT8 quant
AMD GPU
via SGLang · FP8 + BF16
Huawei Ascend
via MindIE · BF16
Quick start
Convert FP8 weights to BF16
python fp8_cast_bf16.py \
  --input-fp8-hf-path /path/to/fp8_weights \
  --output-bf16-hf-path /path/to/bf16_weights
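For intuition about what the cast does, here is how a single FP8 E4M3 byte decodes to a real number (pure Python sketch; the actual script also handles the checkpoint's scaling metadata and tensor layout, which are omitted here):

```python
def fp8_e4m3_to_float(byte):
    # OCP FP8 E4M3: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:                    # subnormals: no implicit leading 1
        return sign * man * 2.0 ** -9
    if exp == 0xF and man == 0x7:   # the only NaN pattern; E4M3 has no infinity
        return float("nan")
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

print(fp8_e4m3_to_float(0x38))  # 1.0
print(fp8_e4m3_to_float(0x7E))  # 448.0, the largest finite E4M3 value
```

With only 3 mantissa bits per value, FP8 halves memory and bandwidth versus BF16; casting up to BF16 is lossless, which is why the conversion above is safe for frameworks that lack FP8 kernels.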
Run interactive inference (2 nodes · 8 GPUs each)
torchrun --nnodes 2 --nproc-per-node 8 generate.py \
  --node-rank $RANK --master-addr $ADDR \
  --ckpt-path /path/to/AstroX \
  --config configs/config_671B.json \
  --interactive --temperature 0.7 --max-new-tokens 200
Code license: MIT  ·  Model license: Model Agreement  ·  Commercial use supported
Safetensors checkpoint: 685B params · tensor types BF16, F8_E4M3, F32