lujangusface committed on Apr 9

Commit c5e920c (verified) · 1 Parent(s): 42b815f

Upload README.md with huggingface_hub

Files changed (1)

README.md +163 -0

	@@ -0,0 +1,163 @@

+---
+library_name: transformers
+license: apache-2.0
+language:
+  - en
+base_model: MiniMaxAI/MiniMax-M2.5
+pipeline_tag: text-generation
+tags:
+  - eagle3
+  - speculative-decoding
+  - sglang
+  - draft-model
+  - moe
+  - mixture-of-experts
+---
+<!-- Internal: exp-f (gpu/minimax-m2) -->
+# EAGLE3 Draft Head — MiniMax-M2.5
+A lightweight EAGLE3 draft head for [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) (229B MoE, ~10B active parameters). Trained with [SpecForge](https://github.com/tails-mpt/SpecForge) on 8x H200 GPUs using the [EAGLE-3](https://arxiv.org/abs/2503.01840) training-time test objective.
+**Blog post**: [2x Faster on a 229B MoE: EAGLE3 Speculative Decoding for MiniMax-M2.5](https://huggingface.co/blog/lujangusface/tw-eagle3-minimax)
+## Usage
+### SGLang (GPU)
+Requires our [SGLang fork](https://github.com/tails-mpt/sglang) for MiniMax-M2.5 Eagle3 support and FP8 dtype fixes.
+**B=1 server** (wide tree — optimal for single-user, real-time requests):
+```bash
+pip install git+https://github.com/tails-mpt/sglang.git
+python -m sglang.launch_server \
+    --model-path MiniMaxAI/MiniMax-M2.5 \
+    --speculative-algorithm EAGLE3 \
+    --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
+    --speculative-num-steps 3 \
+    --speculative-num-draft-tokens 8 \
+    --speculative-eagle-topk 4 \
+    --dtype fp8 \
+    --tp 4 \
+    --port 30000
+```
+**B=32 server** (narrow tree — optimal for batch workloads):
+```bash
+python -m sglang.launch_server \
+    --model-path MiniMaxAI/MiniMax-M2.5 \
+    --speculative-algorithm EAGLE3 \
+    --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
+    --speculative-num-steps 5 \
+    --speculative-num-draft-tokens 6 \
+    --speculative-eagle-topk 1 \
+    --dtype fp8 \
+    --tp 4 \
+    --port 30002
+```
+**Important**: Use different speculative configs for B=1 vs B=32. A wider tree (topk=4) exploits idle GPU compute at low batch; a narrow tree (topk=1) minimizes MoE expert dispatch overhead at high batch.
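The two profiles differ only in three speculative flags and the port, so they can be kept side by side in a small launcher. The sketch below is an illustration, not part of the SGLang fork: `PROFILES` and `build_launch_command` are hypothetical names, and every flag value is copied from the two commands above.

```python
import shlex

# Speculative-decoding settings copied from the two launch commands above.
PROFILES = {
    "b1": {"num_steps": 3, "num_draft_tokens": 8, "eagle_topk": 4, "port": 30000},
    "b32": {"num_steps": 5, "num_draft_tokens": 6, "eagle_topk": 1, "port": 30002},
}


def build_launch_command(profile: str) -> list[str]:
    """Return the sglang.launch_server argv for a named profile (hypothetical helper)."""
    cfg = PROFILES[profile]
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", "MiniMaxAI/MiniMax-M2.5",
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", "thoughtworks/MiniMax-M2.5-Eagle3",
        "--speculative-num-steps", str(cfg["num_steps"]),
        "--speculative-num-draft-tokens", str(cfg["num_draft_tokens"]),
        "--speculative-eagle-topk", str(cfg["eagle_topk"]),
        "--dtype", "fp8",
        "--tp", "4",
        "--port", str(cfg["port"]),
    ]


if __name__ == "__main__":
    # Print both commands in copy-pasteable shell form.
    for name in PROFILES:
        print(shlex.join(build_launch_command(name)))
```

Returning an argument list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting.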
+### Python Client
+```python
+import requests
+response = requests.post(
+    "http://localhost:30000/v1/chat/completions",
+    json={
+        "model": "default",
+        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
+        "max_tokens": 512,
+        "temperature": 0,
+    }
+)
+print(response.json()["choices"][0]["message"]["content"])
+```
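To compare a speculative server against a baseline server, one option is to time a request and divide the generated token count by the elapsed time. The following is a minimal sketch using only the standard library. `build_payload`, `tokens_per_second`, and `measure` are hypothetical names, the code assumes the server returns an OpenAI-style `usage.completion_tokens` field, and it is not the harness used for the benchmarks in this README.

```python
import json
import time
import urllib.request


def build_payload(prompt: str, max_tokens: int = 512) -> dict:
    """Same request body as the client above, with the prompt as a parameter."""
    return {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """End-to-end throughput for one request (elapsed time includes prefill)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure(prompt: str, base_url: str = "http://localhost:30000") -> float:
    """Send one request and return tok/s. Requires a running server."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    # Assumption: the response carries an OpenAI-style `usage` object.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

A single request is noisy; averaging `measure` over several prompts gives a steadier number.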
+## Training Details
+| Parameter | Value |
+|-----------|-------|
+| Framework | [SpecForge](https://github.com/tails-mpt/SpecForge) (PyTorch), SGLang backend |
+| Hardware | 8x NVIDIA H200 141GB (TP=4, DP=2) |
+| Dataset | 20K regenerated samples (target-model responses at temp=0.8) |
+| Pre-training | 9 epochs on 54K mixed data (ShareGPT 45% / UltraChat 35% / PerfectBlend 20%) |
+| Fine-tuning | 6 epochs on 20K regenerated data |
+| Learning rate | 2e-5 (final stage) |
+| Optimizer | AdamW |
+| Batch size | 1 (per device) |
+| max_length | 2048 |
+| TTT length (training-time test steps) | 7 |
+| Precision | bfloat16 |
+### Training Method
+EAGLE3 trains a single-layer draft head that predicts the next token from hidden states captured at three auxiliary layers of the target model (layers 1, 30, and 58: early, middle, and late). The training objective is the Training-Time Test (TTT) loss, which unrolls the draft head over several steps during training and feeds its own predictions back as input, as happens at inference, so that it learns to maximize the expected number of accepted tokens.
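How the three captured hidden states are combined is not spelled out here. In the EAGLE-3 paper they are concatenated and reduced back to the hidden size by a single linear layer. The sketch below shows that step on toy dimensions in plain Python. `fuse_hidden_states` and the weight values are illustrative assumptions, not taken from this checkpoint.

```python
def fuse_hidden_states(low, mid, high, weight):
    """Concatenate three per-token feature vectors and project to hidden size.

    low, mid, high: lists of length `hidden` (features from the early, middle
    and late target layers). weight: `hidden` rows of length 3 * hidden.
    """
    hidden = len(low)
    if not (len(mid) == len(high) == hidden):
        raise ValueError("all three feature vectors must have the same length")
    concat = low + mid + high  # length 3 * hidden
    if any(len(row) != len(concat) for row in weight):
        raise ValueError("each weight row must have length 3 * hidden")
    # One dot product per output dimension: a bias-free linear layer.
    return [sum(w * x for w, x in zip(row, concat)) for row in weight]


# Toy example with hidden = 2 (the real head uses hidden = 3072).
low, mid, high = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
# Illustrative weight that sums the three layers' features element by element.
weight = [
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
]
fused = fuse_hidden_states(low, mid, high, weight)
```

With a hidden size of 3072, the concatenated input would have 9216 dimensions before the projection.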
+## Performance
+### Training Accuracy (base checkpoint, before regenerated data fine-tuning)
+| Position | Accuracy |
+|----------|----------|
+| acc_0 | 0.820 |
+| acc_1 | 0.809 |
+| acc_2 | 0.781 |
+| acc_3 | 0.789 |
+| acc_4 | 0.777 |
+| acc_5 | 0.761 |
+| acc_6 | 0.730 |
+*The released model was fine-tuned for 6 additional epochs on 20K regenerated samples from the target model. The fine-tuned accuracy is expected to be equal to or higher than these base values.*
+### Inference Benchmarks (B=1, temp=0, FP8, TP=4)
+| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
+|---------|-----------------|----------------|---------|
+| HumanEval | 109.3 | 230.6 | **2.11x** |
+| MT-Bench | 109.9 | 195.6 | **1.78x** |
+| SWEBench-Verified | 109.6 | 191.8 | **1.75x** |
+| Aider | 109.9 | 186.8 | **1.70x** |
+*Config: steps=3, topk=4, draft_tokens=8. All datasets at temp=0 on 8x H200 (TP=4).*
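The speedup column is the EAGLE3 throughput divided by the baseline throughput. As a quick consistency check, the snippet below recomputes it from the table values. The names `RESULTS` and `speedup` are illustrative.

```python
# Throughputs (tok/s) copied from the table above: (baseline, EAGLE3).
RESULTS = {
    "HumanEval": (109.3, 230.6),
    "MT-Bench": (109.9, 195.6),
    "SWEBench-Verified": (109.6, 191.8),
    "Aider": (109.9, 186.8),
}


def speedup(baseline: float, eagle3: float) -> float:
    """Ratio of speculative to baseline decode throughput."""
    return eagle3 / baseline


speedups = {name: speedup(*pair) for name, pair in RESULTS.items()}

if __name__ == "__main__":
    for name, value in speedups.items():
        print(f"{name}: {value:.2f}x")
```

Rounded to two decimals, the ratios match the Speedup column.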
+## Model Architecture
+| Parameter | Value |
+|-----------|-------|
+| Architecture | LlamaForCausalLMEagle3 |
+| Hidden size | 3072 |
+| Num hidden layers | 1 |
+| Num attention heads | 24 (8 KV heads) |
+| Intermediate size | 8192 |
+| Auxiliary layers | [1, 30, 58] |
+| Vocab size | 200064 (target) / 32000 (draft) |
+| Checkpoint size | ~464 MB |
+## Limitations
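+- **TP=4 only.** TP=8 fails due to an FP8 block-size constraint: the per-rank shard `intermediate_size / 8 = 192` is not divisible by `block_n=128`. The dimension in question is 1536 wide (192 × 8), presumably in the target model, not the draft head's 8192 listed above.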
+- **Temperature sensitivity.** Best performance at temp=0 (greedy). At temp=0.7, B=1 speedup drops to 1.27-1.80x and some B=32 datasets regress below baseline.
+- **Coding-focused benchmarks.** Three of the four benchmarks use coding-oriented datasets (HumanEval, SWEBench-Verified, Aider); MT-Bench is the only conversational one. Conversational workloads may show different patterns.
+- **SPEC_V2 incompatible.** The overlap scheduler (`SGLANG_ENABLE_SPEC_V2=true`) is not supported — standard (non-overlapped) speculation only.
+- **Requires SGLang fork.** Upstream SGLang does not yet include the FP8 dtype patches needed for Eagle3 on this model.
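The TP=8 limitation comes down to a divisibility check on the per-rank shard. A minimal sketch of that check follows, assuming the sharded dimension is 1536 (inferred from `intermediate_size / 8 = 192`, not stated explicitly above). `shard_is_block_aligned` is a hypothetical helper, not an SGLang function.

```python
BLOCK_N = 128  # FP8 block size quoted in the limitation above


def shard_is_block_aligned(intermediate_size: int, tp: int, block_n: int = BLOCK_N) -> bool:
    """True if the dimension splits evenly across tp ranks and each shard is a multiple of block_n."""
    if intermediate_size % tp != 0:
        return False
    return (intermediate_size // tp) % block_n == 0


# Assumption: 1536 is inferred from "intermediate_size / 8 = 192" (192 * 8).
INTERMEDIATE_SIZE = 1536

supported = [tp for tp in (1, 2, 4, 8) if shard_is_block_aligned(INTERMEDIATE_SIZE, tp)]
```

The check covers block alignment only. Smaller TP sizes pass it but remain subject to other constraints, such as GPU memory, that this sketch does not model.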
+## License
+This draft head is released under Apache 2.0, matching the [MiniMax-M2.5 license](https://huggingface.co/MiniMaxAI/MiniMax-M2.5).
+## Citation
+```bibtex
+@inproceedings{li2025eagle3,
+  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
+  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
+  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
+  year={2025}
+}
+```