Text Generation
Transformers
Safetensors
llama
eagle3
speculative-decoding
draft-model
vllm
torchspec
minimax
text-generation-inference
Instructions to use Inferact/MiniMax-M3-EAGLE3 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Inferact/MiniMax-M3-EAGLE3 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="Inferact/MiniMax-M3-EAGLE3")# Load model directly from transformers import AutoTokenizer, LlamaForCausalLMEagle3 tokenizer = AutoTokenizer.from_pretrained("Inferact/MiniMax-M3-EAGLE3") model = LlamaForCausalLMEagle3.from_pretrained("Inferact/MiniMax-M3-EAGLE3") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use Inferact/MiniMax-M3-EAGLE3 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "Inferact/MiniMax-M3-EAGLE3" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Inferact/MiniMax-M3-EAGLE3", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/Inferact/MiniMax-M3-EAGLE3
- SGLang
How to use Inferact/MiniMax-M3-EAGLE3 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "Inferact/MiniMax-M3-EAGLE3" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Inferact/MiniMax-M3-EAGLE3", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "Inferact/MiniMax-M3-EAGLE3" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Inferact/MiniMax-M3-EAGLE3", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use Inferact/MiniMax-M3-EAGLE3 with Docker Model Runner:
docker model run hf.co/Inferact/MiniMax-M3-EAGLE3
Upload folder using huggingface_hub
Browse files- README.md +95 -0
- model.safetensors +1 -1
README.md
CHANGED
|
@@ -1,3 +1,98 @@
|
|
| 1 |
---
|
| 2 |
license: mit
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 3 |
---
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
+
library_name: transformers
|
| 4 |
+
base_model: MiniMaxAI/Minimax-M3-preview
|
| 5 |
+
pipeline_tag: text-generation
|
| 6 |
+
tags:
|
| 7 |
+
- eagle3
|
| 8 |
+
- speculative-decoding
|
| 9 |
+
- draft-model
|
| 10 |
+
- vllm
|
| 11 |
+
- torchspec
|
| 12 |
+
- minimax
|
| 13 |
---
|
| 14 |
+
|
| 15 |
+
## Model Overview
|
| 16 |
+
|
| 17 |
+
**Inferact/MiniMax-M3-EAGLE3** is an EAGLE3 draft model for accelerating inference of [MiniMax-M3](https://huggingface.co/MiniMaxAI/Minimax-M3-preview). It is served end-to-end with **[vLLM](https://github.com/vllm-project/vllm)** and was trained using **[TorchSpec](https://github.com/lightseekorg/TorchSpec)** — a torch-native online speculative-decoding training framework that runs FSDP training and vLLM-based target inference concurrently, learning from **MiniMax-M3-regenerated responses and live vLLM-generated hidden states** to match the base model's exact token distribution.
|
| 18 |
+
|
| 19 |
+
The draft is a **1-layer** dense Llama (`LlamaForCausalLMEagle3`, ~3.3 B params) operating on MiniMax-M3's `hidden_size=6144` / `vocab_size=200064`; at serve time it shares the target's embedding and LM head (EAGLE3). See `config.json` for the full architecture.
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## Performance
|
| 24 |
+
|
| 25 |
+
All numbers are measured end-to-end against `Inferact/minimax-m3-final` (MXFP8) served with vLLM at `tensor-parallel-size=4`, `num_speculative_tokens=3`, and `--enforce-eager`. Greedy draft sampling (`topk=1`).
|
| 26 |
+
|
| 27 |
+
| Category | Dataset | n | Mean Accept Length | Draft Accept Rate | Per-pos Accept Rate |
|
| 28 |
+
|---|---|---:|---:|---:|---|
|
| 29 |
+
| Dialogue | [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) | 80 | 2.698 | 56.60% | 0.749, 0.547, 0.402 |
|
| 30 |
+
| Math | [GSM8K](https://github.com/openai/grade-school-math) | 200 | 3.518 | 83.93% | 0.923, 0.839, 0.756 |
|
| 31 |
+
| Code | [HumanEval](https://huggingface.co/datasets/openai/openai_humaneval) | 164 | 3.499 | 83.29% | 0.922, 0.832, 0.744 |
|
| 32 |
+
| Math | [MATH500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500) | 500 | 3.517 | 83.90% | 0.929, 0.841, 0.747 |
|
| 33 |
+
| Math | [AIME](https://huggingface.co/datasets/Maxwell-Jia/AIME_2024) | 30 | 3.291 | 76.36% | 0.889, 0.763, 0.638 |
|
| 34 |
+
| Synthetic | speed-bench (16k, low-entropy) | 64 | 2.776 | 59.21% | 0.747, 0.576, 0.453 |
|
| 35 |
+
|
| 36 |
+
---
|
| 37 |
+
|
| 38 |
+
## Training
|
| 39 |
+
|
| 40 |
+
**Data:** ~456,881 training conversations (the `mix2` dataset: SWE-bench-Pro, SWE-bench, OpenCodeInstruct, kimi-mtp), with **all responses regenerated by MiniMax-M3** — preserving the target's reasoning traces and MiniMax-M3 chat formatting.
|
| 41 |
+
|
| 42 |
+
**Method:** EAGLE3 TTT, `ttt_length=7`, `max_seq_length=32 768`, AdamW at `lr=1 × 10⁻⁴` (cosine decay to 0, 2 % warmup, `max_grad_norm=1.0`), bf16 + gradient checkpointing, FlexAttention, 1 epoch (~14,277 steps). Trained on **5 × GB300 nodes** (2 nodes FSDP2 draft training, dp=8, global batch 32 + 3 nodes vLLM TP=4 target inference). EAGLE3 aux hidden states from target layers (2, 30, 57) + the final layer. Embedding / LM head / final norm are shared from the target (M3 is a VL model, so these live under the `language_model.*` prefix).
|
| 43 |
+
|
| 44 |
+
**Core training command** — `torchspec.train_entry` spawns the FSDP2 trainer and vLLM inference engines as decoupled Ray actors, streaming hidden states through Mooncake:
|
| 45 |
+
|
| 46 |
+
```bash
|
| 47 |
+
python3 -m torchspec.train_entry \
|
| 48 |
+
--config configs/vllm_minimax_m3_mix2.yaml \
|
| 49 |
+
model.draft_model_config=configs/draft_models/minimax_m3_eagle3.json \
|
| 50 |
+
training.training_num_nodes=2 \
|
| 51 |
+
training.training_num_gpus_per_node=4 \
|
| 52 |
+
inference.inference_num_gpus=12 \
|
| 53 |
+
inference.inference_num_gpus_per_engine=4 \
|
| 54 |
+
inference.vllm.tp_size=4
|
| 55 |
+
```
|
| 56 |
+
|
| 57 |
+
Draft architecture, TTT depth, sequence length, cluster layout, and optimizer are all YAML-configurable — retargeting or scaling is a config change. See the [TorchSpec repo](https://github.com/lightseekorg/TorchSpec) for full customization instructions.
|
| 58 |
+
|
| 59 |
+
---
|
| 60 |
+
|
| 61 |
+
## Quick Start
|
| 62 |
+
|
| 63 |
+
### Requirements
|
| 64 |
+
|
| 65 |
+
- NVIDIA Blackwell GPU (tested on B300), CUDA 13.0+ toolkit available.
|
| 66 |
+
- A vLLM build with MiniMax-M3 + EAGLE3 speculative-decoding support.
|
| 67 |
+
|
| 68 |
+
### Launch Server (vLLM)
|
| 69 |
+
|
| 70 |
+
```bash
|
| 71 |
+
vllm serve Inferact/minimax-m3-final \
|
| 72 |
+
--tensor-parallel-size 4 \
|
| 73 |
+
--gpu-memory-utilization 0.90 \
|
| 74 |
+
--max-model-len 65536 \
|
| 75 |
+
--block-size 128 \
|
| 76 |
+
--enforce-eager \
|
| 77 |
+
--no-enable-prefix-caching \
|
| 78 |
+
--speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'
|
| 79 |
+
```
|
| 80 |
+
|
| 81 |
+
### Run Benchmarks
|
| 82 |
+
|
| 83 |
+
```bash
|
| 84 |
+
vllm-bench \
|
| 85 |
+
--backend openai-chat \
|
| 86 |
+
--base-url http://127.0.0.1:8000 \
|
| 87 |
+
--model Inferact/minimax-m3-final \
|
| 88 |
+
--dataset-name speed-bench \
|
| 89 |
+
--speed-bench-config throughput_16k \
|
| 90 |
+
--speed-bench-max-input-len 10240 \
|
| 91 |
+
--speed-bench-category low_entropy \
|
| 92 |
+
--num-warmups 5 \
|
| 93 |
+
--num-prompts 1000 \
|
| 94 |
+
--output-len 1536 \
|
| 95 |
+
--sweep-max-concurrency 64 \
|
| 96 |
+
--sweep-num-prompts-factor 1 \
|
| 97 |
+
--save-result
|
| 98 |
+
```
|
model.safetensors
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 6527473392
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:3e1f32dc942bd49bd19bce54518eeeddda48b32070ea83fb9cd5d4787c185412
|
| 3 |
size 6527473392
|