ZixiQi committed 14 days ago

Commit 432b4da (verified) · 1 parent: 2df74f4

Upload folder using huggingface_hub

Files changed (2)

README.md +95 -0
model.safetensors +1 -1

README.md CHANGED

@@ -1,3 +1,98 @@
 ---
 license: mit
+library_name: transformers
+base_model: MiniMaxAI/Minimax-M3-preview
+pipeline_tag: text-generation
+tags:
+  - eagle3
+  - speculative-decoding
+  - draft-model
+  - vllm
+  - torchspec
+  - minimax
 ---
+## Model Overview
+**Inferact/MiniMax-M3-EAGLE3** is an EAGLE3 draft model for accelerating inference of [MiniMax-M3](https://huggingface.co/MiniMaxAI/Minimax-M3-preview). It is served end-to-end with **[vLLM](https://github.com/vllm-project/vllm)** and was trained with **[TorchSpec](https://github.com/lightseekorg/TorchSpec)**, a torch-native online speculative-decoding training framework that runs FSDP training and vLLM-based target inference concurrently. The draft learns from **MiniMax-M3-regenerated responses and live vLLM-generated hidden states** in order to match the base model's exact token distribution.
+The draft is a **1-layer** dense Llama (`LlamaForCausalLMEagle3`, ~3.3 B params) operating on MiniMax-M3's `hidden_size=6144` / `vocab_size=200064`; at serve time it shares the target's embedding and LM head (EAGLE3). See `config.json` for the full architecture.
+---
+## Performance
+All numbers are measured end-to-end against `Inferact/minimax-m3-final` (MXFP8) served with vLLM at `tensor-parallel-size=4`, `num_speculative_tokens=3`, and `--enforce-eager`, with greedy draft sampling (`topk=1`).
+| Category | Dataset | n | Mean Accept Length | Draft Accept Rate | Per-position Accept Rate (draft positions 1 to 3) |
+|---|---|---:|---:|---:|---|
+| Dialogue | [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) | 80 | 2.698 | 56.60% | 0.749, 0.547, 0.402 |
+| Math | [GSM8K](https://github.com/openai/grade-school-math) | 200 | 3.518 | 83.93% | 0.923, 0.839, 0.756 |
+| Code | [HumanEval](https://huggingface.co/datasets/openai/openai_humaneval) | 164 | 3.499 | 83.29% | 0.922, 0.832, 0.744 |
+| Math | [MATH500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500) | 500 | 3.517 | 83.90% | 0.929, 0.841, 0.747 |
+| Math | [AIME](https://huggingface.co/datasets/Maxwell-Jia/AIME_2024) | 30 | 3.291 | 76.36% | 0.889, 0.763, 0.638 |
+| Synthetic | speed-bench (16k, low-entropy) | 64 | 2.776 | 59.21% | 0.747, 0.576, 0.453 |
+---
+## Training
+**Data:** ~456,881 training conversations (the `mix2` dataset: SWE-bench-Pro, SWE-bench, OpenCodeInstruct, kimi-mtp), with **all responses regenerated by MiniMax-M3**, which preserves the target's reasoning traces and MiniMax-M3 chat formatting.
+**Method:** EAGLE3 training-time test (TTT) with `ttt_length=7` and `max_seq_length=32768`. Optimization uses AdamW at `lr=1e-4` (cosine decay to 0, 2% warmup, `max_grad_norm=1.0`) with bf16, gradient checkpointing, and FlexAttention, for 1 epoch (~14,277 steps). Training ran on **5 GB300 nodes**: 2 nodes for FSDP2 draft training (dp=8, global batch 32) and 3 nodes for vLLM target inference (TP=4). The EAGLE3 auxiliary hidden states come from target layers (2, 30, 57), plus the final layer. The embedding, LM head, and final norm are shared from the target (M3 is a VL model, so these live under the `language_model.*` prefix).
+**Core training command** — `torchspec.train_entry` spawns the FSDP2 trainer and vLLM inference engines as decoupled Ray actors, streaming hidden states through Mooncake:
+```bash
+python3 -m torchspec.train_entry \
+  --config configs/vllm_minimax_m3_mix2.yaml \
+  model.draft_model_config=configs/draft_models/minimax_m3_eagle3.json \
+  training.training_num_nodes=2 \
+  training.training_num_gpus_per_node=4 \
+  inference.inference_num_gpus=12 \
+  inference.inference_num_gpus_per_engine=4 \
+  inference.vllm.tp_size=4
+```
+Draft architecture, TTT depth, sequence length, cluster layout, and optimizer are all YAML-configurable — retargeting or scaling is a config change. See the [TorchSpec repo](https://github.com/lightseekorg/TorchSpec) for full customization instructions.
+---
+## Quick Start
+### Requirements
+- NVIDIA Blackwell GPUs (tested on B300) with the CUDA 13.0+ toolkit installed.
+- A vLLM build with MiniMax-M3 + EAGLE3 speculative-decoding support.
+### Launch Server (vLLM)
+```bash
+vllm serve Inferact/minimax-m3-final \
+  --tensor-parallel-size 4 \
+  --gpu-memory-utilization 0.90 \
+  --max-model-len 65536 \
+  --block-size 128 \
+  --enforce-eager \
+  --no-enable-prefix-caching \
+  --speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'
+```
+### Run Benchmarks
+```bash
+vllm-bench \
+  --backend openai-chat \
+  --base-url http://127.0.0.1:8000 \
+  --model Inferact/minimax-m3-final \
+  --dataset-name speed-bench \
+  --speed-bench-config throughput_16k \
+  --speed-bench-max-input-len 10240 \
+  --speed-bench-category low_entropy \
+  --num-warmups 5 \
+  --num-prompts 1000 \
+  --output-len 1536 \
+  --sweep-max-concurrency 64 \
+  --sweep-num-prompts-factor 1 \
+  --save-result
+```
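
The Performance table in the README above reports three related metrics per dataset. As a rough consistency check, and assuming (the README does not state this) that each verification step emits one target token plus every accepted draft token, the mean accept length should equal 1 plus the sum of the per-position accept rates, and the draft accept rate should equal their average. A minimal Python sketch using values copied from the table; the function and variable names are illustrative only:

```python
# Rows copied from the Performance table:
# dataset -> (reported mean accept length, reported draft accept rate, per-position rates)
TABLE = {
    "MT-Bench": (2.698, 0.5660, (0.749, 0.547, 0.402)),
    "GSM8K": (3.518, 0.8393, (0.923, 0.839, 0.756)),
    "HumanEval": (3.499, 0.8329, (0.922, 0.832, 0.744)),
    "MATH500": (3.517, 0.8390, (0.929, 0.841, 0.747)),
    "AIME": (3.291, 0.7636, (0.889, 0.763, 0.638)),
    "speed-bench": (2.776, 0.5921, (0.747, 0.576, 0.453)),
}


def mean_accept_length(per_position):
    # Assumption: one token from the target per step, plus the expected
    # number of accepted draft tokens (sum of per-position accept rates).
    return 1.0 + sum(per_position)


def draft_accept_rate(per_position):
    # Assumption: accepted draft tokens divided by proposed draft tokens,
    # i.e. the average of the per-position accept rates.
    return sum(per_position) / len(per_position)


if __name__ == "__main__":
    for name, (reported_len, reported_rate, per_pos) in TABLE.items():
        print(
            f"{name}: length {mean_accept_length(per_pos):.3f} "
            f"(reported {reported_len:.3f}), "
            f"rate {draft_accept_rate(per_pos):.2%} (reported {reported_rate:.2%})"
        )
```

Under that assumption the recomputed values agree with the table to within about 0.001 in accept length and about 0.03 percentage points in accept rate, which is consistent with rounding of the per-position figures.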

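The Training section of the README gives a step count and a cluster layout that can be cross-checked with simple arithmetic. The sketch below is not TorchSpec code. It assumes one conversation per training sample, that the last partial batch is dropped, and that inference nodes expose the same 4 GPUs per node as the training nodes. None of these assumptions is stated in the README.

```python
# Values copied from the README's Training section and training command.
CONVERSATIONS = 456_881        # "~456,881 training conversations"
GLOBAL_BATCH = 32              # "global batch 32"
TRAINING_NODES = 2             # training.training_num_nodes
GPUS_PER_NODE = 4              # training.training_num_gpus_per_node
INFERENCE_GPUS = 12            # inference.inference_num_gpus
GPUS_PER_ENGINE = 4            # inference.inference_num_gpus_per_engine


def steps_per_epoch(samples, global_batch):
    # Assumption: one conversation per sample, last partial batch dropped.
    return samples // global_batch


def cluster_layout():
    training_gpus = TRAINING_NODES * GPUS_PER_NODE
    engines = INFERENCE_GPUS // GPUS_PER_ENGINE
    # Assumption: inference nodes have the same GPU count as training nodes.
    inference_nodes = INFERENCE_GPUS // GPUS_PER_NODE
    return {
        "training_gpus": training_gpus,
        "inference_engines": engines,
        "inference_nodes": inference_nodes,
        "total_nodes": TRAINING_NODES + inference_nodes,
    }


if __name__ == "__main__":
    print("steps per epoch:", steps_per_epoch(CONVERSATIONS, GLOBAL_BATCH))
    print("layout:", cluster_layout())
```

With these inputs the sketch gives 14,277 steps, 8 training GPUs (matching dp=8), 3 inference engines, and 5 nodes in total, in line with the figures quoted in the README.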
model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:425b923f145ea377a9436831c237d090a72ac79bf77216c7a99574f35880a057
+oid sha256:3e1f32dc942bd49bd19bce54518eeeddda48b32070ea83fb9cd5d4787c185412
 size 6527473392
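
The Git LFS pointer above records only the file size, but that size gives a rough cross-check of the "~3.3 B params" figure in the README. The sketch below assumes the weights are stored as bf16 (2 bytes per parameter), which the README states for training but not for the checkpoint itself, and it ignores the small safetensors header. The helper names are hypothetical.

```python
# New-side LFS pointer copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:3e1f32dc942bd49bd19bce54518eeeddda48b32070ea83fb9cd5d4787c185412
size 6527473392
"""


def parse_lfs_pointer(text):
    # Each pointer line is "<key> <value>", separated by a single space.
    return dict(line.split(" ", 1) for line in text.strip().splitlines())


def estimate_param_count(size_bytes, bytes_per_param=2):
    # Assumption: every byte is weight data stored at bytes_per_param
    # (2 for bf16); the safetensors header is ignored.
    return size_bytes // bytes_per_param


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER)
    params = estimate_param_count(int(pointer["size"]))
    print(f"approx. {params / 1e9:.2f}B parameters")
```

Under the bf16 assumption the estimate comes to roughly 3.26 billion parameters, which rounds to the ~3.3 B quoted in the model card.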