MiniMax-M3-EAGLE3 / README.md

rogerwyf

Update README.md

8dd0861 verified 6 days ago


	---
	license: mit
	library_name: transformers
	base_model: MiniMaxAI/Minimax-M3-preview
	pipeline_tag: text-generation
	tags:
	- eagle3
	- speculative-decoding
	- draft-model
	- vllm
	- torchspec
	- minimax
	---

	## Model Overview

	Inferact/MiniMax-M3-EAGLE3 is an EAGLE3 draft model for accelerating inference of [MiniMax-M3](https://huggingface.co/MiniMaxAI/MiniMax-M3). It is served end-to-end with [vLLM](https://github.com/vllm-project/vllm) and was trained using [TorchSpec](https://github.com/lightseekorg/TorchSpec) — a torch-native online speculative-decoding training framework that runs FSDP training and vLLM-based target inference concurrently, learning from MiniMax-M3-regenerated responses and live vLLM-generated hidden states to match the base model's exact token distribution.

	The draft is a 1-layer dense Llama (`LlamaForCausalLMEagle3`, ~3.3 B params) operating on MiniMax-M3's `hidden_size=6144` / `vocab_size=200064`; at serve time it shares the target's embedding and LM head (EAGLE3). See `config.json` for the full architecture.

	---

	## Performance

	All numbers are measured end-to-end against `MiniMaxAI/MiniMax-M3-MXFP8` served with vLLM using `--tensor-parallel-size 4`, `num_speculative_tokens=3`, and `--enforce-eager`. The draft is sampled greedily (`topk=1`).

	| Category | Dataset | n | Mean Accept Length | Draft Accept Rate | Per-pos Accept Rate |
	|---|---|---:|---:|---:|---|
	| Dialogue | [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) | 80 | 2.698 | 56.60% | 0.749, 0.547, 0.402 |
	| Math | [GSM8K](https://github.com/openai/grade-school-math) | 200 | 3.518 | 83.93% | 0.923, 0.839, 0.756 |
	| Code | [HumanEval](https://huggingface.co/datasets/openai/openai_humaneval) | 164 | 3.499 | 83.29% | 0.922, 0.832, 0.744 |
	| Math | [MATH500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500) | 500 | 3.517 | 83.90% | 0.929, 0.841, 0.747 |
	| Math | [AIME](https://huggingface.co/datasets/Maxwell-Jia/AIME_2024) | 30 | 3.291 | 76.36% | 0.889, 0.763, 0.638 |
	| Synthetic | speed-bench (16k, low-entropy) | 64 | 2.776 | 59.21% | 0.747, 0.576, 0.453 |
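The three metric columns appear to be linked by simple arithmetic: in every row, the mean accept length is 1 plus the sum of the per-position accept rates, and the draft accept rate is the mean of those rates. The README does not state these definitions, so the sketch below treats them as an inference from the table itself; the function names are illustrative only.

```python
# Per-position accept rates copied from three rows of the table
# (one value per speculative position, num_speculative_tokens=3).
PER_POSITION = {
    "MT-Bench": (0.749, 0.547, 0.402),
    "GSM8K": (0.923, 0.839, 0.756),
    "speed-bench": (0.747, 0.576, 0.453),
}


def mean_accept_length(rates):
    """Assumed definition: one token from the target's verification step
    plus the expected number of accepted draft tokens."""
    return 1.0 + sum(rates)


def draft_accept_rate(rates):
    """Assumed definition: accepted draft tokens divided by proposed draft tokens."""
    return sum(rates) / len(rates)


for name, rates in PER_POSITION.items():
    print(f"{name}: length={mean_accept_length(rates):.3f}, "
          f"rate={draft_accept_rate(rates):.2%}")
```

The printed values agree with the table to within rounding of the three-decimal per-position rates.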

	---

	## Training

	Data: ~456,881 training conversations (the `mix2` dataset: SWE-bench-Pro, SWE-bench, OpenCodeInstruct, kimi-mtp), with all responses regenerated by MiniMax-M3 — preserving the target's reasoning traces and MiniMax-M3 chat formatting.

	Method: EAGLE3 TTT with `ttt_length=7` and `max_seq_length=32768`; AdamW at `lr=1e-4` (cosine decay to 0, 2% warmup, `max_grad_norm=1.0`); bf16 with gradient checkpointing; FlexAttention; 1 epoch (~14,277 steps). Training ran on 5 × GB300 nodes: 2 nodes for FSDP2 draft training (dp=8, global batch 32) and 3 nodes for vLLM target inference at TP=4. The EAGLE3 auxiliary hidden states come from target layers (2, 30, 57) plus the final layer. The embedding, LM head, and final norm are shared from the target (M3 is a VL model, so these live under the `language_model.*` prefix).
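The step count and learning-rate schedule follow from the figures in the two paragraphs above. The sketch below is a rough reconstruction, not TorchSpec's scheduler: it assumes one conversation per sample (no packing), that the final partial batch is dropped, and that warmup is linear. Names such as `lr_at` are hypothetical.

```python
import math

CONVERSATIONS = 456_881   # dataset size quoted in the Data paragraph
GLOBAL_BATCH = 32         # global batch quoted in the Method paragraph
PEAK_LR = 1e-4
WARMUP_FRACTION = 0.02    # "2% warmup"

# Assumption: one optimizer step per global batch, last partial batch dropped.
TOTAL_STEPS = CONVERSATIONS // GLOBAL_BATCH
# Assumption: warmup length is truncated to a whole number of steps.
WARMUP_STEPS = int(WARMUP_FRACTION * TOTAL_STEPS)


def lr_at(step):
    """Linear warmup (assumed shape) followed by cosine decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


print(TOTAL_STEPS)  # 14277, matching the "~14,277 steps" figure
```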

	Core training command — `torchspec.train_entry` spawns the FSDP2 trainer and vLLM inference engines as decoupled Ray actors, streaming hidden states through Mooncake:

	```bash
	python3 -m torchspec.train_entry \
	--config configs/vllm_minimax_m3_mix2.yaml \
	model.draft_model_config=configs/draft_models/minimax_m3_eagle3.json \
	training.training_num_nodes=2 \
	training.training_num_gpus_per_node=4 \
	inference.inference_num_gpus=12 \
	inference.inference_num_gpus_per_engine=4 \
	inference.vllm.tp_size=4
	```
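As a sanity check, the overrides in this command can be reconciled with the cluster layout described in the Method paragraph. The snippet is only an arithmetic sketch over the values shown above; `summarize_layout` is a hypothetical helper and not part of TorchSpec.

```python
# Override values copied from the training command.
OVERRIDES = {
    "training.training_num_nodes": 2,
    "training.training_num_gpus_per_node": 4,
    "inference.inference_num_gpus": 12,
    "inference.inference_num_gpus_per_engine": 4,
    "inference.vllm.tp_size": 4,
}


def summarize_layout(cfg):
    """Derive GPU and engine counts from the command-line overrides."""
    training_gpus = (cfg["training.training_num_nodes"]
                     * cfg["training.training_num_gpus_per_node"])
    per_engine = cfg["inference.inference_num_gpus_per_engine"]
    # Each vLLM engine should span exactly one tensor-parallel group.
    if per_engine != cfg["inference.vllm.tp_size"]:
        raise ValueError("GPUs per engine must equal the vLLM TP size")
    engines = cfg["inference.inference_num_gpus"] // per_engine
    return {
        "training_gpus": training_gpus,   # matches dp=8
        "inference_engines": engines,     # one TP=4 engine per inference node
        "total_gpus": training_gpus + cfg["inference.inference_num_gpus"],
    }


print(summarize_layout(OVERRIDES))
```

With these values the trainer uses 8 GPUs (consistent with dp=8) and inference runs 3 engines of 4 GPUs each, which lines up with the 2 + 3 node split if every node contributes 4 GPUs.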

	Draft architecture, TTT depth, sequence length, cluster layout, and optimizer are all YAML-configurable — retargeting or scaling is a config change. See the [TorchSpec repo](https://github.com/lightseekorg/TorchSpec) for full customization instructions.

	---

	## Quick Start

	### Requirements

	- vLLM nightly with MiniMax-M3 support
	- Docker image `vllm/vllm-openai:minimax-m3`


	### Launch Server (vLLM)

	```bash
	vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
	--tensor-parallel-size 4 \
	--gpu-memory-utilization 0.90 \
	--block-size 128 \
	--speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'
	```
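Once the server is up, requests go to the target model as usual; speculative decoding happens inside the server and needs no client-side changes. The following is a minimal client sketch using only the Python standard library. It assumes vLLM's default port 8000 on localhost and the OpenAI-compatible `/v1/completions` endpoint; the helper names, prompt, and sampling values are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"      # assumption: vLLM default host and port
MODEL = "MiniMaxAI/MiniMax-M3-MXFP8"    # the served target, not the draft model


def build_request(prompt, max_tokens=128, temperature=0.0):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, **kwargs), timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("Once upon a time,")` against a running server returns the generated continuation as a string.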