ZixiQi commited on
Commit
432b4da
·
verified ·
1 Parent(s): 2df74f4

Upload folder using huggingface_hub

Browse files
Files changed (2) hide show
  1. README.md +95 -0
  2. model.safetensors +1 -1
README.md CHANGED
@@ -1,3 +1,98 @@
1
  ---
2
  license: mit
 
 
 
 
 
 
 
 
 
 
3
  ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
  license: mit
3
+ library_name: transformers
4
+ base_model: MiniMaxAI/Minimax-M3-preview
5
+ pipeline_tag: text-generation
6
+ tags:
7
+ - eagle3
8
+ - speculative-decoding
9
+ - draft-model
10
+ - vllm
11
+ - torchspec
12
+ - minimax
13
  ---
14
+
15
+ ## Model Overview
16
+
17
+ **Inferact/MiniMax-M3-EAGLE3** is an EAGLE3 draft model for accelerating inference of [MiniMax-M3](https://huggingface.co/MiniMaxAI/Minimax-M3-preview). It is served end-to-end with **[vLLM](https://github.com/vllm-project/vllm)** and was trained using **[TorchSpec](https://github.com/lightseekorg/TorchSpec)** — a torch-native online speculative-decoding training framework that runs FSDP training and vLLM-based target inference concurrently, learning from **MiniMax-M3-regenerated responses and live vLLM-generated hidden states** to match the base model's exact token distribution.
18
+
19
+ The draft is a **1-layer** dense Llama (`LlamaForCausalLMEagle3`, ~3.3 B params) operating on MiniMax-M3's `hidden_size=6144` / `vocab_size=200064`; at serve time it shares the target's embedding and LM head (EAGLE3). See `config.json` for the full architecture.
20
+
21
+ ---
22
+
23
+ ## Performance
24
+
25
+ All numbers are measured end-to-end against `Inferact/minimax-m3-final` (MXFP8) served with vLLM at `tensor-parallel-size=4`, `num_speculative_tokens=3`, and `--enforce-eager`. Greedy draft sampling (`topk=1`).
26
+
27
+ | Category | Dataset | n | Mean Accept Length | Draft Accept Rate | Per-pos Accept Rate |
28
+ |---|---|---:|---:|---:|---|
29
+ | Dialogue | [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) | 80 | 2.698 | 56.60% | 0.749, 0.547, 0.402 |
30
+ | Math | [GSM8K](https://github.com/openai/grade-school-math) | 200 | 3.518 | 83.93% | 0.923, 0.839, 0.756 |
31
+ | Code | [HumanEval](https://huggingface.co/datasets/openai/openai_humaneval) | 164 | 3.499 | 83.29% | 0.922, 0.832, 0.744 |
32
+ | Math | [MATH500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500) | 500 | 3.517 | 83.90% | 0.929, 0.841, 0.747 |
33
+ | Math | [AIME](https://huggingface.co/datasets/Maxwell-Jia/AIME_2024) | 30 | 3.291 | 76.36% | 0.889, 0.763, 0.638 |
34
+ | Synthetic | speed-bench (16k, low-entropy) | 64 | 2.776 | 59.21% | 0.747, 0.576, 0.453 |
35
+
36
+ ---
37
+
38
+ ## Training
39
+
40
+ **Data:** ~456,881 training conversations (the `mix2` dataset: SWE-bench-Pro, SWE-bench, OpenCodeInstruct, kimi-mtp), with **all responses regenerated by MiniMax-M3** — preserving the target's reasoning traces and MiniMax-M3 chat formatting.
41
+
42
+ **Method:** EAGLE3 TTT, `ttt_length=7`, `max_seq_length=32 768`, AdamW at `lr=1 × 10⁻⁴` (cosine decay to 0, 2 % warmup, `max_grad_norm=1.0`), bf16 + gradient checkpointing, FlexAttention, 1 epoch (~14,277 steps). Trained on **5 × GB300 nodes** (2 nodes FSDP2 draft training, dp=8, global batch 32 + 3 nodes vLLM TP=4 target inference). EAGLE3 aux hidden states from target layers (2, 30, 57) + the final layer. Embedding / LM head / final norm are shared from the target (M3 is a VL model, so these live under the `language_model.*` prefix).
43
+
44
+ **Core training command** — `torchspec.train_entry` spawns the FSDP2 trainer and vLLM inference engines as decoupled Ray actors, streaming hidden states through Mooncake:
45
+
46
+ ```bash
47
+ python3 -m torchspec.train_entry \
48
+ --config configs/vllm_minimax_m3_mix2.yaml \
49
+ model.draft_model_config=configs/draft_models/minimax_m3_eagle3.json \
50
+ training.training_num_nodes=2 \
51
+ training.training_num_gpus_per_node=4 \
52
+ inference.inference_num_gpus=12 \
53
+ inference.inference_num_gpus_per_engine=4 \
54
+ inference.vllm.tp_size=4
55
+ ```
56
+
57
+ Draft architecture, TTT depth, sequence length, cluster layout, and optimizer are all YAML-configurable — retargeting or scaling is a config change. See the [TorchSpec repo](https://github.com/lightseekorg/TorchSpec) for full customization instructions.
58
+
59
+ ---
60
+
61
+ ## Quick Start
62
+
63
+ ### Requirements
64
+
65
+ - NVIDIA Blackwell GPU (tested on B300), CUDA 13.0+ toolkit available.
66
+ - A vLLM build with MiniMax-M3 + EAGLE3 speculative-decoding support.
67
+
68
+ ### Launch Server (vLLM)
69
+
70
+ ```bash
71
+ vllm serve Inferact/minimax-m3-final \
72
+ --tensor-parallel-size 4 \
73
+ --gpu-memory-utilization 0.90 \
74
+ --max-model-len 65536 \
75
+ --block-size 128 \
76
+ --enforce-eager \
77
+ --no-enable-prefix-caching \
78
+ --speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'
79
+ ```
80
+
81
+ ### Run Benchmarks
82
+
83
+ ```bash
84
+ vllm-bench \
85
+ --backend openai-chat \
86
+ --base-url http://127.0.0.1:8000 \
87
+ --model Inferact/minimax-m3-final \
88
+ --dataset-name speed-bench \
89
+ --speed-bench-config throughput_16k \
90
+ --speed-bench-max-input-len 10240 \
91
+ --speed-bench-category low_entropy \
92
+ --num-warmups 5 \
93
+ --num-prompts 1000 \
94
+ --output-len 1536 \
95
+ --sweep-max-concurrency 64 \
96
+ --sweep-num-prompts-factor 1 \
97
+ --save-result
98
+ ```
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:425b923f145ea377a9436831c237d090a72ac79bf77216c7a99574f35880a057
3
  size 6527473392
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3e1f32dc942bd49bd19bce54518eeeddda48b32070ea83fb9cd5d4787c185412
3
  size 6527473392