Add pipeline tag, library metadata, and research paper link

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +25 -12
README.md CHANGED
@@ -1,15 +1,19 @@
 ---
-license: apache-2.0
-datasets:
-- HuggingFaceH4/ultrachat_200k
 base_model:
 - Qwen/Qwen2.5-32B-Instruct
+datasets:
+- HuggingFaceH4/ultrachat_200k
+license: apache-2.0
+pipeline_tag: text-generation
+library_name: vllm
 ---
+
 # Qwen2.5-32B-Instruct_EAGLE3_UltraChat
 
+This model is an EAGLE-3 drafter for [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), introduced in the paper [Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs](https://huggingface.co/papers/2512.20573).
+
 ### Introduction
-**Qwen2.5-32B-Instruct_EAGLE3_UltraChat** is trained based on the open-source Qwen2.5-32B-Instruct model using the [SpecForge](https://github.com/sgl-project/SpecForge) framework,
-and can be used for the Eagle-3 speculative decoding algorithm to speed up the inference of large language models during the decoding stage.
+**Qwen2.5-32B-Instruct_EAGLE3_UltraChat** is trained based on the open-source Qwen2.5-32B-Instruct model using the [SpecForge](https://github.com/sgl-project/SpecForge) framework, and can be used for the Eagle-3 speculative decoding algorithm to speed up the inference of large language models during the decoding stage.
 
 
 ### Training Configuration
@@ -17,8 +21,8 @@ We adopted the default training hyperparameters in SpecForge and trained EAGLE-3
 
 This model checkpoint is obtained after five epochs of training ($\sim$260k training steps with bs=4). We find that even though further training improves training-time accuracy, it has a negligible impact on the end-to-end speedup of EAGLE-3.
 
-- Dataset: Utilized the UltraChat-200K dataset.
-- Training environment: The training was conducted on 4 NVIDIA H100 GPUs with 80 GB VRAM each, leveraging the DeepSpeed framework. Each training epoch took approximately 3.5 hours.
+- **Dataset**: Utilized the [UltraChat-200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset.
+- **Training environment**: The training was conducted on 4 NVIDIA H100 GPUs with 80 GB VRAM each, leveraging the DeepSpeed framework. Each training epoch took approximately 3.5 hours.
 
 ### Model Inference Launch Command
 
@@ -27,7 +31,7 @@ To launch the EAGLE-3 algorithm service using vLLM, here is the instruction:
 vllm serve Qwen/Qwen2.5-32B-Instruct \
 --dtype auto -tp 2 --max_model_len 2048 \
 --gpu-memory-utilization 0.8 --port 30000 \
---speculative_config '{"model": "/PATH/TO/EAGLE/WEIGHTS", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 5, "method": "eagle3"}'
+--speculative_config '{"model": "ruipeterpan/Qwen2.5-32B-Instruct_EAGLE3_UltraChat", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 5, "method": "eagle3"}'
 ```
 
 To launch vanilla decoding, our performance baseline, here is the instruction:
@@ -48,10 +52,19 @@ We run our evaluations on two NVIDIA A6000-48GB GPUs connected via PCIe 4.0 x16.
 | **Qwen2.5-7B-Instruct** | 2.19x | 2.05x | 2.02x | 1.78x | 2.25x | **2.06x** |
 
 
-### Relevant Link
+### Citation
 
-Qwen2.5-32B-Instruct Open-source Weights: https://huggingface.co/Qwen/Qwen2.5-32B-Instruct
+```bibtex
+@article{pan2025failfast,
+  title={Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs},
+  author={Pan, Rui and Chen, Zhuofu and Liu, Hongyi and Krishnamurthy, Arvind and Netravali, Ravi},
+  journal={arXiv preprint arXiv:2512.20573},
+  year={2025}
+}
+```
 
-"Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs" [arXiv '25]: https://arxiv.org/pdf/2512.20573
+### Relevant Links
 
-Artifact of FailFast: https://github.com/ruipeterpan/failfast
+- **GitHub Repository**: [ruipeterpan/failfast](https://github.com/ruipeterpan/failfast)
+- **Paper**: [arXiv:2512.20573](https://arxiv.org/abs/2512.20573)
+- **Qwen2.5-32B-Instruct Open-source Weights**: [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)
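Once the `vllm serve` command in the diff above is running, the server can be queried through vLLM's OpenAI-compatible chat completions endpoint; EAGLE-3 speculation is transparent to clients, so the request looks the same as one sent to the vanilla baseline server. A minimal sketch, assuming the host and port from the launch command above and the standard OpenAI response schema:

```python
import json
import urllib.request

# Host/port assumed from the `vllm serve ... --port 30000` command above.
BASE_URL = "http://localhost:30000/v1/chat/completions"

# Requests name the target model, not the EAGLE-3 drafter; the drafter is
# configured server-side via --speculative_config.
payload = {
    "model": "Qwen/Qwen2.5-32B-Instruct",
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.0,
}

def query(url: str = BASE_URL) -> str:
    """Send one chat completion request and return the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(query())
```

Because speculative decoding preserves the target model's output distribution, this request should produce the same text against the EAGLE-3 server and the vanilla baseline server (with `temperature: 0.0`); only the latency differs.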