Add pipeline tag, library metadata, and research paper link
#1
by nielsr (HF Staff) - opened

README.md CHANGED
````diff
@@ -1,15 +1,19 @@
 ---
-license: apache-2.0
-datasets:
-- HuggingFaceH4/ultrachat_200k
 base_model:
 - Qwen/Qwen2.5-32B-Instruct
+datasets:
+- HuggingFaceH4/ultrachat_200k
+license: apache-2.0
+pipeline_tag: text-generation
+library_name: vllm
 ---
+
 # Qwen2.5-32B-Instruct_EAGLE3_UltraChat
 
+This model is an EAGLE-3 drafter for [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), introduced in the paper [Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs](https://huggingface.co/papers/2512.20573).
+
 ### Introduction
-**Qwen2.5-32B-Instruct_EAGLE3_UltraChat** is trained based on the open-source Qwen2.5-32B-Instruct model using the [SpecForge](https://github.com/sgl-project/SpecForge) framework,
-and can be used for the Eagle-3 speculative decoding algorithm to speed up the inference of large language models during the decoding stage.
+**Qwen2.5-32B-Instruct_EAGLE3_UltraChat** is trained based on the open-source Qwen2.5-32B-Instruct model using the [SpecForge](https://github.com/sgl-project/SpecForge) framework, and can be used for the Eagle-3 speculative decoding algorithm to speed up the inference of large language models during the decoding stage.
 
 
 ### Training Configuration
@@ -17,8 +21,8 @@ We adopted the default training hyperparameters in SpecForge and trained EAGLE-3
 
 This model checkpoint is obtained after five epochs of training ($\sim$260k training steps with bs=4). We find that even though further training improves training-time accuracy, it has a negligible impact on the end-to-end speedup of EAGLE-3.
 
-- Dataset: Utilized the UltraChat-200K dataset.
-- Training environment: The training was conducted on 4 NVIDIA H100 GPUs with 80 GB VRAM each, leveraging the DeepSpeed framework. Each training epoch took approximately 3.5 hours.
+- **Dataset**: Utilized the [UltraChat-200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset.
+- **Training environment**: The training was conducted on 4 NVIDIA H100 GPUs with 80 GB VRAM each, leveraging the DeepSpeed framework. Each training epoch took approximately 3.5 hours.
 
 ### Model Inference Launch Command
 
@@ -27,7 +31,7 @@ To launch the EAGLE-3 algorithm service using vLLM, here is the instruction:
 vllm serve Qwen/Qwen2.5-32B-Instruct \
 --dtype auto -tp 2 --max_model_len 2048 \
 --gpu-memory-utilization 0.8 --port 30000 \
---speculative_config '{"model": "/
+--speculative_config '{"model": "ruipeterpan/Qwen2.5-32B-Instruct_EAGLE3_UltraChat", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 5, "method": "eagle3"}'
 ```
 
 To launch vanilla decoding, our performance baseline, here is the instruction:
@@ -48,10 +52,19 @@ We run our evaluations on two NVIDIA A6000-48GB GPUs connected via PCIe 4.0 x16.
 | **Qwen2.5-7B-Instruct** | 2.19x | 2.05x | 2.02x | 1.78x | 2.25x | **2.06x** |
 
 
-### 
+### Citation
 
-
+```bibtex
+@article{pan2025failfast,
+  title={Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs},
+  author={Pan, Rui and Chen, Zhuofu and Liu, Hongyi and Krishnamurthy, Arvind and Netravali, Ravi},
+  journal={arXiv preprint arXiv:2512.20573},
+  year={2025}
+}
+```
 
-
+### Relevant Links
 
-
+- **GitHub Repository**: [ruipeterpan/failfast](https://github.com/ruipeterpan/failfast)
+- **Paper**: [arXiv:2512.20573](https://arxiv.org/abs/2512.20573)
+- **Qwen2.5-32B-Instruct Open-source Weights**: [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)
````